
EMPOWERING SCIENTIFIC DISCOVERY BY DISTRIBUTED DATA MINING ON A GRID INFRASTRUCTURE

A PROPOSAL FOR DOCTORAL RESEARCH

by Haimonti Dutta

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY AT UNIVERSITY OF MARYLAND BALTIMORE COUNTY 1000 HILLTOP CIRCLE, BALTIMORE, MD, 21250 JULY 2006

Table of Contents
Abstract

1 Introduction
  1.1 Motivation
  1.2 Proposed Research
  1.3 Objectives

2 Background
  2.1 The Grid
    2.1.1 Introduction
    2.1.2 The Grid Architecture
    2.1.3 Classification of Grids
  2.2 The Data Grid
    2.2.1 Introduction
    2.2.2 Data Distribution Scenarios
    2.2.3 Middleware, Protocols and Services
    2.2.4 Data Mining on the Grid
  2.3 Distributed Data Mining
    2.3.1 Introduction
    2.3.2 Classification
    2.3.3 Clustering
    2.3.4 Distributed Data Stream Mining
  2.4 The Challenges

3 Preliminary Work
  3.1 Introduction
  3.2 Orthogonal Decision Trees
    3.2.1 Decision Trees and the Fourier Representation
    3.2.2 Computing the Fourier Transform of a Decision Tree
    3.2.3 Construction of a Decision Tree from Fourier Spectrum
    3.2.4 Removing Redundancies from Ensembles
    3.2.5 Experimental Results
  3.3 DDM on Data Streams
    3.3.1 Introduction
    3.3.2 Experimental Results
    3.3.3 Monitoring in Resource Constrained Environments
    3.3.4 Grid Based Physiological Data Stream Monitoring - A Dream or Reality?
  3.4 DDM on Federated Databases
    3.4.1 The National Virtual Observatory
    3.4.2 Data Analysis Problem: Analyzing Distributed Virtual Catalogs
    3.4.3 The DEMAC System
    3.4.4 WS-DDM: DDM for Heterogeneously Distributed Sky-Surveys
    3.4.5 WS-CM: Cross-Matching for Heterogeneously Distributed Sky-Surveys
    3.4.6 DDM Algorithms: Definitions and Notation
    3.4.7 Virtual Catalog Principal Component Analysis
    3.4.8 Case Study: Finding Galactic Fundamental Planes
    3.4.9 Summary

4 Future Work
  4.1 The DEMAC System - Further Explorations
    4.1.1 Grid-enabling DEMAC
    4.1.2 PCA-based Outlier Detection on DEMAC
  4.2 Proposed Plan of Research

Bibliography


Abstract
The grid-based computing paradigm has attracted much attention in recent years. The sharing of distributed computing resources (such as software, hardware, data and sensors) is an important aspect of grid computing. Computational Grids focus on methods for handling compute-intensive tasks, while Data Grids are geared towards data-intensive computing. Grid-based computing has been put to use in several application areas including astronomy, chemistry, engineering, climate studies, geology, oceanography, ecology, physics, biology, health sciences and computer science. For example, in the field of biomedical informatics, researchers are building an infrastructure of networked high-performance computers, data integration standards, and other emerging technologies to pave the way for medical researchers to transform the way diseases are treated. In oceanography, efforts are being made to federate ocean observatories into an integrated knowledge grid. Breakthroughs in telescope, detector, and computer technology allow astronomical surveys to produce terabytes of images and catalogs, producing a data avalanche. However, extracting meaningful knowledge from these gigantic, geographically distributed, heterogeneous data repositories requires the development of suitable architectures, sophisticated data mining algorithms and efficient schemes for communication. This proposal considers research in grid-based distributed data mining. It aims to bring together the relatively new research areas of distributed data mining and data mining on the grid. While architectures for data mining on the grid have already been proposed, we argue that the inherently distributed, heterogeneous nature of the grid calls for distributed data mining. Consequently, research should be geared towards the development of distributed schema integration, query processing, algorithm development and workflow management.
As a proof of concept, we first explore the feasibility of executing distributed data mining algorithms on astronomy catalogs obtained from two different sky surveys, the Sloan Digital Sky Survey (SDSS) and the Two Micron All Sky Survey (2MASS). In particular, we examine a technique for cross-matching indices of different catalogs, thereby aligning them, use a randomized distributed algorithm for principal component analysis, and propose to develop an outlier detection algorithm based on a similar technique. While this serves as a proof of concept, efforts are under way to grid-enable the application. This requires research on service-oriented architectural paradigms to support distributed data mining on the grid. The data repositories ported to the grid are not all static. Streaming data from web click streams, network intrusion detection applications, sensor networks, wearable devices and multimedia applications are also finding their way onto the grid. This is particularly useful because researchers then do not need to set up or own mobile devices or expensive equipment such as telescopes and satellites, but can access interesting data streams published on the grid. However, in order to discover meaningful knowledge from these distributed, heterogeneous streams, efforts have to be made to build new architectures and algorithms to support distributed data streams. We propose to address these issues to enable data stream mining on the grid.


Chapter 1

Introduction
1.1 Motivation
Advances in science have been guided by the analysis of data. For example, huge genome sequences [132] available online motivate collaborative research in biology; catalogs of sky surveys [242, 3] enable astronomers to answer queries that might otherwise have taken years of observation; high-resolution, long-duration simulation data from experiments and models enables research in climatology, physics, geosciences and chemistry [197, 115, 225]; and advanced imaging capabilities such as Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) scans produce large volumes of data for medical professionals [29]. However, as pointed out by Ian Foster, converting data into scientific discoveries requires "connecting data with people and computers". It involves:

1. Finding the data of interest.
2. Moving the data to desired locations.
3. Managing large-scale computations.
4. Scheduling resources on data.
5. Managing who can access the data, and when.

For example, the main goal of CERN, the European Organization for Nuclear Research in Geneva, Switzerland, is to study the fundamental structure of matter and the interaction of forces. In particular, subatomic particles are accelerated to nearly the speed of light and then collided. Such collisions are called events and are measured at time intervals of only 25 nanoseconds in four different particle detectors of the Large Hadron Collider (LHC), CERN's next-generation accelerator, which started data collection in 2006. According to the MONARC Project1, each of the four main experiments will produce around 1 Petabyte of data a year over a life span of about two
1 MONARC: Models of Networked Analysis at Regional Centers for LHC experiments, http://monarc.web.cern.ch/MONARC/

decades. This data needs to be analysed by about 5,000 physicists around the world. Since CERN experiments are collaborations of over a thousand physicists from many different universities and institutes, the experiments' data is not only stored locally at CERN but is distributed world-wide in so-called Regional Centres (RCs), in national institutes and universities. Thus, complex distributed computing infrastructures motivate the need for Grid environments. To extract meaningful information from distributed, heterogeneous data repositories on the grid, sophisticated knowledge discovery architectures have to be designed. Data mining on the grid is still a relatively new area of research ([40, 43, 46, 37, 220, 264, 233]). While several architectures have been developed for this purpose, the framework of distributed data mining on the grid infrastructure still has a long way to go. The aim of this proposal is to motivate research in this direction.

1.2 Proposed Research


There has been growing interest in grid computing in recent years (see section 2.1). Grids can be classified into two main categories: (1) Computational Grids, designed to meet the increasing demand of compute-intensive science, and (2) Data Grids, designed to meet the needs of data-intensive applications. In this proposal, the primary focus is on Data Grids. The objective of setting up a Data Grid is to encapsulate the underlying mechanisms of storage, querying and transfer of data. Thus a user of the grid need not be concerned with the underlying mechanisms of data storage, authentication, authorization, resource management and security, but can still enjoy the benefits of large-scale distributed computing. Several protocols, services and middleware architectures have been proposed for the storage, integration and querying of data on the grid. Of particular interest is the Open Grid Services Architecture - Data Access and Integration (OGSA-DAI) project [205], which was conceived by the UK Database Task Force and works closely with the Database Access and Integration Services Working Group (DAIS-WG) of the Global Grid Forum (GGF) and the Globus team. Their aim is to develop a service-based architecture for data access and integration. Several other projects such as Knowledge Grid [40], Grid Miner [220], Discovery Net [264], TeraGrid [257], ADaM (Algorithm Development and Mining) [233] on NASA's Information Power Grid, and the DataCutter project [191] have focused on the creation of middleware / systems for data mining and knowledge discovery on top of the Data Grid. Motivated by this research, we propose to develop service-based architectures for distributed data mining on the grid infrastructure. We have developed a system for distributed data mining on astronomy catalogs (see section 3.4) using resources from the National Virtual Observatory.
The system demonstrates how distributed data mining algorithms can be designed on top of heterogeneous astronomy catalogs without downloading them onto a centralized server. In particular, we examine a randomized algorithm for distributed principal component analysis and provide experimental results to show that this algorithm replicates results obtained in the centralized setting at a lower communication cost. Encouraged by these results, we also plan to develop a distributed outlier detection algorithm for astronomy catalogs. We also propose to develop a service-based architecture for distributed stream mining on the grid. Distributed data streams obtained from network intrusion detection applications, sensor networks, vehicle monitoring systems and web click streams are being ported onto the grid. Mining these inherently distributed, heterogeneous streams on the grid requires the development of new architectures and algorithms, since existing architectures such as those described in section 2.2.4 may not be suited for streaming data. Thus, the overall focus of our attention is on developing a synergy between distributed data mining and grid-based data mining. We propose to develop service-based architectures for grid-based distributed data mining, relying on application scenarios from astronomy.
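To make the communication saving of such a randomized distributed PCA concrete, the random-projection idea behind it can be sketched as follows. This is a minimal illustration, not the exact algorithm used in section 3.4: two sites hold different attributes (columns) of the same objects, compute their covariance blocks locally, and estimate the cross-covariance block by exchanging only k-dimensional random projections (with k much smaller than the number of objects n) generated from a shared seed. All array sizes and variable names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "virtual catalog": n objects, vertically partitioned so that
# site 1 observes attributes X1 and site 2 observes attributes X2.
n, d1, d2, k = 5000, 3, 4, 500
latent = rng.normal(size=(n, 2))
X1 = latent @ rng.normal(size=(2, d1)) + 0.1 * rng.normal(size=(n, d1))
X2 = latent @ rng.normal(size=(2, d2)) + 0.1 * rng.normal(size=(n, d2))

# Each site centers its own columns locally.
X1c = X1 - X1.mean(axis=0)
X2c = X2 - X2.mean(axis=0)

# Diagonal blocks of the global covariance are computed locally, for free.
C11 = X1c.T @ X1c / n
C22 = X2c.T @ X2c / n

# Cross-covariance block: instead of shipping n-row data, each site sends
# a k-column random projection generated from a shared PRNG seed.
R = np.random.default_rng(42).normal(size=(n, k))   # shared seed
P1, P2 = X1c.T @ R, X2c.T @ R                       # d1 x k and d2 x k
C12_est = (P1 @ P2.T) / (k * n)  # unbiased estimate of X1c.T @ X2c / n

# Assemble the estimated global covariance and take its eigenvalues.
C_est = np.block([[C11, C12_est], [C12_est.T, C22]])
evals_est = np.linalg.eigvalsh(C_est)

# Compare against the centralized answer.
Xc = np.hstack([X1c, X2c])
evals_true = np.linalg.eigvalsh(Xc.T @ Xc / n)
err = abs(evals_est[-1] - evals_true[-1]) / evals_true[-1]
print(f"relative error in top eigenvalue: {err:.3f}")
```

Each site transmits only a d_i x k matrix instead of its full n-row data, so communication shrinks by roughly a factor of n / k, at the price of a sampling error of order 1 / sqrt(k) in the cross-covariance block.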

1.3 Objectives
The main objectives of the proposed research are as follows:

1. Develop a service-oriented architecture for enabling distributed data mining on the Grid. This includes (a) development of services for distributed schema integration, integration of indices and query processing, and (b) development of distributed workflows for service composition.

2. Develop a prototype system implementing the above architecture. The application area is astrophysics, and the objectives of building the system are as follows: (a) access and integrate federated astronomy databases using the Open Grid Services Architecture - Data Access and Integration (OGSA-DAI) [205] middleware as a starting point; extending OGSA-DAI to incorporate schema integration and workflow composition are future objectives; (b) perform distributed data mining on these repositories, including dimension reduction by Principal Component Analysis (PCA), classification and outlier detection; (c) provide a client-side browser that enables astrophysicists to perform distributed data mining on the federated databases without having to manage the resource allocation, authorization, authentication and communication of grid resources in detail.

3. Develop a service-oriented architecture for distributed data stream mining on the Grid.

The remainder of this proposal is organized as follows. Chapter 2 offers an overview of the grid infrastructure, with emphasis on the Data Grid, existing architectures for data mining and the integration of streams on the grid, and identifies the challenges for distributed data mining on the grid. Chapter 3 presents our preliminary work on distributed classification using Orthogonal Decision Trees (ODTs) and shows the applicability of ODTs

in streaming, resource-constrained devices. It also presents a feasibility study for distributed scientific data mining on astronomy catalogs. Chapter 4 outlines the directions for future research.

Chapter 2

Background
2.1 The Grid
2.1.1 Introduction
The science of the 21st century requires large amounts of computation power, storage capacity and high-speed communication [124, 99]. These requirements are increasing at an exponential rate, and scientists are demanding much more than is available today. Several astronomy and physical science projects such as CERN's1 Large Hadron Collider (LHC) [170], the Sloan Digital Sky Survey (SDSS) [242] and the Two Micron All Sky Survey (2MASS) [3], bioinformatics projects including the Human Genome Project [132] and gene and protein archives [216, 251], and meteorological and environmental surveys [197, 239] are already producing petabytes and terabytes of data which need to be stored, analyzed, queried and transferred to other sites. To work with collaborators at different geographical locations on petascale data sets, researchers require communication on the order of Gigabits per second. Thus computing resources are failing to keep up with the challenges they face. The concept of the "Grid" has been envisioned to provide a solution to these increasing demands and offer a shared, distributed computing infrastructure. In an early article [99] that motivates the need for Grid computing, Ian Foster describes the Grid "vision": "...to put in place a new international scientific infrastructure with tools that, together, can meet the challenging demands of 21st-century science." Today, much of this dream has become reality, with numerous research projects working on different aspects of grid computing, including development of the core technologies and the deployment and application of grid technology to different scientific domains2. In the following sections, we briefly review the grid architecture and provide a classification of different types of grids. It must be noted that the objective of this proposal is not to provide a detailed overview of grid computing and related issues, but
1 Conseil Européen pour la Recherche Nucléaire - European Organization for Nuclear Research
2 A list of applications in different scientific domains using grid technology can be found at http://www.globus.org/alliance/publications/papers.php

to introduce the concept of mining and knowledge discovery on a data grid (introduced later in section 2.2.1). Consequently, a reader interested in grid computing should refer to [101] for a detailed overview.

2.1.2 The Grid Architecture

Figure 2.1: The hourglass model

The sharing of distributed computing resources, including software, hardware, data and sensors, is an important aspect of grid computing. Sharing can be dynamic depending on the current need, need not be limited to client-server architectures, and the same resources can be used in different ways depending on the objective of sharing. These characteristics and requirements for resource sharing necessitate the formation of Virtual Organizations (VOs) [135]. Thus "VOs enable disparate groups of organizations and / or individuals to share resources in a controlled fashion, so that members may collaborate to achieve a shared goal." [135] An example of a virtual organization is the International Virtual Data Grid Laboratory (iVDGL) [140], an NSF-funded project that aims to share computing resources for experiments in high-energy physics [170], gravitational wave searches (LIGO) [173] and astronomy [242]. The architecture for grid computing, henceforth referred to as the grid architecture, is a protocol architecture that outlines how the users of a virtual organization interact with

Figure 2.2: The Grid Protocol Architecture

one another for resource sharing. Proposed by Ian Foster and Carl Kesselman [135], the grid architecture follows the principles of an "hourglass model". Figure 2.1, obtained from [99], illustrates this architecture. The "narrow neck" of the hourglass defines the core set of protocols and abstractions, the top contains the high-level behaviors and the base contains the underlying infrastructure. The Grid protocol architecture comprises several layers (illustrated in Figure 2.2): the Fabric layer, responsible for local, resource-specific operations; the Connectivity layer, which manages network connections; the Resource layer, containing protocols for sharing single resources; the Collective layer, for coordination among underlying resources; and the Applications layer, containing user applications. These layers provide the basic protocols and services that are necessary for the sharing of resources among different groups in a virtual organization. Thus, if the user application is a data mining scenario, the fabric layer would contain the participating computers with their data repositories; the connectivity layer would comprise service discovery, authorization and authentication services and communication; the resource layer would provide access to computation and data; and the collective layer would handle resource discovery, system monitoring and other application-specific requirements. A further enhancement to the protocol-based grid architecture was the Open Grid Services Architecture (OGSA), proposed in [134]. OGSA introduces the concept of Grid Services, which can be regarded as specialized web services that contain interfaces for discovery, dynamic service creation, lifetime management, etc., and conform to the Web Services Description Language (WSDL) specifications. Various VO structures can be configured using the grid services interfaces for creation, registration and discovery.
Thus, the use of grid services provides a way to virtualize components in a grid environment and ensures abstraction across different layers of the architecture. The implementation of the OGSA architecture can be found in the current release of

the Globus Toolkit 4.03 [133]. In this section we briefly reviewed the grid protocol architecture and the web-services-based Open Grid Services Architecture (OGSA). The following subsection discusses methods to classify grids.

2.1.3 Classication of Grids


Grids may be classified based on different criteria, such as the kind of services provided, the class of problems they address or the community of users [127]. However, a common method of discrimination depends on whether they offer computational power (Computational Grids) or data storage (Data Grids).

1. Computational Grid: The computational grid has been defined as "... a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities." [100] Computational grids are designed to meet the increased need for computational power by large-scale pooling of resources such as compute cycles, idle CPU time between machines and software services. An example where a computational grid could be put to use is a health maintenance organization of a metropolitan area, requiring collaboration between medical personnel, patients, health insurance representatives, financial experts and administrative personnel. The resources to be shared include high-end compute servers, hundreds of workstations, patient databases, medical imaging archives and medical instrumentation (such as Computed Tomography (CT) scan, Magnetic Resonance Imaging (MRI), ElectroCardioGram (ECG) and ultrasonography equipment). The formation of a computational grid enables computer-aided diagnosis by utilizing information from different medical disciplines, and facilitates cross-domain medical research, searching of imaging archives, enhanced recommendation schemes for health insurance facilities, and detection of fraud in financial data (such as hospital bills and insurance claims).

2. Data Grid: It is primarily geared towards the management of data-intensive applications [64, 17] and focuses on the synthesis of knowledge discovered from geographically distributed data repositories, digital libraries and archives.
An example of a data grid would be a collaboration of astronomy sky surveys, such as the Sloan Digital Sky Survey [242] and the Two Micron All Sky Survey [3], which are producing large volumes of astronomical data. The purpose of forming such a data grid would be to enhance astronomy and astrophysical research by making use of distributed data mining and knowledge discovery techniques. In this proposal we are mainly concerned with data grids, and focus on how efficient distributed algorithms can be designed on top of them. The next section offers an overview of the architecture of the data grid, some existing data grids, and the efforts made towards the implementation of data mining and knowledge discovery services on the data grid.
3 http://www.globus.org/toolkit/

2.2 The Data Grid


2.2.1 Introduction
In many scientific domains, such as astronomy, high-energy physics, climatology, computational genomics, medicine and engineering, large repositories of geographically distributed data are being generated ([265], [13], [14], [267], [192], [136], [228]). Researchers needing to access the data may come from different communities and are often themselves geographically distributed. The use of these repositories as community resources has motivated the need for developing an infrastructure providing storage and replica management facilities, efficient query execution techniques, data transfer schemes, caching and networking. The Data Grid [64] has emerged to provide an architecture for the distributed management and storage of large scientific data sets. The objectives of such an architecture are:

1. To provide a framework in which the low-level mechanisms of storage, data transfer, etc. are well encapsulated;

2. To allow design issues that can lead to significant performance implications to be manipulated by a user;

3. To be compatible with a Grid infrastructure and benefit from the grid's facilities for authentication, resource management, security and a uniform information infrastructure.

Figure 2.3: The Data Grid Architecture

Figure 2.34 illustrates the basic components of the Data Grid as envisioned by Chervenak et al. [64]. The core grid services are utilized to provide basic mechanisms
4 This figure has been adapted from [64] with slight modifications.

of security, resource management, etc. The high-level components (such as replica management and replica selection) can be built on top of these basic grid components. Their work treats data access and metadata access as the fundamental services necessary for a data grid architecture. Data access handles issues related to accessing, managing and transferring data to third parties, while the metadata access services are explicitly concerned with handling information about the data, such as how the data was created, how to use it and how file instances can be mapped to storage locations. Other basic grid services that can be incorporated into a data grid framework include an authorization and authentication infrastructure (such as the Grid Security Infrastructure), resource allocation schemes and performance management. It must be noted that any number of high-level components can be designed by the user using the aforementioned basic grid services. Several projects, such as GriPhyN (Grid Physics Network [115]) and the European Data Grid Project ([93]), have already implemented the data grid architecture. In order to harness the petascale data resources obtained from four data-intensive physics experiments (ATLAS and CMS [170], LIGO [173], SDSS [242]), the GriPhyN project conceptualized the ideas of Petascale Virtual Data Grids (PVDGs) and Virtual Data. Petascale Virtual Data Grids [18, 17] are aimed at serving a diverse community of scientists and researchers, enabling them to retrieve, modify and perform experiments and analyses on the data. The idea of virtual data revolves around the creation of a virtual space of data products derived from experimental data. The European Data Grid Project is also motivated by the needs of the High Energy Physics, Earth Observation and BioInformatics research communities, who need to store, access and process large volumes of data.
The proposed data grid architecture [275, 130, 75, 4] is modular in nature and has characteristics similar to those of the GriPhyN project. The subsystems of the architecture include (in order from bottom to top) Fabric Services, Grid Services, Collective Services, Grid Applications and Local Computing modules. Management of workload distribution, resource sharing and management, monitoring, fault tolerance, and providing an interface between grid services and the underlying storage are some of the functionalities handled by the Fabric Services. The Grid Services typically comprise the SQL database services, authentication and authorization, replica management and service indices5. The grid service schedulers and replica managers form the bulk of the Collective Services, while application-specific services are handled in the Grid Applications layer. The Local Computing layer resides outside the Grid infrastructure and typically consists of the desktop machines from which end users may access the data grid. In this section we discussed the motivation, features and components of the data grid architecture and two projects, GriPhyN and the European Data Grid Project, that have implemented this architecture. At this point it is interesting to discuss some of the data distribution scenarios that are commonly seen in the data grid. The next subsection introduces this topic.
5 The services that allow a large number of decentralized grid components to collaboratively work in virtual data environments.


2.2.2 Data Distribution Scenarios


In a data grid, the repositories may contain data in different formats. We discuss several different data distribution schemes here6.

1. Centralized Data Source: This is one of the simplest scenarios, since the data can be thought of as residing in a single relational database, a flat file, or as unstructured data (XML). Grid / Web services needing to access this data source can do so by using metadata to obtain the physical data locations and then making use of the relevant query languages.

2. Distributed Data Source: When the data is distributed among different sites, two different scenarios can arise.

(a) Horizontally Partitioned Data: Horizontal partitioning ensures that each site contains exactly the same set of attributes. Note that we refer to the data as horizontally partitioned with respect to a virtual global table. An example of horizontally partitioned data could be a department store chain such as Walmart, which has shops at different geographical locations. Each shop maintains information about its customers, such as name, address, telephone number and products purchased. Although the shops are geographically distributed, each database keeps track of exactly the same information about its customers.

(b) Vertically Partitioned Data: Vertical partitioning means that different attributes are observed at different sites. The matching between the tuples can be determined using a unique identifier or key that is shared among all the sites. An example of vertically partitioned data is given by the different sky surveys, such as SDSS [242], 2MASS [3], DEEP [77] and CfA [51], all observing different attributes of the same objects seen in the sky.

In either case, horizontal or vertical partitioning of data, grid services can provide a level of abstraction and encapsulation so that the user is not burdened with writing custom code to access the data.
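The two partitioning schemes can be illustrated with a toy sketch in Python. All of the tables, column names and values below are invented for the example; real cross-matching of sky-survey catalogs (section 3.4) is far more involved, but the role of the shared key is the same.

```python
# Horizontal partitioning: every site holds the SAME columns,
# different rows of the virtual global table.
store_us = [{"cust": "alice", "product": "tv"},
            {"cust": "bob",   "product": "radio"}]
store_uk = [{"cust": "carol", "product": "phone"}]
global_horizontal = store_us + store_uk          # union of rows: 3 rows

# Vertical partitioning: every site observes DIFFERENT attributes of the
# same objects; a shared key (here obj_id) aligns the tuples, which is
# what cross-matching provides for real sky surveys.
optical_site  = {1: {"ra": 10.3, "dec": -2.1},   # e.g. optical columns
                 2: {"ra": 11.0, "dec":  0.4}}
infrared_site = {1: {"j_mag": 15.2},             # e.g. infrared columns
                 2: {"j_mag": 14.8}}

# Join on the shared key to reassemble the virtual global table.
global_vertical = {
    key: {**optical_site[key], **infrared_site[key]}
    for key in optical_site.keys() & infrared_site.keys()
}

print(len(global_horizontal))   # number of rows in the horizontal union
print(global_vertical[1])       # all attributes of object 1, joined on the key
```

Horizontal partitioning reassembles the virtual global table as a union of rows; vertical partitioning reassembles it as a join on the shared key.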
Due to the different distribution schemes available, porting databases to the Grid requires the development of new protocols, services and middleware architectures. We discuss some of the relevant work in this area in the following section.

2.2.3 Middleware, Protocols and Services


As noted in section 2.2.1, the objective of developing a data grid is to encapsulate the low-level mechanisms of storage, integration and querying of data stored at different geographical locations in tables, archives, libraries and other repositories. The data grid community has to develop techniques to handle heterogeneous data repositories in an integrated way. Thus, the infrastructure should support standard protocols and services
6 It is assumed that the reader is familiar with the basic steps involved in the development of a web service, so we refrain from a detailed discussion in this context.


and build re-usable middleware components for the storage, access and integration of data. These are areas of active research, and we briefly discuss them in this section.

GridFTP: This data transfer protocol [9, 269] was designed to provide secure and efficient data movement in the Grid environment. In many applications on the data grid [257, 205, 274, 86], GridFTP is used as the protocol for data transfer between different components of the system. It is an extension of the standard FTP protocol. In addition to the standard features of FTP, GridFTP also provides Grid Security Infrastructure (GSI) and Kerberos support, third-party control of data transfer, and parallel, striped and partial data transfer. While a detailed discussion of the protocol specifications is outside the scope of this proposal, an interested reader should refer to [268] for further details.

Several other projects have developed middleware for data access and integration. A relational database middleware approach ([270]) and service-oriented approaches ([205, 69, 68]) have already been proposed. In the Spitfire project [246], within the European Data Grid project [93], grid-enabled middleware services have been used to access relational data tables. The client and the middleware service communicate using XML over GSI-enabled secure HTTP. The middleware service and the relational databases communicate using JDBC / ODBC calls. While this approach is interesting, it is limited to the model of queries and transactions [271] and requires a lot of application-dependent code to be written by the programmers themselves. This creates a lack of portability among different databases and does not provide a metadata-driven approach as advocated by the data grid architecture. In contrast to the relational database middleware approach, the service-oriented approach focuses on providing services for generic database functionalities such as querying and transactions.
This introduces a level of abstraction or encapsulation, since the service descriptions contain definitions of what functionality is available to a user without specifying how it is implemented in the underlying system. Thus a virtual database service may provide the illusion that a single database is being accessed, whereas in fact the underlying service can access several different types of data repositories. The Open Grid Services Architecture - Data Access and Integration (OGSA-DAI) project [205, 185, 161], conceived by the UK Database Task Force7, is developing a common middleware solution to be used by different organizations, allowing uniform access to data resources using a service-based architecture. The project aims to expose different types of data resources (including relational and unstructured data) to grids, allow data integration, provide a way of querying, updating, transforming and delivering data via web services, and provide metadata about the data resources to be accessed. The architecture of the OGSA-DAI infrastructure, illustrated in Figure 2.48, depends on three main types of services: 1. Data Access and Integration Service Group Registry (DAISGR): The purpose of this service is to publish and locate metadata regarding the data resources
7 OGSA-DAI is working closely with the Database Access and Integration Services - Working Group (DAIS-WG) of the Global Grid Forum (GGF), the Open Middleware Infrastructure Institute (OMII) and the Globus team
8 This Figure has been adapted from [161]


Figure 2.4: The Architecture of the OGSA-DAI Services

and other services available. Thus, clients can use the DAISGR to query the metadata of registered services and select the service that best suits their requirements. 2. Grid Data Service Factory (GDSF): It acts as an access point to data resources and allows the creation of Grid Data Services (GDS). 3. Grid Data Service (GDS): It acts as a transient access point for the data source. Clients can access data resources using the GDS. When the service container starts up, the DAISGR is invoked and instantiated. On creation, a GDSF may register as a service with the DAISGR, which enables discovery of other services and data resources using relevant metadata. The Grid Data Services are invoked at the request of clients wanting to access a particular resource. It is interesting to note that several different types of data resources are supported by OGSA-DAI, including Oracle, MySQL, DB2, SQLServer, PostgreSQL, Cloudscape, IBM Content Manager and even data streams. The infrastructure developed by the OGSA-DAI project is a popular9 data access and integration service for developing data grids and has been used in several astronomy, bioinformatics, medical research, meteorology and geo-science applications10. More recently, a service-based Distributed Query Processor (DQP) has been developed to work with OGSA-DAI. OGSA-DQP extends OGSA-DAI by incorporating two new services: (1) Grid Distributed Query Service (GDQS): which compiles, optimizes, partitions and schedules distributed query execution plans over multiple execution nodes in the Grid. (2) Grid Query Evaluation Service (GQES): which evaluates the partitions of the query execution plan assigned to it by the GDQS.
9 Grid toolkits such as Globus GT3.0 and Unicore (http://europar.upb.de/tutorials/tutorial03.html) do not have the facility of uniform data access using web services
10 A complete listing of projects using the OGSA-DAI software is available at http://www.ogsadai.org.uk/about/projects.php
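The registry, factory and transient-service interaction described above can be sketched with a small in-process example. This is an illustrative sketch only: none of the class or method names below (DAISGRegistry, create_gds, etc.) belong to the real OGSA-DAI API, and an actual deployment would expose each of these objects as a web service rather than a Python object.

```python
# Hypothetical sketch of the DAISGR -> GDSF -> GDS interaction pattern:
# a registry publishes factory metadata, a factory creates transient
# data services, and clients query the resource through the service.
class DAISGRegistry:
    """Publishes and locates metadata about registered factories."""
    def __init__(self):
        self._factories = {}

    def register(self, metadata, factory):
        self._factories[metadata] = factory

    def lookup(self, metadata):
        return self._factories[metadata]


class GridDataServiceFactory:
    """Access point to a data resource; creates transient GDS instances."""
    def __init__(self, resource):
        self._resource = resource

    def create_gds(self):
        return GridDataService(self._resource)


class GridDataService:
    """Transient access point through which a client queries the resource."""
    def __init__(self, resource):
        self._resource = resource

    def query(self, predicate):
        return [row for row in self._resource if predicate(row)]


# Typical client flow: discover a factory via the registry, instantiate
# a GDS, then query the data resource through it.
registry = DAISGRegistry()
registry.register("example-catalog", GridDataServiceFactory(
    [{"id": 1, "mag": 17.2}, {"id": 2, "mag": 21.9}]))
gds = registry.lookup("example-catalog").create_gds()
bright = gds.query(lambda row: row["mag"] < 20)
```

The point of the pattern is that the client never binds to a concrete database; it only ever holds a transient service obtained through metadata lookup.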


It is interesting to note that none of the above-mentioned projects yet meets the need for schema integration. Traditionally, in the database community, data integration [168] has been defined as "...the problem of combining data residing at different sources, and providing the user with a unified view of this data." This means that the process of data integration involves modelling the relation between individual database schemas and the global schema obtained by integrating them. Given that the grid could contain horizontally or vertically partitioned data as mentioned in section 2.2.2, unstructured data and data streams, the problem of data integration becomes a non-trivial one11. A decentralized, service-based data integration architecture for Grid databases has been proposed in the Grid Data Integration System (GDIS) [70]. It uses the middleware architecture provided by OGSA-DQP, OGSA-DAI and the Globus Toolkit. It is based on the Peer Database Management System (PDMS) [10], a P2P-based decentralized data management architecture for supporting data integration in relational databases. The basic idea is that any peer in the PDMS can contribute data, schema information, or mappings between schemas, forming an arbitrary graph of interconnected schemas. The GDIS system offers a wrapper/mediator based approach to integrate the data sources. Data storage, access and integration on the grid remain areas of active research. The existing protocols, services and middleware architectures described in this section aim to solve related problems, but the area is still open for research. As the architectures have evolved, researchers have also focused on data mining and knowledge discovery on the data grid infrastructure. The next section reviews this topic in some detail.
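The wrapper/mediator style of integration mentioned above can be illustrated with a minimal sketch, assuming the simplest possible mapping: each wrapper renames local attribute names to global ones, and the mediator presents the union as a single virtual table. All names and schemas below are invented for illustration; real PDMS/GDIS schema mappings are considerably richer.

```python
# Illustrative wrapper/mediator sketch: each wrapper translates one local
# schema into the global schema, and the mediator answers global queries
# by calling every wrapper and filtering the combined rows.
def make_wrapper(mapping, rows):
    """mapping: {global_attribute: local_attribute} for one data source."""
    def fetch():
        return [{glob: row[loc] for glob, loc in mapping.items()}
                for row in rows]
    return fetch


class Mediator:
    """Answers queries against the global schema via the wrappers."""
    def __init__(self, wrappers):
        self._wrappers = wrappers

    def select(self, predicate):
        return [row for wrapper in self._wrappers for row in wrapper()
                if predicate(row)]


# Two sites store the same kind of data under different local schemas.
site_a = make_wrapper({"ra": "alpha", "mag": "brightness"},
                      [{"alpha": 10.5, "brightness": 18.0}])
site_b = make_wrapper({"ra": "right_ascension", "mag": "m"},
                      [{"right_ascension": 11.2, "m": 22.5}])
mediator = Mediator([site_a, site_b])
bright = mediator.select(lambda r: r["mag"] < 20)
```

A query phrased once against the global attributes (ra, mag) runs unchanged over both heterogeneous sources, which is the essence of the unified view that data integration aims for.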

2.2.4 Data Mining on the Grid


Several research projects, including the Knowledge Grid [40, 43, 46, 37], GridMiner [220], Discovery Net [264, 1], TeraGrid [257], ADaM (Algorithm Development and Mining) [233] on NASA's Information Power Grid, and the DataCutter project [191], have focused on the creation of middleware/systems for data mining and knowledge discovery on top of the data grid. We briefly review related work in this area. 1. The Knowledge Grid: Built on top of a grid environment, it uses basic grid services such as authentication, resource management, communication and information sharing to extract useful patterns, models and trends from large data repositories. It is organized into two layers: the Core K-grid layer and the High-level K-grid layer, which is implemented on top of the core layer. The Core K-grid layer is responsible for the management of metadata describing data sources, third-party data mining tools, algorithms and visualization. It comprises two main services - the Knowledge Discovery Services (KDS) and the Resource Allocation and Execution Management (RAEM) service. The Knowledge Discovery Services
11 An example would be cross-matching heterogeneously distributed astronomy catalogs from sky surveys, described in section 3.4.5


Figure 2.5: The Knowledge Grid Architecture

extend the basic Globus monitoring services and are responsible for managing metadata regarding which data repositories are to be mined, data manipulation and pre-processing, and certain specific execution plans. The information thus managed is stored in three repositories: the Knowledge Metadata Repository (KMR), which contains metadata regarding the data, software and tools; the Knowledge Base Repository (KBR), which stores learned knowledge; and the Knowledge Execution Plan Repository (KEPR), which keeps track of the execution plans for a knowledge discovery process. The interested reader is referred to [71, 252, 45, 41, 42, 39] for further details regarding each of these repositories. The RAEM finds mappings between execution plans and resources available on the grid. The High-level K-grid layer, built on top of the core layer, mainly includes services used to compose, execute and validate specific distributed data mining operations. The main services provided by it include data access services, tools and algorithms access services, execution plan management services and a results representation service. Figure 2.5, adapted from [43], illustrates the basic components. The architecture has been implemented in a toolset named VEGA (Visual Environment for Grid Applications) [73]. It is responsible for task composition, consistency checking and generation of the execution plan. A visual interface for workflow management is an attractive feature. However, a more fundamental problem is the design of an algorithm, or the steps required, to perform distributed workflow management. As of now a distributed workflow management scheme does not exist, and this appears to be a very important need for the Grid community. It would be interesting to see if the workflow manager designed by the authors can be extended to a distributed workflow management scheme. While the authors take particular care in the description of an elaborate architecture, the distributed data mining algorithms implemented on this architecture seem limited and inadequate for the current context. They perform experiments on network intrusion data whose size was reported to be about 712 MB. This is a considerably small dataset considering that they are trying to make a case for a grid mining scenario. The data mining tasks described also appear to be fairly straightforward and do not include a real distributed algorithm requiring extensive communication or synchronization12. It would be useful to see how the current system scales with real distributed algorithms (such as clustering algorithms requiring multiple rounds of communication, or distributed association rule mining algorithms) and larger datasets stored at different geographical locations.

Figure 2.6: The GridMiner Components

2. The GridMiner: This project aims to integrate grid services, data mining and On-Line Analytical Processing (OLAP) technologies. The GridMiner-Core framework [127, 220] is built on top of the Open Grid Services Infrastructure (OGSI) and uses the services provided by it. On top of this infrastructure the following components are built: (1) GridMiner Information Service (GMIS): It is responsible for collecting and aggregating service data from all available grid services and has query interfaces for resource discovery and monitoring. (2) GridMiner Logging and Bookkeeping Service (GMLB): It collects scheduling information, resource reservations and allocations, logging and error handling. (3) GridMiner Resource Broker (GMRB): It is responsible for workload management and grid resource management. (4) Grid Data Mediation System (GDMS): This [219] is responsible for access to and manipulation of the data repositories, providing an API that abstracts the process of data access from repositories to higher-level knowledge discovery services. This system comprises several components, including: (a) GridMiner Service Factory (GMSF): provides a service creation facility (b) GridMiner Service Registry
12 A recent work [72] provides a meta-learning example, but a completely implemented system is still under development


(GMSR): provides a directory facility for the OGSA-DAI services (c) GridMiner Data Mining Service (GMDMS): provides the data mining algorithms and tools (d) GridMiner PreProcessing Service (GMPPS): encapsulates the functionality needed for pre-processing the data (e) GridMiner Presentation Service (GMPRS): provides facilities for visualization of models. (5) Replica Management: This comprises a GridMiner Replica Manager (GMRM) and a GridMiner Replica Catalog (GMRC). (6) GridMiner Orchestration Service (GMOrchS): The orchestration service is an optional component capable of aggregating a sequence of data mining operations into a job. It acts as a workflow engine that executes the steps involved in the complete data mining task (either sequentially or in parallel). It provides an easy mechanism for handling long-running jobs. Figure 2.6, adapted from [127], illustrates the components of the GridMiner architecture. A prototype application, Distributed Grid-enabled Induction of Decision Trees (DIGIDT) [128, 127], has been developed to run on the GridMiner. It is based on concepts introduced in SPRINT [244] but has a modified data partitioning scheme and workload management strategy. The interested reader is referred to [127, 220] for details of the implementation of the algorithm and the performance results presented therein. It must be noted that this algorithm closely resembles a truly distributed scenario as described in section 2.3 but appears to be capable of handling homogeneously partitioned data only. 3. Discovery Net: The Discovery Net (DNET) [1, 264] project aims to build a platform for scientific discovery from data collected by high-throughput devices. The infrastructure is being used by scientists from three different application domains: Life Sciences, Environmental Monitoring and Geo-hazard Modelling.
The DNET architecture develops High Throughput Sensing (HTS) applications by using the Kensington Discovery Platform on top of the Globus services. The knowledge discovery process is based on a workflow model. Services provided are treated as black boxes with known input and output and are then strung together into a sequence of operations. The architecture allows users to construct their own workflows and integrate data and analysis services. A unique feature is the Discovery Process Markup Language (DPML) [143], an XML-based representation of the workflows. Processes created with DPML are re-usable and can be shared as a new service on the Grid by other scientists. Abstraction of workflows and their encapsulation as new services can be achieved with the Discovery Net Deployment Tool. A workflow warehouse [143] acts as a repository of user workflows and allows querying and meta-analysis of workflows. Another interesting feature is the InfoGrid [200] infrastructure, which allows dynamic access and integration of various data sets in the workflows. Interfaces to SQL databases, OGSA-DAI sources, Oracle databases and custom-designed wrappers are built to enable data integration. The interested reader is referred to [1, 264, 200, 143] for detailed descriptions of the architecture, components and applications of DNET. 4. TeraGrid: The TeraGrid project aims to provide a "CyberInfrastructure" [258, 122] by making use of resources available at four main sites - the San Diego Supercomputer Center (SDSC), Argonne National Laboratory (ANL), Caltech and the National Center for Supercomputing Applications (NCSA). The architecture of the TeraGrid project makes use of existing Grid software technologies and builds a "virtual" system comprised of independent resources at different sites. It consists of two different layers - the basic software components (Grid services) and the application services (TeraGrid Application Services) implemented using these components. The objective is to build a knowledge grid [26] throughout the science and engineering community. For example, the Biomedical Informatics Research Network has been developed to allow researchers at geographically different locations to share and access brain image data and extract useful patterns and models from them. This enables TeraGrid to act as a knowledge grid in the biomedical informatics domain. A knowledge grid, thus conceived, is "the convergence of a comprehensive computational infrastructure along with scientific data collections and applications for routinely supporting the synthesis of knowledge from that data" [26]. In September 2004, the deployment of TeraGrid was completed, enabling access to 40 teraflops of computing power, 2 petabytes of storage, specialized data analysis and visualization schemes, and high-speed network access. 5. Algorithm Development and Mining (ADaM): The ADaM toolkit [233, 238], conceptualized on NASA's Information Power Grid, has been developed by the Information Technology and Systems Center (ITSC) at the University of Alabama in Huntsville. It consists of over 100 data mining and image processing components and is primarily designed for scientific and remote sensing data. The ADaM toolkit has been grid-enabled by making use of the Globus and Condor-G frameworks.
Several projects, including the Modeling Environment for Atmospheric Discovery (MEAD) [229] and Linked Environments for Atmospheric Discovery (LEAD) [239], have made use of the grid-enabled toolkit for data mining operations including classification, clustering, association rule mining, optimization, image processing and segmentation, shape detection and filtering schemes. In MEAD the goal is to develop a cyberinfrastructure for storm and hurricane research, allowing users to configure, model and mine simulated data, perform retrospective analysis of meteorological phenomena, and visualize large models. 6. DataCutter: The DataCutter project enables processing of scientific datasets stored in archival storage systems across a wide-area network. It has a core set of services on top of which application developers can build services on a need basis. The main design objective is to enable range queries and custom-defined aggregations and transformations on distributed subsets of data. The system is modular in nature and contains client components that interact with clients and obtain multi-dimensional range queries from them. The data access services enable low-level I/O support and provide access to archival storage systems. The indexing module allows hierarchical multidimensional indexing on datasets, including R-trees and their variants and other sophisticated spatial indexing schemes. The purpose of the filtering module is to provide an effective way of subsetting and data aggregation. The DataCutter project, however, does not provide support for extensive distributed data mining facilities. It also does not support distributed stream-based applications.

Grid-Enabled WEKA: Research has also been done to Grid-enable WEKA [273], a popular Java-based machine learning toolkit. Some of the projects working with this objective include Weka4WS [274, 86], GridWeka [231, 114], the Federated Analysis Environment for Heterogeneous Intelligent Mining (FAEHIM) [94] and WekaG [184]. We briefly summarize the contribution of each of these projects. 1. Weka4WS: The goal of this project is to support execution of data mining algorithms on remote Grid nodes by exposing the Weka library as web services using the Web Services Resource Framework (WSRF)13. The architecture of Weka4WS comprises three kinds of nodes - storage nodes, which contain the datasets to be mined; compute nodes, on which the data mining algorithms are run; and user nodes, which are the local machines of users. Local data mining tasks at a grid node are computed using the Weka library resident at that particular node, while remote computations are routed through the user nodes. The compute nodes contain web services compliant with WSRF and are therefore capable of exposing the data mining algorithms implemented in the Weka library as a service. GridFTP servers are executed on each storage node to allow data transfer. While the architecture for grid-enabling Weka used here is interesting, it appears to have some disadvantages. First, the authors have limited themselves to the use of Weka data mining algorithms only. Thus they are unable to use a truly distributed data mining algorithm, such as those described in section 2.3, and the framework appears to be running centralized data mining algorithms at different grid nodes. It is also unclear how these algorithms adapt in the case of a heterogeneous data partitioning scheme as described in section 2.2.2.
Second, the use of a GridFTP server in the storage nodes is also restrictive, since it does not allow complete flexibility in the type of data resources used14. 2. GridWeka: This is ongoing work at the University of Dublin, which aims to distribute Weka data mining algorithms (in particular Weka classifiers) over computers in an ad-hoc Grid. A client-server architecture is proposed such that all machines that are part of the Weka Grid have to implement the Weka server. The client is responsible for accepting a learning task and input data, distributing the task of learning, load balancing, monitoring fault tolerance of the system, and crash recovery mechanisms. Tasks that can be done using GridWeka include building a classifier on a remote machine, labelling a dataset using a previously built classifier, cross validation and testing. Needless to say, the client-server architecture is not ideally suited for the Grid, and there are no service-oriented schemes in place yet for making use of basic grid features such as security, resource management, etc. 3. FAEHIM: The Federated Analysis Environment for Heterogeneous Intelligent Mining (FAEHIM) project [8] aims to provide a data mining toolkit using web services and the Triana problem solving environment [255, 260]. The primary data mining activities supported include classification, clustering, association rule mining and visualization modules. The basic functionality is derived from the Weka library and converted into a set of web services. Thus the toolkit consists of a set of data mining services, tools to interact with the services, and a workflow management system to assemble the services and tools. This project appears to be very similar to the Weka4WS project mentioned above. 4. WekaG: The WekaG toolkit aims at adapting the Weka toolkit for the Grid using a client-server architecture. The server side implements the data mining algorithms, while the WekaG client is responsible for the creation of instances of grid services and acts as an interface to users. The authors describe WekaG as an implementation of a more general architecture called the Data Mining Grid Architecture (DMGA), which is geared towards coupling data sources and provides facilities for authorization, data discovery based on metadata, and planning and scheduling of resources. A prototype for this toolkit has been developed using the Apriori algorithm integrated into the Globus Toolkit 3. Future work for the project is aimed at the development of other data mining algorithms for the Grid and compatibility with the WSRF technology. Next-generation grid-based systems are moving towards P2P and Semantic Grids [218, 254, 253, 138, 44]. However, we will leave this area virtually untouched in the proposal, considering that work in this area is still in its nascent stages. Another interesting direction of research is privacy-preserving data mining on Grids. While there is a need to solve problems in this arena, little work has been done [21]. The next section explores architectures and applications of data streams on the grid.

13 http://ws.apache.org/wsrf/wsrf.html
14 For example, it is unclear how unstructured data is handled by the GridFTP scheme
Grid Data Streams: Application areas like e-science (AstroGrid [15], GridPP [113]), e-health (telemedicine [256], mobile medical monitoring [194]) and e-business (INWA [139]) produce distributed streaming data. This data needs to be analyzed, and one way to ensure that researchers have easy access to streams is to port them to the grid [230]. In this way, individuals do not need to set up mobile devices or expensive equipment such as telescopes and satellites, but can access interesting data streams published on the Grid. In the Equator Medical Devices Project [50], for example, the authors have adapted the Globus Grid Toolkit (GT3) to support remote medical monitoring applications on the Grid. Two different medical devices - the monitoring jacket and the blood glucose monitor - are made available on the grid as services. Data miners can access the data for the purpose of knowledge discovery and pattern recognition without having to go through the trouble of setting up an environment for collecting the data or even owning the equipment themselves. Several other advantages of porting data streams to the grid include sharing of data on-the-fly, easy storage of large streams and reduction in network traffic. In recent years, different architectures have been proposed for porting data streams to the grid [259, 221, 223, 222, 25, 61, 176]. We briefly review related architectures


and applications, making note of the fact that none of these architectures incorporate distributed data mining facilities on grid data streams.

Figure 2.7: The Virtual Stream Store

Plale et al. [222, 221] propose a model for bringing data streams to the grid, based on the ability of stream systems to act as a data resource. They argue that it is possible to treat each stream source as a grid service, but the approach may not scale to the entire range of data stream generation devices, from large hadron colliders in physics experiments to tiny motes in sensor networks. Thus an architecture for porting streams to the grid must cater to the needs of different types of data stream generation devices. The model proposed is based on three main assumptions: (1) data streams can be aggregated; (2) they can be accessed through database operations and query languages; (3) it is possible to access streams as grid services. The main motivation for proposing such a model comes from the fact that data streams can be viewed as indefinite sequences of time-sequenced events and can be treated as a data resource like a database. This enables querying of global snapshots of streams and the development of a virtual stream store as the architectural basis of stream resources. A virtual stream store is defined as follows: "...collection of distributed, domain-related data streams and set of computational resources capable of executing queries on the streams. The virtual stream store is accessed through a grid service. The grid service provides query access to the data streams for clients." [221] The concept of the virtual stream store is illustrated in Figure 2.7, adapted from [221]. It comprises nine data streams and computational resources. The computational resources are located very close to the streams (indicated by S in the Figure), but they could also act as stream generators. In general, it is not necessary for the generators to be a part of the virtual stream store. The model can act like a database system for data stream stores, having access to modified SQL-type query languages. This architecture has been integrated into the dQUOB [223] and Calder systems [201, 277, 202]. While it is a first step towards the integration of data streams with the grid, it has some drawbacks. It appears that this architecture does not consider heterogeneity of stream data, and it is unclear how continuous queries can deal with heterogeneous streaming data15. The Grid-based AdapTive Execution on Streams (GATES) project [61] aims to design and develop middleware for processing distributed data streams. The system is built using the Open Grid Services Architecture (OGSA) and GT3. It offers a high-level interface that allows users to specify the algorithm(s) and the steps involved in processing data streams without being concerned with resource discovery mechanisms, scheduling or allocating grid-based services. Hence the system is "self-resource-discovering". The authors also refer to the system as "self-adapting", since a high degree of accuracy is obtained in analyzing data streams by tweaking certain parameters such as sampling rates, summary structures or algorithms. The goal of the self-adaptation algorithm is to provide real-time stream processing while keeping the analysis as precise as possible. This is achieved by maintaining a queuing network model of the system. It appears to be very close to the dQUOB system [223], although this scheme has capabilities for resource discovery and adaptation in distributed environments. In StreamGlobe [230], the authors propose the processing and sharing of data streams on Grid-based P2P infrastructures. The motivation is derived from an astrophysical e-science application.
The key features of the system include: (1) publishing data and retrieving information by interactively registering peers, data streams and subscriptions; (2) sharing of existing data streams in the network, thereby providing optimization and routing facilities; (3) network traffic management capabilities by preventing overloading. This is a relatively new project, and future work in the area is geared towards providing support for subscriptions with multiple input data streams and joins. Benford et al. [25] describe their experiences in monitoring life processes in frozen lakes in the Antarctic. They deploy remote monitoring devices on lakes of interest which send data to base stations over satellite phone networks. Integration of the sensing devices into a grid infrastructure as services enables archiving of sensor measurements. The complete system consists of several components, including (1) the Antarctic sensing device deployed on the icy surface; (2) a satellite telephony network to a base computer where the raw data is pre-processed; (3) an OGSA-compliant web service that makes the sensing device and its data available on the Grid; (4) the data archived in a Grid-accessible data repository; (5) the data analysis and visualization components of interest to the Antarctic scientist. While hurdles like erroneous remote sensor readings and software and hardware failure due to extreme weather conditions are still being sorted out, this system provides a proof of concept of data analysis on streams ported to the Grid. This section emphasizes the idea that much interest has gone into developing architectures for supporting streams on the Grid. While the architectures themselves are in a nascent stage, even less research has been done to develop data mining algorithms on grid data streams. However, the need for knowledge discovery mechanisms is inevitable. Keeping this in mind, we explore the relatively new area of distributed data mining in the next section.

15 More recent work of the group includes studying the feasibility of continuous query grid services [202].

2.3 Distributed Data Mining


2.3.1 Introduction
A primary motivation for Distributed Data Mining (DDM), discussed in the literature and in this proposal, is that a lot of data is inherently distributed. Merging remote data at a central site to perform data mining would result in unnecessary communication overhead and algorithmic complexity. As pointed out in [226], "Building a monolithic database, in order to perform non-distributed data mining, may be infeasible or simply impossible" (pg 4). For example, consider the NASA Earth Observing System Data and Information System (EOSDIS), which manages data from earth science research satellites and field measurement programs. It provides data archiving, distribution and information management services and holds more than 1450 datasets that are stored and managed at many sites throughout the United States. It manages extraordinary rates and volumes of scientific data. For example, the Terra spacecraft produces 194 gigabytes (GB) per day; the data downlink runs at 150 Megabits/sec and the average data rate per orbit is 18.36 Megabits/sec16. A centralized data mining system may not be adequate in such a dynamic, distributed environment. Indeed, the resources required to transfer and merge the data at a centralized site may become impractical at such a rapid rate of data arrival. Data mining techniques that minimize communication between sites are therefore quite valuable. Simply put, DDM is data mining where the data and computation are spread over many independent sites. For some applications, the distributed setting is more natural than the centralized one because the data is inherently distributed. Typically, in a DDM environment, each site has its own data source and data mining algorithms operate on it, producing local models. Each local model represents knowledge learned from the local data source, but could lack globally meaningful knowledge. Thus the sites need to communicate by message passing over a network in order to keep track of the global information.
For example, a DDM environment could have sites representing independent organizations whose operation and data collection have nothing to do with each other and who communicate over the Internet. Typically communication is a bottleneck. Since communication is assumed to be carried out exclusively by message passing, a primary goal of many DDM methods in the literature is to minimize the number of messages sent. Some methods also attempt to load-balance across sites to prevent performance from being dominated by the time and space usage of any individual site. In the following sections we briey review DDM algorithms for classication and clustering. In a subsequent section 2.3.4 we also give an overview of stream data mining which has been receiving increasingly more attention in the last ten years. Since
16 This information has been obtained from http://spsosun.gsfc.nasa.gov/eosinfo/EOSDIS_Site/index.html


the focus of this proposal is DDM on the grid infrastructure, with emphasis on clustering, classification and data streams, we will leave many areas of DDM virtually untouched. For different perspectives on this exciting field, the reader is referred to [154], [156], [214], [278], [279].

2.3.2 Classification
Distributed classification is closely related to ensemble-based classifier learning [213]. Ensemble-based classifiers work by generating a collection of base models and combining their outputs using some pre-specified scheme. Typically, voting (weighted or unweighted) schemes are employed to combine the outputs of the base classifiers. A large volume of research reports that ensemble classifier models often perform better than any of the base classifiers used [79, 207, 23, 190, 166]. Two popular ensemble models of this kind are Bagging [34] and Boosting [103, 104, 240]. Both Bagging and Boosting build multiple classifiers from different subsets of the original training data set. However, they take substantially different approaches to sampling subsets and combining classifications. In Bagging, each training set is constructed by taking a bootstrap replicate of the original training set. This means that given a training set T containing m tuples, a new training set is constructed by uniformly sampling (with replacement) from T. Classifiers are built on the new training sets and the results aggregated by majority voting. In Boosting, a set of weights is maintained over the original training set T and adjusted after classifiers are learned using a learning algorithm. The adjustment procedure increases the weight of tuples that are mis-classified by the base learning algorithm and decreases the weight of those that are correctly classified. There are two different ways in which the weights can be used to form the new training set: boosting by sampling (tuples are drawn with replacement from the training set with probability proportional to their weight) and boosting by weighting (the learning algorithm takes a weighted training set directly). Other ensemble-based approaches include Stacking [276, 95], Random Forest [35] and the more recent Rotation Forest [232]. Stacking [276, 95] learns from pre-partitioned training sets and corresponding validation sets.
Given a set of learning algorithms, Stacking builds a classification model in two stages. During the first stage, a set of base classifiers is learned from the training partitions. In the second stage, the instances of each validation set are classified by the base classifiers learned in the previous stage; the resulting predictions, together with the correct labels, form a new meta-level training set, which is then used to create the so-called meta-classifier. Classification of an unseen data instance is likewise made in two stages: a meta-level testing instance is first formed from the base classifiers' predictions, and is then passed to the meta-classifier for the final classification. In Random Forests, a forest of classification trees is grown as follows: (1) If the training set has T tuples, sample T cases at random (with replacement) from the original data. (2) If there are M attributes, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant while the forest is grown. (3) Each tree is grown to the largest extent possible; there is no pruning. The error rate of the Random Forest depends on the correlation between any two trees in the forest and the strength of each individual tree in the forest. While Random Forests have been shown to have good classification accuracy, their major disadvantage is that the size (total number of nodes in the trees of the forest) of the model built can be very large. A more recent work, Rotation Forest [232], builds classifier ensembles based on feature selection. In order to create the training data set for a base classifier, the attribute space is randomly split into K (a parameter of the algorithm) subsets and Principal Component Analysis (PCA) is applied to each subset, retaining all the principal components. The K axis rotations create a new feature space for the base classifiers. The objective of this method is to increase the individual accuracy of the classifiers and the diversity within the ensemble. Experimental results reported in that work claim that Rotation Forest can provide better accuracy than Random Forest, Bagging and Boosting. All of the above ensemble learning schemes can be directly adapted for distributed classification: the individual sites can produce the base models from local data, and ensemble-based aggregation schemes [164, 171, 172, 217] can be used for producing the final result.
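The Bagging procedure described above (bootstrap replicates combined by unweighted majority vote) can be sketched in a few lines; the one-dimensional threshold stump used as the base learner and the toy data are purely illustrative, not part of any cited system:

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw len(data) tuples uniformly with replacement (a bootstrap replicate)."""
    return [rng.choice(data) for _ in data]

def train_stump(data):
    """Toy base learner: pick the 1-D threshold and polarity with fewest errors.
    data is a list of (x, label) pairs with label in {0, 1}."""
    best_err, best = len(data) + 1, None
    for t in sorted({x for x, _ in data}):
        for lo, hi in ((0, 1), (1, 0)):
            pred = lambda x, t=t, lo=lo, hi=hi: hi if x >= t else lo
            err = sum(pred(x) != y for x, y in data)
            if err < best_err:
                best_err, best = err, pred
    return best

def bagging(data, n_models, seed=0):
    """Train base models on bootstrap replicates; combine by majority vote."""
    rng = random.Random(seed)
    models = [train_stump(bootstrap(data, rng)) for _ in range(n_models)]
    return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]
```

Boosting differs only in how the replicates are drawn (weighted by past errors) and in how the votes are weighted.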

Figure 2.8: The Meta-Learning Framework

Homogeneously Distributed Classifiers: A slightly different ensemble learning technique for homogeneously distributed data is the meta-learning framework [55, 53, 54, 58, 59, 56, 57]. It follows three steps: (1) Build the base classifiers using learning algorithms on the local data at each site. (2) Collect the base classifiers at a central site and produce the meta-level data, using a validation set and the predictions of the base classifiers on this set. (3) Build the meta-level classifier from the meta-level data. Figure 2.8 illustrates the process. Different strategies for combining multiple predictions from classifiers in the meta-learning phase can be adopted. These include: 1. Voting: Each classifier is assigned a single vote and the majority wins. A variation of this process is Weighted Voting, where some classifiers are given preferential treatment depending on their performance on some common validation set. 2. Arbitration: The arbiter acts like a "judge" and its predictions are selected if the participating classifiers themselves cannot reach a consensus. Thus the arbiter is itself a classifier which chooses a final outcome based on the predictions of the other classifiers. 3. Combiner: The combiner makes use of knowledge about how classifiers behave with respect to one another and thereby enables meta-learning. There are several ways in which the combiner can be learnt. One way is to use the base classifiers and their outputs. Another option is to learn the combiner from data comprising training examples, correct classifications and base classifier outputs. The meta-learning framework is implemented in a system called Java Agents for Meta-learning (JAM) [247, 248]. In general, meta-learning helps improve performance by executing in parallel and provides better predictive capability by combining learners with different inductive biases, such as search space, representation scheme and search heuristics. Other meta-learning based distributed classification schemes include Distributed Learning and Knowledge Probing [118] and [111]. Heterogeneously Distributed Classifiers: The problem of learning over heterogeneously partitioned data is inherently difficult since different sites observe different attributes of the original data set.
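The plain and weighted voting combiners enumerated above can be sketched compactly; the labels and weights are illustrative (weights might, for instance, be accuracies on the common validation set):

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Combine base-classifier predictions for one instance.

    With weights=None this is plain majority voting; otherwise each
    classifier's vote counts with its weight."""
    if weights is None:
        weights = [1.0] * len(predictions)
    tally = defaultdict(float)
    for label, w in zip(predictions, weights):
        tally[label] += w
    return max(tally, key=tally.get)
```

The arbiter and combiner strategies replace this fixed rule with a learned classifier over the same base predictions.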
Traditional ensemble-based approaches generate high-variance local models and are not adept at identifying correlated features distributed over different sites. Hence the problem of learning over heterogeneously partitioned data is challenging. Park and his colleagues [210] have addressed the problem of learning from heterogeneous sites using an evolutionary approach. Their work first identifies a portion of the data that none of the local classifiers can learn with a great deal of confidence. This subset of the data is merged at the central site and a new classifier is built from it. When a new instance cannot be classified with high confidence by a combination of the local classifiers, the central classifier is used. The approach produces better results than simple aggregation of local models. However, the algorithm is sensitive to the confidence threshold selected. An algorithm to construct decision trees over heterogeneously partitioned data has been proposed by Giannella et al. [107]. The algorithm is designed using random projection based dot product estimation and a message sharing strategy. Their work assumes that each site has the same number of tuples, ordered to facilitate matching, i.e. the i-th tuple on each site corresponds to the same observation and has the same class label. The aim is to construct a decision tree using attributes from all the sites. The problem boils down to estimating the information gain offered by attributes when making splitting decisions. To reduce communication, the information gain estimation is approximated using a random projection based approach. It must be noted that the decision tree obtained from the distributed framework may not be identical to the one obtained if all the data were centralized. However, increasing the number of messages exchanged can make the distributed tree arbitrarily close to the centralized tree. Also, the distributed algorithm requires more local computation than the centralized algorithm. Thus the overall benefit of the algorithm rests on a trade-off: increased local computation for reduced communication. The work does not take into account actual communication delays in the network in the distributed setting and thus could benefit from a more detailed timing study of the centralized and distributed settings. The above problem of constructing decision trees over heterogeneously partitioned data has also been addressed in work by Caragea et al. [48, 47, 49]. However, their work focuses on producing an exact distributed learning algorithm, identical in result to decision trees constructed on centralized data. Thus it is fundamentally different from the inexact random projection based approach described in [107]. An order statistics based approach to combining classifiers generated from heterogeneous sites has been presented in [261]. The technique works by ordering the predictions of the classifiers and provides mechanisms for selecting an appropriate order statistic and forming a linear combination of them. The work provides an analytical framework for quantifying the reduction in error obtained when an order statistics based ensemble is used.
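The random-projection dot-product estimate underlying the information-gain approximation of [107] can be sketched as follows. Both sites generate the same k x n random matrix R from a shared seed and exchange only the k-dimensional projections of their local columns, exploiting E[(Rx).(Ry)/k] = x.y; the dimensions and vectors here are illustrative:

```python
import random

def project(vec, R):
    """Multiply a length-n vector by the shared k x n random matrix R."""
    return [sum(r * v for r, v in zip(row, vec)) for row in R]

def estimated_dot(x, y, k, seed=42):
    """Estimate x . y from k-dimensional random projections.

    Both sites must use the same seed so they share the matrix R; only
    the projections (length k, not n) would cross the network."""
    n = len(x)
    rng = random.Random(seed)
    R = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)]
    rx, ry = project(x, R), project(y, R)
    return sum(a * b for a, b in zip(rx, ry)) / k
```

Larger k tightens the estimate at the cost of more communication, which is exactly the trade-off discussed above.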
Their experimental results suggest that when there is significant variability among the classifiers, order statistics based approaches perform better than ordinary combiners. Collective Data Mining: The framework for Collective Data Mining (CDM) has been proposed by Kargupta and his colleagues [150]. It has its roots in the theory of communications, machine learning, statistics and distributed databases. The main objective of CDM is to ensure that the partial models produced from local data at the sites are correct and can be used as building blocks for forming the global model. The steps involved are: (1) Choose an appropriate orthonormal representation for the type of data model to be built. (2) Construct approximate orthonormal basis coefficients at each local site. (3) If required, transfer a sample of the datasets from each local site to a coordinator site and generate approximate basis coefficients corresponding to the non-linear cross terms. (4) Combine the local models and output the final model using a user-specified representation format (such as decision trees). A major contribution is the use of the CDM approach to construct decision trees from data through Fourier analysis. It has been pointed out that there are several different techniques to do this. One possibility is to use the Non-uniform Fourier Transformation (NFT) [36]. Computation of the Fourier coefficients requires all the members of the domain to be available. However, in most learning frameworks a training set is used to learn the model, which is then tested on a validation (test) set. In such a framework, estimation of the Fourier coefficients themselves is not easy. The Non-uniform Fourier Transformation provides a potential solution. If it is assumed that the class label is zero for all members of the domain that are not in the learning set, then the Fourier spectrum exactly represents the data. This is called the NFT, and it can be shown that one can exactly construct the decision tree from the NFT of the data. However, this approach has potential drawbacks. It does not guarantee a polynomial-size description or exponentially decaying magnitude of the coefficients. Also, communicating the NFT of the data may require a substantial overhead. Another approach is to estimate the Fourier spectrum of the tree directly from the data, instead of the NFT of the data. This method has several advantages: (1) Decision trees with bounded depth are generally useful for data mining. (2) The Fourier representation of bounded-depth (say d) decision trees has a polynomial number of non-zero coefficients; coefficients corresponding to partitions involving more than d features are zero. (3) If the number of defining features determines the order of a partition, then the magnitude of the Fourier coefficients decays exponentially with the order of the corresponding partition. These properties guarantee that, if there is a straightforward way to estimate the coefficients themselves, then decision trees can be built over distributed data using very little communication. An iterative approach to modelling the error incurred in approximating the Fourier coefficients obtained from non-local sites has been proposed. A more detailed analysis of this work is presented in [211]. The Fourier representation of decision trees and the procedure to reconstruct trees from the Fourier spectrum have been studied in much detail [209, 211, 212, 149, 152, 151]. Removing Redundancies from Ensembles: Existing ensemble-learning techniques work by combining (usually linearly) the outputs of the base classifiers. They do not structurally combine the classifiers themselves. As a result the base classifiers often share a lot of redundancy.
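The bounded-depth sparsity property noted above is easy to verify by brute force over a small Boolean domain. This toy sketch assumes the standard basis psi_z(x) = (-1)^(x.z) over {0,1}^n; for a depth-1 tree that splits on a single feature, only the partitions involving that feature carry non-zero coefficients:

```python
from itertools import product

def fourier_coefficients(f, n):
    """Return {z: w_z} with w_z = 2^-n * sum_x f(x) * (-1)^(x . z),
    computed by exhaustive enumeration of {0,1}^n (toy domain sizes only)."""
    domain = list(product((0, 1), repeat=n))
    coeffs = {}
    for z in domain:
        s = sum(f(x) * (-1) ** sum(a * b for a, b in zip(x, z)) for x in domain)
        coeffs[z] = s / 2 ** n
    return coeffs
```

For f(x) = x0 (a depth-1 stump) the spectrum has exactly two non-zero coefficients, matching property (2) above with d = 1.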
The Fourier representation of decision trees, referred to in the discussion above, offers a unique way to fundamentally aggregate the trees and perform further analysis to construct an efficient representation. The work on Orthogonal Decision Trees [153, 89, 119] focuses on this issue. Consider a matrix D where D_{i,j} = f_j(x_i) is the output of the j-th tree f_j for input x_i. D is an |X| x k matrix, where |X| is the size of the input domain and k is the total number of trees in the ensemble. An ensemble classifier that combines the outputs of the base classifiers can be viewed as a function defined over the set of all rows of D. If D_j denotes the j-th column of D, then the ensemble classifier can be viewed as a function of D_1, D_2, ..., D_k. When the ensemble classifier is a linear combination of the outputs of the base classifiers we have F = a_1 D_1 + a_2 D_2 + ... + a_k D_k, where F is the column matrix of the overall ensemble output. Since the base classifiers may have redundancy, it is possible to construct a compact low-dimensional representation of the matrix D. However, explicit construction and manipulation of this matrix is difficult, since most practical applications deal with a very large domain. We can try to construct an approximation of D using only the available training data. One such approximation of D and its Principal Component Analysis-based projection is reported elsewhere [190]. That technique performs PCA of the matrix D, projects the data in the representation defined by the eigenvectors of the covariance matrix of D, and then performs linear regression for computing the coefficients a_1, ..., a_k. While the approach is interesting, it has a serious limitation: the construction of an approximation of D, even for the training data, is computationally prohibitive for most large scale data mining applications. Moreover, it is only an approximation, since the matrix is computed over the observed data set rather than the entire domain. In recent work [153, 89, 119], a novel way to perform a PCA of the matrix containing the Fourier spectra of the trees has been reported. The approach works without explicitly generating the matrix D. It is important to note that the PCA-based regression scheme [190] offers a way to find the weighting for the members of the ensemble; it does not offer any way to aggregate the tree structures and construct a new representation of the ensemble, which the current approach does. Now consider a matrix W where W_{i,j} is the coefficient corresponding to the i-th member of the partition set from the spectrum of the j-th tree. It can be shown that the covariance matrices of D and W are identical [119]. Note that W is a |Z| x k dimensional matrix, where |Z| is the number of partitions with non-zero coefficients. For most practical applications |Z| << |X|, so analyzing W using techniques like PCA is significantly easier. PCA of the covariance matrix of W produces a set of eigenvectors. The eigenvalue decomposition constructs a new representation of the underlying domain. Since the eigenvectors are nothing but linear combinations of the original column vectors of W, each of them also forms a Fourier spectrum, and we can reconstruct a decision tree from each such spectrum. Moreover, since the eigenvectors are orthogonal to each other, the trees constructed from them also maintain the orthogonality condition. The analysis presented above offers a way to construct the Fourier spectra of a set of functions that are orthogonal to each other and therefore redundancy-free. These functions also define a basis and can be used to represent any given decision tree in the ensemble in the form of a linear combination. Orthogonal decision trees can be defined as an immediate extension of this framework.
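The PCA step above operates on a small coefficient matrix W whose columns are the Fourier spectra of individual trees. A generic sketch using power iteration for the dominant principal component (the matrix values are illustrative and this is not the cited implementation):

```python
def column_covariance(W):
    """Covariance matrix across the columns (trees) of W.
    W is a list of rows; each row holds one coefficient for every tree."""
    n_rows, n_cols = len(W), len(W[0])
    means = [sum(W[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    return [[sum((W[i][a] - means[a]) * (W[i][b] - means[b])
                 for i in range(n_rows)) / n_rows
             for b in range(n_cols)] for a in range(n_cols)]

def top_eigenvector(C, iters=200):
    """Power iteration for the dominant eigenpair of a symmetric matrix."""
    v = [1.0] * len(C)
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(len(C))) for i in range(len(C))]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(C[i][j] * v[j] for j in range(len(C)))
              for i in range(len(C)))
    return lam, v
```

Because the eigenvector is a linear combination of the columns of W, it is itself a Fourier spectrum from which a tree can be reconstructed, which is the key observation behind ODTs.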
We present the theoretical definitions and experimental results on the performance of Orthogonal Decision Trees (ODTs) in section 3.2. In this section we discussed several algorithms and techniques for distributed classification. The following section introduces distributed clustering.

2.3.3 Clustering
Distributed clustering algorithms can be broadly divided into two categories: (1) methods requiring multiple rounds of message passing and (2) centralized ensemble methods [38, 141]. Algorithms that fall into the first category require a significant amount of synchronization. The second category consists of methods that build local clustering models and transmit them to a central site (asynchronously). The central site forms a combined global model. These methods require only a single round of message passing and hence have modest synchronization requirements. In the next two subsections we discuss these issues in some detail. Multiple Communication Round Algorithms Kargupta et al. [157] develop a principal component analysis (PCA) based clustering technique in the CDM framework for heterogeneously distributed data. Each local site performs PCA, projects the local data along the principal components, and applies a known clustering algorithm. Having obtained these local clusters, each site
sends a small set of representative data points to a central site. This site carries out PCA on the collected data and computes global principal components. The global principal components are sent back to the local sites. Each site projects its data along the global principal components and applies its clustering algorithm. A description of the locally constructed clusters is sent to the central site, which combines the cluster descriptions using different techniques, such as nearest neighbor methods. Klusch et al. [160] consider kernel-density based clustering over homogeneously distributed data. They adopt the definition of a density based cluster from [126]: data points which can be connected by an uphill path to a local maximum, with respect to the kernel density function over the whole dataset, are deemed to be in the same cluster. Their algorithm does not find a clustering of the entire dataset. Instead each local site finds a clustering of its local data based on the kernel density function computed over all the data. In principle, their approach could be extended to produce a global clustering by transmitting the local clusterings to a central site and combining them. However, carrying out this extension in a communication-efficient manner is a non-trivial task and is not discussed by Klusch et al. Eisenhardt et al. [90] develop a distributed method for document clustering. They extend k-means with a "probe and echo" mechanism for updating the cluster centroids. Each synchronization round corresponds to a k-means iteration, and each site carries out the following algorithm at each iteration. One site initiates the process by marking itself as engaged and sending a probe message to all its neighbors. The message also contains the cluster centroids currently maintained at the initiator site. The first time a node receives a probe (from a neighbor site with centroids C), it marks itself as engaged, sends a probe message (along with C) to all its neighbors (except the origin of the probe), and updates the centroids in C using its local data, also computing a weight for each centroid based on the number of data points associated with it. If a site receives an echo from a neighbor (with centroids C' and weights W'), it merges C' and W' with its current centroids and weights. Once a site has received either a probe or an echo from all its neighbors, it sends an echo, along with its local centroids and weights, to the neighbor from which it received its first probe. When the initiator has received echoes from all its neighbors, it has centroids and weights which take into account the datasets at all sites, and the iteration terminates. Dhillon and Modha [78] develop a parallel implementation of the k-means clustering algorithm for homogeneously distributed data. A similar approach is taken by Forman and Zhang [105], who extend it to the problem of k-harmonic means. The problem of clustering on P2P networks has been addressed in recent work [2, 236]. The algorithm works as follows: each node in the P2P network is provided with a random number generator (the same for all sites) that produces the same set of initial centroid seeds when the algorithm begins. The points in the local data are first assigned to the nearest centroid. Then the centroids are updated to the dimension-wise mean of the assigned points. If there is a drastic change in the centroids (measured by a user-defined parameter) then a flag is raised indicating a change in centroids. A particular node N will poll neighboring nodes for their centroids; the choice of neighborhood is determined in two different ways - uniform sampling and immediate neighborhood. The node N computes the weighted mean of the centroids it receives together with its local centroids to produce the final set of centroids for a particular iteration. While this is


the first known P2P clustering algorithm, it appears to be an asynchronous algorithm. Moreover, it does not deal with dynamic network topology, which is common in peer-to-peer networks. A further extension of this algorithm, taking into consideration large dynamic networks, has been studied by Datta et al. [236]. It must be noted that while all the algorithms mentioned in this category require multiple rounds of message passing, [157] and [160] require only two rounds; the others require as many rounds as the algorithm iterates. Centralized Ensemble-Based Methods These algorithms typically have low synchronization requirements and potentially offer two other nice properties: (1) If the local models are much smaller than the local data, their transmission will result in excellent message load requirements. (2) Sharing only the local models may be a reasonable solution to privacy constraints in some situations [188]. A brief survey of the literature is presented below. Johnson and Kargupta [145] develop a distributed hierarchical clustering algorithm for heterogeneously distributed data. It first generates local cluster models and then combines these into a global model. At each local site, the chosen hierarchical clustering algorithm is applied to generate local dendrograms, which are then transmitted to a central site. Using statistical bounds, a global dendrogram is generated. Samatova et al. [237] develop a method for merging hierarchical clusterings from homogeneously distributed, real-valued data. Lazarevic et al. [167] consider the problem of combining spatial clusterings to produce a global regression-based classifier. They assume homogeneously distributed data and that the clustering produced at each site has the same number of clusters. Each local site computes the convex hull of each cluster and transmits the hulls to a central site, along with a regression model for each cluster. The central site averages the regression models in the overlapping regions of the hulls.
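Several of the schemes above (probe-and-echo k-means, the P2P variant, and model averaging) reduce to a weighted merge of local summaries. A minimal sketch of weighted, dimension-wise centroid merging, where each site's centroid counts in proportion to the number of local points assigned to it (function and variable names are illustrative):

```python
def merge_centroids(centroid_sets, weight_sets):
    """Merge per-site centroids: for each cluster index, take the weighted
    dimension-wise mean over sites, weighting each site's centroid by its
    local point count."""
    k = len(centroid_sets[0])        # number of clusters
    dim = len(centroid_sets[0][0])   # dimensionality
    merged = []
    for c in range(k):
        total = sum(w[c] for w in weight_sets)
        merged.append([
            sum(cs[c][d] * w[c] for cs, w in zip(centroid_sets, weight_sets)) / total
            for d in range(dim)
        ])
    return merged
```

Only the k centroids and k counts cross the network per site per round, not the raw data, which is what keeps the message load of these methods low.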
Strehl and Ghosh [250] develop methods for combining cluster ensembles in a centralized setting (they did not explicitly consider distributed data). They argue that the best overall clustering maximizes the average normalized mutual information over all clusterings in the ensemble. However, they report that finding a good approximation directly is very time-consuming. Instead they develop three more efficient algorithms which are not theoretically shown to maximize mutual information, but are empirically shown to do a decent job. Fred and Jain [102] develop a method for combining clusterings in a centralized setting. Given m clusterings of n data points, their method first constructs an n x n co-association matrix (the same as the one described in [250]). Next, a merge algorithm is applied to the matrix using a single-link, threshold-based hierarchical clustering technique: for each pair of points whose co-association entry is greater than a predefined threshold, the clusters containing these points are merged. In principle, both Strehl and Ghosh's ideas and Fred and Jain's approach can be readily adapted to heterogeneously distributed data. However, for Strehl and Ghosh's ideas to be adapted to a distributed setting, the problem of constructing an accurate centralized representation of the clusterings using few messages needs to be addressed. In order for Fred and Jain's approach to be adapted to a distributed setting, the problem of building
an accurate co-association matrix in a message-efficient manner must be addressed. Merugu and Ghosh [188] develop a method for combining generative models17 produced from homogeneously distributed data. Each site produces a generative model from its own local data. The goal is for a central site to find a global model, from a predefined family (e.g. multivariate, 10-component Gaussian mixtures), which minimizes the average Kullback-Leibler (KL) distance over all local models. They prove this to be equivalent to finding a model from the family which minimizes the KL distance from the mean model over all local models (the point-wise average of the local models). They assume that this mean model is computed at some central site. Finally, the central site computes an approximation to the optimal model using an EM-style algorithm along with Markov chain Monte Carlo sampling. They do not discuss how the centralized mean model is computed; but, since the local models are likely to be considerably smaller than the actual data, transmitting the models to a central site seems to be a reasonable approach. Januzaj et al. [144] extend a density-based centralized clustering algorithm, DBSCAN, developed by one of the authors, to a homogeneously distributed setting. Each site carries out the DBSCAN algorithm, a compact representation of each local clustering is transmitted to a central site, a global clustering representation is produced from the local representations, and finally this global representation is sent back to each site. A clustering is represented by first choosing a sample of data points from each cluster. The points are chosen such that: (i) each point has enough neighbors in its neighborhood (determined by fixed thresholds) and (ii) no two points lie in the same neighborhood. Then k-means clustering is applied to all points in the cluster, using each of the sample points as an initial centroid.
The final centroids, along with the distance to the furthest point in their k-means cluster, form the representation (a collection of point, radius pairs). The DBSCAN algorithm is applied at the central site on the union of the local representative points to form the global clustering. This algorithm requires a parameter defining a neighborhood; the authors set this parameter to the maximum of all the representation radii. Methods [144], [188], and [237] are representatives of the up-and-coming class of distributed clustering algorithms, centralized ensemble-based methods. These algorithms focus on transmitting compact representations of local clusterings to a central site, which combines them to form a global clustering representation. The key to this class of methods is the local model (clustering) representation: a good one faithfully captures the local clustering, requires few messages to transmit, and is easy to combine. Two other techniques solve a closely related but different problem (which their authors also call distributed clustering): forming clusters of distributed datasets, where each cluster is a collection of datasets, not a collection of tuples from datasets. McClean et al. [186] consider clustering a collection of data cubes. Parthasarathy and Ogihara [234] consider clustering homogeneously distributed tables. Having discussed distributed classification and clustering algorithms, we now focus our attention on distributed data stream mining. The following section introduces the topic.
17 A generative model is a weighted sum of multi-dimensional probability density functions, i.e. components.
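Fred and Jain's co-association scheme discussed above can be sketched compactly; the clusterings and threshold below are illustrative, and a union-find structure stands in for the single-link merge:

```python
def co_association(clusterings, n):
    """Fraction of the m clusterings in which each pair of points co-occurs.
    Each clustering is a length-n list assigning a cluster id to each point."""
    m = len(clusterings)
    M = [[0.0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    M[i][j] += 1.0 / m
    return M

def merge_by_threshold(M, threshold):
    """Merge points whose co-association exceeds the threshold (union-find)."""
    n = len(M)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if M[i][j] > threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

In the distributed adaptation discussed above, the open problem is building M accurately without shipping all n^2 pairwise counts to the central site.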


2.3.4 Distributed Data Stream Mining

Distributed Data Stream Mining (DDSM) is becoming an area of active research ([20, 181, 163, 27, 148]) due to the emergence of geographically distributed stream-oriented systems such as online network intrusion detection applications, sensor networks, vehicle monitoring systems, web click streams, and systems analyzing multimedia data. In these applications, data streams originating from multiple remote sources need to be monitored. A central stream processing system does not provide a good solution, since the streaming data rates may exceed the capacity of the storage, communication, and processing infrastructure [20]. Thus there arises a need for distributed data stream mining18. Several projects deal with data mining on streams, including [117, 131, 87, 97, 106]. While these are closely related, it is not clear whether all of them can be directly applied in a distributed setting. In this section we provide a brief review of current work in the field of distributed data stream mining. Babcock and Olston [20] describe a distributed top-k monitoring algorithm, designed to continuously report the k largest values from distributed data streams (top-k monitoring queries). Such queries are particularly useful in tracking atypical behavior, such as distributed denial of service attacks, exceptionally large or small values in telephone call records, auction bidding patterns and web usage statistics. The approach to solving the problem is as follows: the coordinator maintains the top-k set initially and installs arithmetic constraints at each monitor node over the partial data values. As updates occur in the distributed streams, the arithmetic constraints should always remain satisfied. If there is a conflict between the coordinator and the monitor nodes, a conflict resolution scheme is resorted to. Thus distributed communication is needed only when the constraints imposed on the system are violated.
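A drastically simplified sketch of the constraint idea (not the full adjustment-factor protocol of [20]): each monitor is allotted a fixed slack, and a message is sent to the coordinator only when a local value drifts outside it. All class and variable names here are illustrative:

```python
class Monitor:
    def __init__(self, values, slack):
        self.values = dict(values)    # object -> local partial value
        self.reported = dict(values)  # last values the coordinator saw
        self.slack = slack

    def update(self, obj, delta, coordinator):
        """Apply a local stream update; contact the coordinator only on violation."""
        self.values[obj] = self.values.get(obj, 0) + delta
        if abs(self.values[obj] - self.reported.get(obj, 0)) > self.slack:
            coordinator.report(self, obj, self.values[obj])
            self.reported[obj] = self.values[obj]

class Coordinator:
    def __init__(self, monitors, k):
        self.k = k
        self.messages = 0
        self.totals = {}
        for m in monitors:
            for obj, v in m.reported.items():
                self.totals[obj] = self.totals.get(obj, 0) + v

    def report(self, monitor, obj, new_value):
        self.messages += 1
        self.totals[obj] += new_value - monitor.reported.get(obj, 0)

    def top_k(self):
        return sorted(self.totals, key=self.totals.get, reverse=True)[:self.k]
```

The slack plays the role of the installed arithmetic constraint: updates that stay inside it cost no communication, at the price of a bounded error in the maintained top-k set.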
The main drawback of this scheme is that the procedure of updating and re-allocation is not instantaneous, and thus the overall conflict resolution may not happen in real time. The problem of mining frequent itemsets from multiple distributed data streams has been studied by Manjhi et al. [181]. A naive solution is to combine frequency counts from the distributed nodes; however, as the number of nodes increases, a large number of data structures must be stored. The authors suggest a solution based on the precision of the frequency count maintained at each node. They introduce a hierarchical communication structure that maintains an error tolerance for frequency counts at each level, referred to as the precision gradient. The setting of the precision gradient is posed as an optimization problem with the objective of (1) minimizing the load on the central node to which answers are delivered, or (2) minimizing the worst case load on any communication link in the hierarchical structure. Kotecha et al. [163] address the problem of distributed classification of multiple targets in a wireless sensor network, casting it as a hypothesis testing problem. The major concern in multi-target classification is that as the number of targets increases, the number of hypotheses grows exponentially. The authors propose to re-partition the hypothesis space to reduce the exponential complexity. Ghoting and Parthasarathy [11] present algorithms for mining distributed streams with interactive response times. Their work performs a Directed Acyclic Graph (DAG) based decomposition of queries over distributed streams and makes use of this scheme to perform k-median clustering. They introduce a way to effectively update clustering parameters (such as k) by distributed interactive operator scheduling. A ticket-based scheduling algorithm is presented, along with an optimal distributed operator allocation for interactive data stream processing in a distributed setting. The authors adapt a graph partitioning scheme for stream query decomposition. While this is an interesting approach, it remains to be seen how well it can scale in large real-time systems. The VEhicle DAta Stream Mining (VEDAS) project [148] is an experimental system for monitoring vehicle data streams in real time. It is one of the very early distributed data mining systems that perform most of the data analysis and knowledge discovery operations on onboard computing devices. The data collected by onboard monitoring devices such as PDAs are subjected to principal component analysis (PCA) for dimensionality reduction. Since performing PCA in a resource-constrained environment may be expensive, the authors present ways to monitor changes in the covariance matrix, which is useful for incremental PCA and avoids recomputing the entire PCA. The fault detection module of the application handles vehicle health data. It makes use of incremental clusters to represent safe regimes of operation and can automatically monitor outliers in new vehicle data. The paper also provides mechanisms for drunk driver detection, which can be viewed as locating deviations from normal or characteristic behavior. In this section, we discussed several algorithms and applications for distributed stream mining.

18 There exists a significant amount of work on stream data architectures ([22, 177]), query processing ([206, 60, 178, 19]), stream-based programming languages and algebras ([208, 243, 74]), and applications ([235, 204]). The current proposal will not focus on these problems; a reader interested in data streams is referred to [22] for a detailed overview.
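The incremental covariance maintenance that makes streaming PCA affordable can be sketched with running sums. This is a minimal illustration of the general idea, not VEDAS's actual update rules.

```python
# Minimal sketch (not the actual VEDAS algorithm) of maintaining a running
# covariance matrix incrementally, so PCA can be refreshed on a stream
# without revisiting old observations. Keeps running sums of x and x x^T.
import numpy as np

class RunningCovariance:
    def __init__(self, dim):
        self.n = 0
        self.sum = np.zeros(dim)
        self.sum_outer = np.zeros((dim, dim))

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.sum += x
        self.sum_outer += np.outer(x, x)

    def covariance(self):
        mean = self.sum / self.n
        # E[x x^T] - mean mean^T  (population covariance)
        return self.sum_outer / self.n - np.outer(mean, mean)

rng = np.random.default_rng(0)
stream = rng.normal(size=(500, 3))
rc = RunningCovariance(3)
for row in stream:
    rc.update(row)
# The incrementally maintained covariance matches the batch computation,
assert np.allclose(rc.covariance(), np.cov(stream, rowvar=False, bias=True))
# and its eigen-decomposition yields the current principal components.
eigvals, eigvecs = np.linalg.eigh(rc.covariance())
```

Each stream update costs O(d^2) regardless of how many observations have been seen, which is what makes the approach attractive in a resource-constrained onboard setting.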
We argue that since data repositories on the grid are heterogeneous in nature and can be static or streaming, distributed data mining can play an important role in extracting patterns from repositories on the grid. In the next section we analyze the challenges for distributed data mining on the grid.

2.4 The Challenges


A Data Grid can be thought of as a distributed system having the following characteristics:

1. It comprises several resources (computers, sensors, etc.) storing data repositories (relational and XML databases, flat files, etc.).

2. The resources do not share a common memory or a clock.

3. They can communicate with one another by exchanging messages over a communication network.

4. Each resource has its own memory and can perform limited or extensive data-intensive tasks.



Table 2.1: Matched Catalog P and Q.

5. The data repositories owned and controlled by a resource are said to be local to it, while repositories owned by other machines are considered remote.

6. Accessing remote resources in the network is more expensive than accessing local resources, since it involves communication delays and CPU overhead to process communication protocols.

7. The resources are capable of forming virtual organizations amongst themselves. Members of a virtual organization are allowed to share data under local policies which specify what is shared, who is allowed to share, and the conditions for sharing. Sharing amongst disparate virtual organizations is allowed, although policies for sharing could be guided by different rules. Thus, a Data Grid can be conceived such that there is either (1) a hierarchy amongst virtual organizations, or (2) complete de-centralization amongst virtual organizations ([253, 254]).

Given the characteristics of the Data Grid, let us examine what a service-oriented architecture for distributed data mining on the grid requires.

1. Distributed Data Integration: The purpose of this is to integrate heterogeneous data repositories. Schema integration is a difficult problem, given that the data repositories contain different types of data (relational, unstructured, data streams) and different attributes, indices are not all aligned, and the criteria required to integrate them may be complex. We illustrate with examples from astronomy.

Example 1 (1) Consider the catalogs P and Q shown in Table 2.2. For the sake of discussion, we assume Catalog P has X and A attributes and Catalog Q has X and B attributes. We further assume that X is the join attribute. (2) Table 2.1 illustrates one possibility of aligning the catalogs. Notice that a value of the join attribute in one catalog may match more than one tuple in the other, in which case the matching tuple appears twice in the matched catalog. Also, if either Catalog P or Q has a join attribute value that the other does not have, that tuple will not appear in the matched catalog.
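The matched-catalog construction of Example 1 amounts to an equi-join on the shared attribute X. The following is a toy sketch; the attribute values are hypothetical, chosen only to exhibit the duplicate-match and unmatched-tuple cases described above.

```python
# Toy illustration of Example 1: joining two catalogs on a shared
# join attribute X. All values here are made up for illustration.
catalog_p = [  # (X, A)
    {"X": 1, "A": "a1"},
    {"X": 2, "A": "a2"},
    {"X": 3, "A": "a3"},
]
catalog_q = [  # (X, B)
    {"X": 2, "B": "b1"},
    {"X": 2, "B": "b2"},  # duplicate join value: matched twice
    {"X": 4, "B": "b3"},  # no partner in P: dropped from the matched catalog
]

matched = [
    {"X": p["X"], "A": p["A"], "B": q["B"]}
    for p in catalog_p
    for q in catalog_q
    if p["X"] == q["X"]
]
print(matched)
# X=2 in P matches both X=2 rows of Q, so that tuple appears twice;
# X=1, 3 (P only) and X=4 (Q only) do not appear in the matched catalog.
```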
Example 2 Consider that Catalog A stores Cartesian co-ordinates and Catalog B stores co-ordinates in a different system, both representing the spatial positions of astronomical objects (stars, galaxies, etc.). It is required to perform a


Table 2.2: Catalog P (Left) and Catalog Q (Right).

join between the two catalogs based on a probabilistic calculation that minimizes the parameter $\chi^2$ in the following equation:
\[
\chi^2 \;=\; \sum_{i} w_i \,\lvert \bar{x} - \bar{x}_i \rvert^2 \;+\; \lambda\,(\bar{x}\cdot\bar{x} - 1) \tag{2.1}
\]
Note that $w_i$ is a weighting parameter calculated from the astrometric precision of the survey, and $\lambda$ is the Lagrange multiplier in the minimization that ensures that $\bar{x}$ is a unit vector. The co-ordinates from Catalogs A and B for which the value of $\chi^2$ is minimized are chosen to be the cross-matched vertices and form the integrated table.

Complex data integration on the grid may also require manipulation of indexing schemes (e.g., spatial indices such as B-trees or R-trees) and integration of XML data sources. Very little work in the literature specifically deals with schema integration on grids. The OGSA-DAI and OGSA-DQP [205] projects provide service-based architectures for Data Access and Integration and for Distributed Query Processing; however, these projects do not explicitly deal with schema integration issues. Carmela Comito and Domenico Talia propose the Grid Data Integration System (GDIS) [70] for integrating heterogeneous XML data sources; this is a de-centralized, service-based integration architecture that handles semantic heterogeneity over data repositories. Alexander Wohrer et al. [6, 5] propose a Grid Data Mediation Service, which serves as a data integration system for the GridMiner framework (described in Section 2.2.4). The main purpose of the mediation service is to present distributed, heterogeneous data sources as one virtual data source on the grid using a flexible mapping scheme; both structured and unstructured data are supported. The InfoGrid [200] infrastructure of the Discovery Net project, described in Section 2.2.4, offers another interesting approach to data integration. However, none of the previously mentioned approaches deals with the integration or updating of indexing schemes inherently present in the data repositories. Many scientific databases, such as huge astronomy catalogs, typically depend upon spatial indexing mechanisms (B-trees, R-trees and their variants, or the Hierarchical Triangular Mesh, HTM19).
Attempts to integrate data repositories should
19 http://www.sdss.jhu.edu/htm/


also consider incorporation of these indexing schemes in the virtual tables created.

2. Distributed Data Preprocessing: Data normalization and dealing with missing values are some of the basic data pre-processing operations that need to be performed on the repositories. In order to ensure uniformity in the policies chosen across all repositories, some amount of communication needs to take place amongst the resources. Services designed to handle distributed data preprocessing will be particularly useful.

3. Distributed Data Mining: The DDM algorithms implemented on the grid infrastructure are expected to have the following features:

(a) Scalability: The Data Grid was envisioned to support data on the order of terabytes or petabytes. Centralization of data is generally not an option. Thus DDM algorithms for the Grid should be designed such that they have low communication cost and scale independently of the number of resources present in the grid.

(b) Decentralized: The algorithms should be able to run in the absence of a central co-ordinator.

(c) Locality Sensitive: The DDM algorithms on the grid must communicate with members of the virtual organization. This requires them to be local algorithms communicating only within a certain neighborhood. However, if the Data Grid is organized in such a manner that all the resources belong to a single virtual organization, then this restriction can be relaxed; in this particular case, the distributed algorithms can afford to be global algorithms having complete knowledge of the network.

(d) Asynchronous: Since data repositories are of different sizes and have different schemas and query mechanisms, distributed data mining operations on resources may take varying amounts of time. Also, the communication (bandwidth) delay in the network is unpredictable. Thus synchronized algorithms may not suit the requirements of Data Grids.
(e) Fault Tolerance: Process failures and communication link failures on the grid necessitate that the DDM algorithms be fault tolerant.

(f) Privacy and Security: The algorithms should honor the privacy of the individual resources in a Data Grid.

4. Distributed Workflow Composition: In order to enable distributed schema and data integration, execution of algorithms, co-ordination of partial results, and visualization and graphical representation of DDM results, it is important to be able to successfully compose grid-based services. Workflow composition has thus been an area of active research [52, 82, 81, 80, 198, 169, 142]. There are several well known web service flow specification languages, such as BPEL4WS20 and the Web Services Choreography Interface (WSCI21). Related
20 http://www-128.ibm.com/developerworks/library/specification/ws-bpel/
21 http://www.w3.org/TR/wsci/


workflow execution engines include IBM's Business Process Execution Language for Web Services Java(TM) Run Time (BPWS4J22), the Collaxa23 BPEL4WS Server, and the Self-Serv environment for web service composition [24]. Triana [66, 179, 180] is a workflow-based Problem Solving Environment (PSE) that has been used by scientists for a range of tasks including signal, text and image processing. It aims to provide seamless access to distributed services, connects heterogeneous grids, and abstracts the core capabilities needed for service-based computing (in P2P, Grid Computing or Web Services). Some of the interesting features of the Triana framework include: (1) easy GUI-based composition of web services; (2) distributed execution of composite web services; (3) support for sensitivity analysis; (4) the ability for users to annotate workflows and maintain provenance information. Web service composition comprises several mechanisms, including: (1) service discovery, which allows the location of relevant services (querying a UDDI registry or importing a WSDL document are two possibilities); (2) service composition using a GUI; (3) transparent execution methods and distribution of workflows across P2P or Grid frameworks; (4) transparent publishing of services. A data mining toolkit that enables web service composition has been developed by Ali, Rana and Taylor [7]. It is built on top of WEKA and uses the Triana workflow environment. The toolkit is able to handle classification, clustering and association rule mining. Although current technologies provide the basic foundations for web service composition, there are still many open research problems. We list some of these issues, noting that these challenges need to be addressed before a composite service based distributed data mining toolkit can be built on the grid:

(a) Service composition still requires a significant amount of low level programming on the part of developers and users.
Thus it requires a lot of overhead for development, testing and maintenance.

(b) The number of services to be composed can be large and dynamic, and the services may require significant communication among themselves.

(c) Very few [24, 66] of the existing technologies deal with distributed workflows.

In this section, we outlined the challenges in the development of a distributed data mining system on a grid infrastructure. The following chapter examines preliminary work.

22 http://www.alphaworks.ibm.com/tech/bpws4j

23 http://www.javaskyline.com/20030311_collaxa.html


Chapter 3

Preliminary Work
3.1 Introduction
The primary purpose of this proposal is to motivate research in distributed data mining on a grid infrastructure. While substantial work is being done in the fields of distributed data mining (see Section 2.3) and grid mining (see Section 2.2.4) separately, there seems to have been very little effort to bring the two technologies together. In the astronomy community, for example, several sky surveys (such as SDSS [242], 2MASS [3], POSS [224]) are coming online as part of the National Virtual Observatory [203]. An effort is being made to bring the NVO to the grid using the TeraGrid1 framework. While mining across several surveys is an appealing idea, architectures for distributed data mining on the grid are still evolving. Consider the problem of classification of galaxies and stars across multiple sky surveys. To exploit information from different catalogs, there needs to be a mapping between an object in one catalog and an object in another (a cross-match), thus creating a virtual matched table. However, the different resolutions of surveys can cause two close objects in one survey to appear as a single object in another2. This problem can be addressed by probabilistic associations between objects. Once a cross-match has been performed, a single classifier or an ensemble may be used for building the model. A related problem is that of learning, in large NASA astrophysics mission data streams3, for automatic discovery of merging / colliding galaxies. In order to achieve this, architectures need to be developed for incorporating data streams into the grid, and algorithms for distributed data stream mining have to be built. In this section we first discuss Orthogonal Decision Trees (introduced in Section 2.3.2) as a method for removing redundancies in ensemble classifiers.
We briefly review the Fourier representation of decision tree ensembles introduced elsewhere [151, 211, 212, 209], provide a technique to construct redundancy-free Orthogonal Decision Trees (ODTs) based on eigen-analysis of the ensemble, and offer experimental results to document the performance of ODTs on the grounds of accuracy and model complexity. Next, we examine the application of ODTs to a data stream scenario4. Finally, we examine the feasibility of distributed data mining on federated astronomy catalogs. We describe the framework of the National Virtual Observatory, propose an architecture for the Distributed Exploration of Massive Astronomy Catalogs (DEMAC) system, and discuss its integration with the grid.

1 http://www.us-vo.org/grid.cfm
2 http://www.siam.org/news/news.php?id=411
3 http://is.arc.nasa.gov/IDU/tasks/NVODDM.html

3.2 Orthogonal Decision Trees


Decision tree [227] ensembles are frequently used in data mining and machine learning applications. Boosting [103, 88], Bagging [34], Stacking [276], and Random Forests [35] are some of the well-known ensemble-learning techniques. Many of these techniques produce large ensembles that combine the outputs of a large number of trees to produce the overall output. Ensemble-based classification and outlier detection techniques are also frequently used in mining continuous data streams [96, 249]. Large ensembles pose several problems to a data miner. They are difficult to understand, and the overall functional structure of the ensemble is not very actionable, since it is difficult to manually combine the physical meaning of the different trees in order to produce a simplified set of rules that can be used in practice. Moreover, in many time-critical applications, such as monitoring data streams in resource-constrained environments [151], maintaining a large ensemble and using it for continuous monitoring are computationally challenging. So it would be useful to develop a technique for constructing a redundancy-free, meaningful, compact representation of large ensembles.

This section presents a technique to construct redundancy-free decision-tree-ensembles by constructing orthogonal decision trees. The technique first constructs an algebraic representation of the trees using multivariate discrete Fourier bases. The new representation is then used for eigen-analysis of the covariance matrix generated by the decision trees in the Fourier representation. The proposed approach then converts the corresponding principal components back to decision trees. These trees are defined in the original attribute space and are functionally orthogonal to each other. The orthogonal trees are in turn used for an accurate (in many cases with improved accuracy) and redundancy-free (in the sense of an orthogonal basis set) compact representation of large ensembles.
The main motivation behind this approach is to create an algebraic framework for meta-level analysis of the models produced by ensemble learning, data stream mining, distributed data mining, and other related techniques. Most existing techniques treat discrete model structures such as the decision trees in an ensemble primarily as black boxes: only the outputs of the models are considered and combined in order to produce the overall output. The Fourier basis offers a compact representation of a discrete structure that allows algebraic manipulation of decision trees. For example, we can literally add two different trees, produce a weighted average of the trees themselves, or perform eigen-analysis of an ensemble of trees. The Fourier representation of decision trees may offer something philosophically similar to what the spectral representation
4 We use a physiological health monitoring data stream for illustration.

of graphs [65] offers: an algebraic representation that allows deep analysis of discrete structures. The Fourier representation allows us to bring in the rich volume of well-understood techniques from linear algebra and linear systems theory. This opens up many exciting possibilities for future research, such as quantifying the stability of an ensemble classifier, or mining and monitoring mission-critical data streams using properties of the eigenvalues of the ensemble. The following section reviews the Fourier representation of decision trees.
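As a small illustration of the eigen-analysis step, the following sketch (my own minimal example, not the proposal's implementation, with hypothetical base classifiers given directly as Boolean functions) represents each base classifier by its vector of outputs over the domain, diagonalizes the covariance of these vectors, and checks that the resulting component functions are mutually orthogonal:

```python
# Minimal sketch (illustrative only) of the eigen-analysis underlying ODTs:
# represent each base classifier by its output vector over a tiny Boolean
# domain, then diagonalize the covariance matrix of those vectors.
import itertools
import numpy as np

domain = list(itertools.product([0, 1], repeat=3))  # all 3-bit inputs

# Three hypothetical base "trees", given directly as Boolean functions.
trees = [
    lambda x: x[0],
    lambda x: x[0] ^ x[1],
    lambda x: x[0] or x[2],
]
# Rows: classifiers; columns: domain members (functional behavior vectors).
F = np.array([[float(t(x)) for x in domain] for t in trees])

C = np.cov(F)                       # covariance between the classifiers
eigvals, eigvecs = np.linalg.eigh(C)
# Project the (centered) classifiers onto the principal directions: each row
# of P is a new component function over the domain.
P = eigvecs.T @ (F - F.mean(axis=1, keepdims=True))
cross = P @ P.T
# Off-diagonal entries vanish: the component functions are orthogonal.
assert np.allclose(cross - np.diag(np.diag(cross)), 0)
```

In the proposal, the same eigen-analysis is carried out on the Fourier spectra of the trees, and the resulting principal components are converted back into (orthogonal) decision trees.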

3.2.1 Decision Trees and the Fourier Representation


This section reviews the Fourier representation of decision tree ensembles, introduced elsewhere [149, 152].

Decision Trees as Numeric Functions

The approach described in this section makes use of a linear algebraic representation of the trees. In order to do that, we first need to convert each tree into a numeric tree in case the attributes are symbolic. A decision tree defined over a domain of categorical attributes can be treated as a numeric function. First note that a decision tree is a function that maps its domain members to a range of class labels. Sometimes it is a symbolic function, where attributes take symbolic (non-numeric) values. However, a symbolic function can easily be converted to a numeric function by replacing the symbols with numeric values in a consistent manner. Since the proposed approach of constructing orthogonal trees uses this representation only as an intermediate stage, and eventually converts the spectrum back to a physical tree, the exact scheme for replacing the symbols (if any) does not matter as long as it is consistent. Once the tree is converted to a discrete numeric function, we can also apply any appropriate analytical transformation. Fourier transformation is one such interesting possibility. The Fourier representation of a function is a linear combination of the Fourier basis functions. The weights, called Fourier coefficients, completely define the representation. Each coefficient is associated with a Fourier basis function that depends on a certain subset of the features defining the domain.

A Brief Review of the Fourier Basis in the Boolean Domain

Fourier bases are orthogonal functions that can be used to represent any discrete function; in other words, they form a functionally complete representation. Consider the set of all $l$-dimensional feature vectors, where the $m$-th feature can take $\lambda_m$ different categorical values. The Fourier basis set that spans this space is comprised of $\prod_{m=1}^{l} \lambda_m$ basis functions. Each Fourier basis function is defined as
\[
\psi_{\bar{j}}^{\bar{\lambda}}(\bar{x}) \;=\; \frac{1}{\sqrt{\prod_{m=1}^{l} \lambda_m}} \prod_{m=1}^{l} \exp\!\Big(\frac{2\pi i}{\lambda_m}\, x_m j_m\Big),
\]
where $\bar{x}$ and $\bar{j}$ are vectors of length $l$; $x_m$ and $j_m$ are the $m$-th attribute-values in $\bar{x}$ and $\bar{j}$, respectively; $\bar{\lambda} = (\lambda_1, \ldots, \lambda_l)$ represents the feature-cardinality vector; and $\psi_{\bar{j}}^{\bar{\lambda}}$ is called the $\bar{j}$-th basis function. The vector $\bar{j}$ is called a partition, and the order of a partition is the number of non-zero feature values it contains. A Fourier basis function depends on some $x_m$ only when the corresponding $j_m \neq 0$. If a partition $\bar{j}$ has exactly $\alpha$ non-zero values, then we say the partition is of order $\alpha$, since the corresponding Fourier basis function depends only on those $\alpha$ variables that take non-zero values in the partition $\bar{j}$.

A function $f$ that maps an $l$-dimensional discrete domain to a real-valued range can be represented using the Fourier basis functions:
\[
f(\bar{x}) \;=\; \sum_{\bar{j}} w_{\bar{j}}\, \overline{\psi_{\bar{j}}^{\bar{\lambda}}}(\bar{x}),
\]
where $w_{\bar{j}}$ is the Fourier Coefficient (FC) corresponding to the partition $\bar{j}$ and $\overline{\psi_{\bar{j}}^{\bar{\lambda}}}$ is the complex conjugate of $\psi_{\bar{j}}^{\bar{\lambda}}$; the coefficients are obtained as $w_{\bar{j}} = \sum_{\bar{x}} f(\bar{x})\, \psi_{\bar{j}}^{\bar{\lambda}}(\bar{x})$. The Fourier coefficient $w_{\bar{j}}$ can be viewed as the relative contribution of the partition $\bar{j}$ to the function value of $f(\bar{x})$. Therefore, the absolute value of $w_{\bar{j}}$ can be used as the significance of the corresponding partition $\bar{j}$. If the magnitude of some $w_{\bar{j}}$ is very small compared to the other coefficients, we may consider the $\bar{j}$-th partition to be insignificant and neglect its contribution. The order of a Fourier coefficient is nothing but the order of the corresponding partition. We shall often use terms like high order or low order coefficients to refer to sets of Fourier coefficients whose orders are relatively large or small, respectively. The energy of a spectrum is defined by the summation $\sum_{\bar{j}} |w_{\bar{j}}|^2$. Let us also define the inner product between two spectra $\mathbf{w}^{(1)}$ and $\mathbf{w}^{(2)}$, where $\mathbf{w}$ is the column matrix of all Fourier coefficients in an arbitrary but fixed order: $\langle \mathbf{w}^{(1)}, \mathbf{w}^{(2)} \rangle = {\mathbf{w}^{(1)}}^{T} \mathbf{w}^{(2)}$, where the superscript $T$ denotes the transpose operation. We will also use the definition of the inner product between a pair of real-valued functions $f_1$ and $f_2$ defined over some domain $\Omega$: $\langle f_1, f_2 \rangle = \sum_{\bar{x} \in \Omega} f_1(\bar{x}) f_2(\bar{x})$. The following section considers the Fourier spectrum of decision trees and discusses some of its useful properties.

Properties of Decision Trees in the Fourier Domain

For almost all practical purposes decision trees have bounded depths. This section will therefore consider decision trees of finite depth bounded by some constant. The underlying functions in such decision trees are computable by a constant-depth Boolean AND and OR circuit (or, equivalently, an $AC^0$ circuit). Linial et al. [174] noted that the Fourier spectrum of an $AC^0$ circuit has very interesting properties and proved the following lemma.

Lemma 1 (Linial, 1993) Let $M$ and $d$ be the size and depth of an $AC^0$ circuit. Then
\[
\sum_{|\bar{j}| > k} |w_{\bar{j}}|^2 \;\leq\; 2M\, 2^{-k^{1/d}/20},
\]
where $|\bar{j}|$ denotes the order (the number of non-zero variables) of partition $\bar{j}$ and $k$ is a non-negative integer.

The term on the left hand side of the inequality represents the energy of the spectrum captured by the coefficients with order greater than a given constant $k$. The lemma essentially states the following properties about decision trees:

1. High order Fourier coefficients are small in magnitude.
2. The energy preserved in all high order Fourier coefficients is also small.

The key aspect of these properties is that the energy of the Fourier coefficients decays exponentially with increasing order. This observation suggests that the spectrum of a Boolean decision tree (or, equivalently, a bounded-depth function) can be approximated by computing only a small number of low order Fourier coefficients. So the Fourier basis offers an efficient numeric representation of a decision tree in terms of an algebraic function that can be easily stored and manipulated. The exponential decay property of the Fourier spectrum also holds for non-Boolean decision trees; the complete proof is given in the appendix. There are two additional important characteristics of the Fourier spectrum of a decision tree that we will use in this section:

1. The Fourier spectrum of a decision tree can be efficiently computed [151].
2. The Fourier spectrum can be directly used for constructing the tree. In other words, we can go back and forth between the tree and its spectrum. This is philosophically similar to switching between the time and frequency domains in the traditional application of Fourier analysis for signal processing.

These two issues will be discussed in detail later in this proposal. Before that, we would like to make a note of one additional property: Fourier transformation of decision trees preserves the inner product. The functional behavior of a decision tree is defined by the class labels it assigns. Therefore, if $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_N$ are the members of the domain, then the functional behavior of a decision tree $f$ can be captured by the vector $(f(\bar{x}_1), f(\bar{x}_2), \ldots, f(\bar{x}_N))^T$, where the superscript $T$ denotes the transpose operation. The following lemma proves that the inner product between two such vectors is identical to the inner product between their respective Fourier spectra.

Lemma 2 Let $f_1$ and $f_2$ be two functions with Fourier spectra $\mathbf{w}^{(1)}$ and $\mathbf{w}^{(2)}$, respectively. Then $\langle f_1, f_2 \rangle = \langle \mathbf{w}^{(1)}, \mathbf{w}^{(2)} \rangle$.

Proof:
\[
\langle f_1, f_2 \rangle
= \sum_{\bar{x}} f_1(\bar{x}) f_2(\bar{x})
= \sum_{\bar{x}} \Big(\sum_{\bar{j}} w^{(1)}_{\bar{j}}\, \overline{\psi_{\bar{j}}}(\bar{x})\Big)\Big(\sum_{\bar{k}} w^{(2)}_{\bar{k}}\, \psi_{\bar{k}}(\bar{x})\Big)
= \sum_{\bar{j}} \sum_{\bar{k}} w^{(1)}_{\bar{j}} w^{(2)}_{\bar{k}} \sum_{\bar{x}} \overline{\psi_{\bar{j}}}(\bar{x})\, \psi_{\bar{k}}(\bar{x})
= \sum_{\bar{j}} w^{(1)}_{\bar{j}} w^{(2)}_{\bar{j}}
= \langle \mathbf{w}^{(1)}, \mathbf{w}^{(2)} \rangle.
\]
The fourth step is true since the Fourier basis functions are orthonormal.
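In the Boolean special case ($\lambda_m = 2$ for all $m$), the basis functions reduce to real-valued parity functions, and both orthonormality and the inner-product preservation of Lemma 2 can be checked numerically. The following is my own minimal sketch, using the $1/\sqrt{2^l}$ normalization:

```python
# Boolean-domain sketch: with all feature cardinalities equal to 2, the basis
# functions become psi_j(x) = (-1)^(x . j) / sqrt(2^l). We compute spectra of
# two small functions and verify orthonormality and Lemma 2.
import itertools

l = 3
N = 2 ** l
domain = list(itertools.product([0, 1], repeat=l))

def psi(j, x):
    parity = sum(a * b for a, b in zip(j, x)) % 2
    return (-1) ** parity / N ** 0.5

def spectrum(f):
    return {j: sum(f(x) * psi(j, x) for x in domain) for j in domain}

f1 = lambda x: float(x[0] and x[1])  # a depth-2 "decision tree"
f2 = lambda x: float(x[0])
w1, w2 = spectrum(f1), spectrum(f2)

# Orthonormality: <psi_j, psi_k> = 1 if j == k, else 0.
for j in domain:
    for k in domain:
        ip = sum(psi(j, x) * psi(k, x) for x in domain)
        assert abs(ip - (1.0 if j == k else 0.0)) < 1e-9

# Lemma 2: <f1, f2> over the domain equals <w1, w2> over the spectra.
lhs = sum(f1(x) * f2(x) for x in domain)
rhs = sum(w1[j] * w2[j] for j in domain)
assert abs(lhs - rhs) < 1e-9
```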

Figure 3.1: A Boolean decision tree

3.2.2 Computing the Fourier Transform of a Decision Tree


The Fourier spectrum of a given tree can be computed efficiently by traversing the tree. This section first reviews an algorithm to do that. It then discusses aggregation of the multiple spectra computed from the base classifiers of an ensemble, and extends the technique to deal with non-Boolean class labels. Kushilevitz and Mansour [165] considered the issue of learning the low order Fourier spectrum of the target function (represented by a Boolean decision tree) from a data set with uniformly distributed observations. Note that the current contribution is fundamentally different from their goal: we do not try to learn the spectrum directly from the data. Rather, we consider the problem of computing the spectrum from the decision tree generated from the data.

Schema Representation of a Decision Path

For the sake of simplicity, let us consider a Boolean decision tree as shown in Figure 3.1. The Boolean class labels correspond to positive and negative instances of the concept class. We can express a Boolean decision tree as a function $f : \{0,1\}^l \rightarrow \{0,1\}$. The function maps positive and negative instances to one and zero, respectively. A node in the tree is labeled with a feature. A downward link from a node labeled with the $i$-th feature is labeled with an attribute value of that feature. The path from the root node to a successor node represents the subset of data that satisfies the different feature values labeled along the path. These subsets of the domain are essentially similarity-based equivalence classes, and we shall call them schemata (schema in singular form). If $h$ is a schema, then $h \in \{0, 1, *\}^l$, where $*$ denotes a wildcard that matches any value of the corresponding feature. For example, a path in Figure 3.1 represents the schema whose features along the path take the fixed values labeling that path, with all remaining features set to the wildcard $*$. We often use the term order to represent the number of non-wildcard values in a schema. The following section describes an algorithm to extract Fourier coefficients from a tree.


Extracting and Calculating Significant Fourier Coefficients from a Tree

Considering a decision tree as a function, the Fourier transform of a decision tree can be defined as:
\[
w_{\bar{j}} \;=\; \sum_{\bar{x} \in \Omega} f(\bar{x})\, \psi_{\bar{j}}(\bar{x}) \;=\; \sum_{h \in L} f(h) \sum_{\bar{x} \in \Omega_h} \psi_{\bar{j}}(\bar{x}), \tag{3.1}
\]
where $\Omega$ denotes the complete instance space, $L$ is the set of leaf nodes, $\Omega_h$ is the instance subspace which leaf node $h$ covers, $f(h)$ is the class label assigned at $h$, and $h$ is the schema defined by the path to that leaf node. (Note that any path to a node in a decision tree is essentially a subspace or hyperplane; thus it is a schema.)

Lemma 3 For any Fourier basis function $\psi_{\bar{j}}$ with $\bar{j} \neq \bar{0}$, $\sum_{\bar{x} \in \Omega} \psi_{\bar{j}}(\bar{x}) = 0$.

Proof: Since the Fourier basis functions form an orthogonal set, $\sum_{\bar{x} \in \Omega} \psi_{\bar{j}}(\bar{x})\, \overline{\psi_{\bar{0}}}(\bar{x}) = 0$ for $\bar{j} \neq \bar{0}$. Here, $\psi_{\bar{0}}$ is the zero-th Fourier basis function, which is constant for all $\bar{x}$; hence the sum vanishes.

Lemma 4 Let $h$ be a schema defined by the path to a leaf node. If $\bar{j}$ has a non-zero attribute value at a position where $h$ has no value (wild-card), then $\sum_{\bar{x} \in \Omega_h} \psi_{\bar{j}}(\bar{x}) = 0$.

Proof: Write $\bar{x} = (\bar{x}_d, \bar{x}_w)$, where $\bar{x}_d$ denotes the features that are fixed in $h$ and $\bar{x}_w$ the features that are wild-cards in $h$. Since all values of $\bar{x}_d$ are fixed in $\Omega_h$, the corresponding factor of $\psi_{\bar{j}}(\bar{x})$ is constant for all $\bar{x} \in \Omega_h$, while $\bar{x}_w$ ranges over redundant (multiple) copies of a complete domain with respect to the wild-card features. Therefore, by Lemma 3, for a leaf node $h$, $\sum_{\bar{x} \in \Omega_h} \psi_{\bar{j}}(\bar{x}) = 0$.

Lemma 5 For any Fourier coefficient $w_{\bar{j}}$ whose order is greater than the depth of a leaf node $h$, $\sum_{\bar{x} \in \Omega_h} \psi_{\bar{j}}(\bar{x}) = 0$. If the order of $\bar{j}$ is greater than the depth of the tree, then $w_{\bar{j}} = 0$.

Proof: The proof immediately follows from Lemma 4.
Thus, for a FC to be non-zero, there should exist at least one schema h that has non-wild-card attributes for all non-zero attributes of j. In other words, there exists a set of non-zero FCs associated with a schema h. This observation leads us to a direct way of detecting and calculating all non-zero FCs of a decision tree: For each schema h (or path) from the root, we can easily detect all non-zero FCs by enumerating all FCs associated with h. Before describing details of the algorithm, let us dene some notations. Operator is dened over a vector h and an attribute-value pair, . If is the -th attribute (or feature), it outputs a new vector by replacing -th value of h with . For example, for h = 1** and the feature , . Here, we assume indexing starts from zero. operates on both schemata and partitions. Next, let us dene a function , which takes in as an input and outputs a set of pairs. Here, denotes that is the -th attributes and is a non-zero value of . If has the cardinality of , then . Let us dene another operator over a set of partition S and . It outputs a new set of partitions by applying over all possible pairs between S and . This can be considered as Cartesian product of S and . For example, let S = {000,010} attribute of cardinality two (with possible non-zero value of 1). Then, and be the = {(2,1)} and . Finally, let us consider a non-leaf node that has children. In other words, there exist disjoint subtrees below . If is the feature appearing in , then denotes the average output value of domain members covered by a subtree accessible through the -th child of . For example, in Figure 3.2, is and is one. Note that is equivalent to the average of schema h, where h denotes the path (from the root node) to -th subtree of the node where appears. The algorithm starts with pre-calculating all -s (This is essentially recursive and corresponding is calcuTree-Visit operation). Initially, lated with overall average of output. In Figure 3.2, it is: . 
The algorithm continues to extract all remaining non-zero FCs in a recursive fashion from the root. If we assume that the tree in Figure 3.2 is built from data with three Boolean attributes, the non-zero coefficients associated with each schema can be computed using Equation 3.1. The pseudo-code of the algorithm is presented in Figure 3.3.

Fourier Spectrum of an Ensemble Classifier

The Fourier spectrum of an ensemble classifier that consists of multiple decision trees can be computed by aggregating the spectra of the individual base models. Let f(x) be the underlying function computed by a tree ensemble in which the output of the ensemble is a weighted linear combination of the outputs of the base tree classifiers:

    f(x) = Σ_{k=1}^{n} a_k f_k(x),    (3.2)

where f_k and a_k are the k-th decision tree and its weight, respectively. Let J_k be the set of non-zero Fourier coefficients detected for decision tree f_k, and let w_j^(k) be a Fourier coefficient in its spectrum. Then Equation 3.2 can be written as

    f(x) = Σ_{k=1}^{n} a_k Σ_{j ∈ J_k} w_j^(k) ψ_j(x),

so the spectrum of the ensemble is simply the weighted sum of the spectra of its members.
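Because the ensemble is a weighted linear combination, its coefficients are just the weighted sums of the member coefficients. A minimal sketch, under the assumption (made only for illustration) that each spectrum is stored as a sparse dict from a partition string to its coefficient:

```python
def aggregate_spectra(spectra, weights):
    """Spectrum of f(x) = sum_k weights[k] * f_k(x): coefficients add linearly."""
    ensemble = {}
    for spectrum, w in zip(spectra, weights):
        for partition, coeff in spectrum.items():
            ensemble[partition] = ensemble.get(partition, 0.0) + w * coeff
    return ensemble
```

The sparse-dict layout matters in practice: the aggregated spectrum stays as small as the union of the member supports, which is what makes ensemble aggregation in the spectrum domain attractive.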
Figure 3.2: An instance of Boolean decision tree that shows average output values at each subtree.

The following section extends the Fourier spectrum-based approach for representing and aggregating decision trees to domains with multiple class labels.

Fourier Spectrum of Multi-Class Decision Trees

A multi-class decision tree has k different class labels. In general, we can assume that each label is assigned a unique integer value. Since such decision trees are also functions that map an instance vector to a numeric value, the Fourier representation of such a tree is essentially no different. However, the Fourier spectrum cannot be directly applied to represent an ensemble of decision trees that uses voting as its aggregation scheme: the Fourier spectrum faithfully represents functions in closed form, and voting-based ensemble classifiers are not such functions. Therefore, we need a different approach to model multi-class decision trees with the Fourier basis. Let us consider a decision tree that has k classifications, and let us define the Fourier spectrum of a decision tree whose class labels are all set to zero except the c-th class; in other words, we treat the tree as having a Boolean classification with respect to the c-th class label. If we define f_c to be the partial function that computes the inverse Fourier transform using this spectrum, the classification of an input vector x is written as:

    f(x) = Σ_{c=1}^{k} y_c f_c(x)


Figure 3.3: Algorithm for obtaining the Fourier spectrum of a decision tree. λ_i denotes the cardinality of attribute x_i, and |h| denotes the size of the subspace a schema h covers. |Ω| is the size of the complete instance space. The pair (i, a) in line 4 denotes that x_i is the i-th attribute.

Here each y_c corresponds to a mapped value for the c-th classification. Note that if x belongs to the c-th class, f_q(x) = 1 when q = c, and 0 otherwise. Now let us consider an ensemble of decision trees in weighted linear combination form. Then the ensemble's partial function F_c(x) can be written as F_c(x) = Σ_k a_k f_{k,c}(x), where a_k and f_{k,c} represent the weight of the k-th tree in the ensemble and its partial function for the c-th classification, respectively. Finally, the classification of an ensemble of decision trees that adopts voting as its aggregation scheme can be defined as: argmax_c F_c(x). In this section, we discussed the Fourier representation of decision trees. We showed that the Fourier spectrum of a decision tree is very compact in size. In particular, we proved that the exponential decay property also holds for the Fourier spectrum of non-Boolean decision trees. In the next section, we will describe how the Fourier spectrum of an ensemble can be used to construct a single tree.
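For the Boolean case, where the basis functions are ψ_j(x) = (−1)^⟨j, x⟩, the voting scheme can be sketched as follows. The data layout (one sparse spectrum per class per tree) and the function names are assumptions of this sketch, not the thesis's implementation.

```python
def inverse_fourier(spectrum, x):
    """Evaluate f(x) = sum_j w_j * (-1)^<j, x> over a Boolean domain."""
    total = 0.0
    for partition, coeff in spectrum.items():
        parity = sum(int(j) * xi for j, xi in zip(partition, x))
        total += coeff * (-1) ** parity
    return total

def ensemble_vote(per_tree_class_spectra, weights, x):
    """Voting aggregation: argmax_c sum_k weights[k] * f_{k,c}(x)."""
    num_classes = len(per_tree_class_spectra[0])
    scores = [
        sum(w * inverse_fourier(spectra[c], x)
            for spectra, w in zip(per_tree_class_spectra, weights))
        for c in range(num_classes)
    ]
    return max(range(num_classes), key=scores.__getitem__)
```

For example, the spectrum {"00": 0.5, "10": -0.5} encodes f(x) = x_0 on two Boolean attributes, so inverse_fourier of it at x = (1, 0) gives 1.0.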

3.2.3 Construction of a Decision Tree from Fourier Spectrum

This section discusses an algorithm to construct a tree from the Fourier spectrum of an ensemble of decision trees. The following section first shows that the information gain needed to choose an attribute at the decision nodes can be efficiently computed from the Fourier coefficients.

 1 Function ExtractFS(input: Partition Set S, Node n, Schema h)
 2   x_i := feature appearing in n
 3   ...
 4   i := position of x_i
 5   ...
 6   for each partition in S
 7     for each possible value of x_i
 8       ...
 9       ...
10     end
12   end
13   ...
14   for each child node c of n
15     ...
16     ExtractFS(S, ...)
17   end
18 end

Schema Average and Information Gain

Consider a classification problem with Boolean class labels {0, 1}. Recall that a schema h denotes a path to a node in a decision tree. In order to compute the information gain introduced by splitting the node using a particular attribute, we first need to compute the entropy of the class distribution at that node. We do that by introducing a quantity called the schema average. Let us define the schema average function value as follows:
    avg(h) = (1 / |h|) Σ_{x ∈ h} f(x),

where f(x) is the classification value of x and |h| denotes the number of members in schema h. Note that the schema average is nothing but the frequency of all instances of the schema with a classification value of 1; similarly, the frequency of the tuples with a classification value of 0 is 1 − avg(h). It can therefore be used to compute the entropy at the node:

    confidence(h) = max(avg(h), 1 − avg(h)),
    entropy(h) = −avg(h) log avg(h) − (1 − avg(h)) log (1 − avg(h)).
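As a brute-force check of these definitions (evaluating avg(h) from raw instances, which the spectrum-based expression discussed in the text is designed to avoid), one could write:

```python
import math

def schema_average(instances, labels, h):
    """avg(h): fraction of instances covered by schema h whose class label is 1
    ('*' in h matches any attribute value)."""
    covered = [y for x, y in zip(instances, labels)
               if all(c == "*" or c == xi for c, xi in zip(h, x))]
    return sum(covered) / len(covered) if covered else 0.0

def binary_entropy(p):
    """Entropy of the class distribution (p, 1 - p) at a node."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```

Here schema_average and binary_entropy are hypothetical names; in the text, avg(h) is ultimately computed from the Fourier spectrum rather than from the instances themselves.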

The computation of avg(h) using the above expression for a given ensemble is not practical, since we would need to evaluate f(x) for all x. Instead, we can use the following expression, which computes avg(h) directly from the given Fourier spectrum (FS):
    avg(h) = Σ_{j ∈ P(h)} w_j ψ_j(h),    (3.3)
where h has non-wild-card values at its fixed positions and P(h) denotes the set of partitions whose non-zero features appear only at those positions. A similar Walsh analysis-based approach for analyzing the behavior of genetic algorithms can be found elsewhere [110]. Note that the summations in Equation 3.3 are defined only over the fixed (non-wild-card) positions that correspond to the features defining the path to the node. Using Equation 3.3 as a tool to obtain information gain, it is relatively straightforward to come up with a version of ID3- or C4.5-like algorithms that work using the Fourier spectrum. However, a naive approach may be computationally inefficient. The computation of avg(h) requires a number of FCs that is exponential in the order of h; thus, the cost involved in computing avg(h) increases exponentially as the tree becomes deeper. Moreover, since the Fourier spectrum of the ensemble is very compact in size, most Fourier coefficients involved in computing avg(h) are zero. Therefore, the evaluation of avg(h) using Equation 3.3 is not only inefficient but also involves unnecessary computations. Construction of a more efficient algorithm to compute avg(h) is possible by taking advantage of the recursive and decomposable nature of Equation 3.3. When computing the average of an order-l schema h, we can reduce some computational steps if any of


Figure 3.4: Algorithm for constructing a decision tree from Fourier spectrum (TCFS).

the order l−1 schemata which subsume h is already evaluated. For a simple example in the Boolean domain, consider the evaluation of avg(h) for an order-l schema h, and assume that the average of an order l−1 schema subsuming h is pre-calculated. Then avg(h) is obtained by simply adding to it the contributions of the newly fixed position. This observation leads us to an efficient algorithm to evaluate schema averages. Recall that the path to a node from the root in a decision tree is represented as a schema. Then, choosing an attribute for the next node is essentially the same as selecting the best schema among those candidate schemata that are subsumed by the current schema and whose orders are just one higher. In the following section, we describe a tree construction algorithm that is based on these observations.

Bottom-up Approach to Construct a Tree

Before describing the algorithm, we need to introduce some notation. Let h and h′ be two schemata, where the order of h′ is one higher than that of h: schema h′ is identical to h except at one position, where the i-th feature is set to some value a. For example, consider the schema h = (*1**2) and a schema h′ obtained from h by fixing one additional feature. Here we assign an integer number-based ordering among the features (zero for the leftmost feature). P(h′) denotes the set of partitions that are required to compute avg(h′) (see Equation 3.3). An i-fixed partition is a partition with a non-zero value at the i-th position. Let NZ_i be the set of order-one i-fixed partitions, and let S_i(h′) be the partial sum of avg(h′) that only includes i-fixed partitions. Now the information gain achieved by choosing the i-th feature with a given h is redefined using these new notations:


    Gain(h, i) = entropy(avg(h)) − (1/λ_i) Σ_{a=0}^{λ_i − 1} entropy(avg(h ⊕ (i, a))),

where the partitions needed for the averages avg(h ⊕ (i, a)) are obtained through ⊗, the Cartesian product, and λ_i is the cardinality of the i-th feature, respectively.


1 Function TCFS(input: Fourier Spectrum FS)
2   Initialize Candidate Feature Set CFSET
3   create node
4   h := (***...***)
5   Build(h, FS, CFSET)
6   return
7 end

 1 Function Build(input: Schema h, Fourier Spectrum FS, Candidate Feature Set CFSET)
 2   create node
 3   odr := order(h)
 4   Marked := {}
 5   for each Fourier Coefficient within order odr from FS
 6     ft = intersect(h, i, CFSET)
 7     if ft is not NULL
 8       for each value of ft
 9         update the corresponding partial sum with the coefficient
10       end
11       add the coefficient to Marked
12     end
13   end
14   if Marked is empty
15     set label for the node using the average of h
16     return
17   end
18   for each feature in CFSET
19     ...
20   end
21   remove the feature with the maximum gain from CFSET
22   ...
23   FS := FS − Marked
24   for each possible branch of the selected feature
25     update h with the branch value
26     Build(h, FS, CFSET)
27   end
28   add the selected feature back into CFSET
29   add Marked back into FS
30   return
31 end

Figure 3.5: Algorithm for constructing a decision tree from Fourier spectrum (TCFS). order(h) returns the order of schema h. intersect(h, i) returns the feature to be updated using the i-th coefficient, if such a feature exists; otherwise it returns NULL.

Now we are ready to describe the Tree Construction from Fourier Spectrum (TCFS) algorithm, which essentially exploits the decomposable definition of avg(h) and focuses on computing the partial sums S_i. Note that, with a given h (the current path), selecting the next feature is essentially identical to choosing the i-th feature that achieves the maximum gain. Therefore, the basic idea of TCFS is to associate the most up-to-date partial sums S_i with the i-th feature. In other words, when TCFS selects the next node (after some value a is chosen for the current feature x_i), h ⊕ (i, a) becomes the new h. Then, it identifies a


set of FCs (we call these appropriate FCs) that are required to compute all the partial sums S_i for each feature, and computes the corresponding entropy. This process can be viewed as updating each S_i for the corresponding i-th feature as if it were selected. The reason is that such computations are needed anyway if a feature is to be selected in the future along the current path. This essentially updates the S_i for each feature in a bottom-up fashion (following the flavor of dynamic programming). Note that avg(h′) is, in fact, computable by adding S_i(h′) to avg(h); here the S_i-s are partial sums to which only the currently appropriate FCs contribute. Detection of all appropriate FCs requires a scan over the FS. However, coefficients are split from the FS once they are used in a computation, since they are no longer needed for the calculation of higher-order schemata. Thus it takes far less time to compute higher-order schemata; note that this is just the opposite of what we encountered in the naive implementation. The algorithm stops growing a path when either the original FS becomes an empty set or the minimum confidence level is achieved. The depth of the resulting tree can also be set to a pre-determined bound. A pictorial description of the algorithm is shown in Figure 3.6. Pseudo-code of the algorithm is presented in Figures 3.4 and 3.5. TCFS uses the same criteria to construct a tree as C4.5. Both of them require a number of information-gain tests that grows exponentially with the depth of the tree; in that sense, the asymptotic running time of TCFS is the same as that of C4.5. However, while C4.5 uses the original data to compute information gains, TCFS uses a Fourier spectrum. Therefore, in practice, a comparison of the running times of the two approaches will depend on the sizes of the original data and of the Fourier spectrum. The following section presents an extension of TCFS for handling non-Boolean class labels.
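For reference, the naive evaluation of Equation 3.3 that TCFS improves upon can be sketched for the Boolean domain as follows (the sparse dict spectrum and the function name are assumptions of this sketch):

```python
def schema_average_from_spectrum(spectrum, h):
    """Naive avg(h): sum the coefficients of partitions whose non-zero
    positions all fall on the fixed (non-'*') positions of h, weighted by
    the Boolean basis psi_j(h) = (-1)^{sum_i j_i * h_i}."""
    total = 0.0
    for j, w in spectrum.items():
        # keep only partitions supported on the fixed positions of h
        if all(c == "0" or h[i] != "*" for i, c in enumerate(j)):
            parity = sum(1 for i, c in enumerate(j) if c == "1" and h[i] == "1")
            total += w * (-1) ** parity
    return total
```

For the spectrum {"00": 0.5, "10": -0.5} (the function f(x) = x_0 on two Boolean attributes), avg("1*") evaluates to 1.0 and avg("**") to 0.5, as expected. The scan over the whole spectrum at every node is exactly the redundancy TCFS removes by caching partial sums.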
Extension of TCFS to Multi-Class Decision Trees

The extension of the TCFS algorithm to multi-class problems is immediately possible by redefining the entropy function: it should be modified to capture the entropy of multiple class labels. For this, let us first define avg_c(h) to be a schema average function that uses the partial spectrum for the c-th class (see Section 3.2.2) only. Note that it computes the average occurrence of the c-th class label in h. Then the entropy of a schema is redefined as follows:

    entropy(h) = − Σ_{c=1}^{k} avg_c(h) log avg_c(h),

where k is the number of class labels. This expression can be directly used for computing the information gain needed to choose the decision nodes in a tree for classifying domains with non-Boolean class labels. In this section, we discussed a way to assign a confidence to a node in a decision tree, and considered a method to estimate information gain using it. Consequently, we showed that decision tree construction from the Fourier spectrum is possible. In particular, we devised the TCFS algorithm, which exploits the recursive and decomposable nature of the tree-building process in the spectrum domain, thus constructing a decision tree efficiently. In the following section, we will discuss empirical verification of the proposed Fourier spectrum-based aggregation approach.
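The redefined entropy is straightforward to compute from the per-class schema averages; a small sketch, assuming the averages avg_c(h) are supplied as a list of class frequencies (the function name is hypothetical, and log base 2 is an illustrative choice):

```python
import math

def multiclass_entropy(class_freqs):
    """entropy(h) = -sum_c avg_c(h) * log2(avg_c(h)), skipping zero terms."""
    return -sum(p * math.log2(p) for p in class_freqs if p > 0.0)
```

A pure node (one class frequency equal to one) has entropy zero, while a uniform distribution over k classes has entropy log2(k), matching the binary special case used earlier.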

Figure 3.6: Illustration of the Tree Construction from Fourier Spectrum (TCFS) algorithm. It shows the constructed tree on the left. The schemata evaluated at different orders are shown in the middle. The rightmost tree shows the splitting of the set of all Fourier coefficients, used for making the process of looking up the appropriate coefficients efficient.

3.2.4 Removing Redundancies from Ensembles

Existing ensemble-learning techniques work by combining (usually via a linear combination) the outputs of the base classifiers; they do not structurally combine the classifiers themselves. As a result, they often share many redundancies. The Fourier representation offers a unique way to fundamentally aggregate the trees and perform further analysis to construct an efficient representation. Let f(x) be the underlying function representing the ensemble of n different decision trees, where the output is a weighted linear combination of the outputs of the base classifiers. Then we can write

    f(x) = Σ_{k=1}^{n} a_k Σ_{j ∈ J_k} w_j^(k) ψ_j(x),

where a_k is the weight of the k-th decision tree and J_k is the set of all partitions with non-zero Fourier coefficients in its spectrum. Therefore, f(x) = Σ_{j ∈ J} w_j ψ_j(x), where w_j = Σ_k a_k w_j^(k) and J = ∪_k J_k. Therefore, the Fourier spectrum of f(x) (a linear ensemble classifier) is simply the weighted sum of the spectra of the member trees.


Consider the matrix D where D_{i,j} = f_j(x_i), i.e., the output of the j-th tree for input x_i. D is an N × n matrix, where N is the size of the input domain and n is the total number of trees in the ensemble. An ensemble classifier that combines the outputs of the base classifiers can be viewed as a function defined over the set of all rows of D. If D_j denotes the j-th column of D, then the ensemble classifier can be viewed as a function of D_1, D_2, ..., D_n. When the ensemble classifier is a linear combination of the outputs of the base classifiers, we have F = a_1 D_1 + a_2 D_2 + ... + a_n D_n, where F is the column matrix of the overall ensemble output. Since the base classifiers may have redundancy, we would like to construct a compact, low-dimensional representation of the matrix D. However, explicit construction and manipulation of this matrix is difficult, since most practical applications deal with a very large domain. We can try to construct an approximation of D using only the available training data. One such approximation and its Principal Component Analysis-based projection is reported elsewhere [190]. Their technique performs PCA of the approximated matrix, projects the data into the representation defined by the eigenvectors of its covariance matrix, and then performs linear regression for computing the weight coefficients. While the approach is interesting, it has serious limitations. First of all, the construction of an approximation of D even for the training data is computationally prohibitive for most large-scale data mining applications. Moreover, it is only an approximation, since the matrix is computed over the observed data set rather than the entire domain. In the following, we demonstrate a novel way to perform a PCA of the matrix containing the Fourier spectra of trees; the approach works without explicitly generating the matrix D. It is important to note that the PCA-based regression scheme [190] offers a way to find the weights for the members of the ensemble.
It does not offer any way to aggregate the tree structures and construct a new representation of the ensemble, which the current approach does. The following analysis will assume that the columns of the matrix D are mean-zero; this restriction can be easily removed with a simple extension of the analysis. Note that the covariance matrix of D is (1/N) DᵀD. The (i, j)-th entry of this matrix is

    C_{i,j} = (1/N) Σ_x f_i(x) f_j(x)
            = (1/N) Σ_x ( Σ_p w_p^(i) ψ_p(x) ) ( Σ_q w_q^(j) ψ_q(x) )
            = Σ_p Σ_q w_p^(i) w_q^(j) (1/N) Σ_x ψ_p(x) ψ_q(x)
            = Σ_p w_p^(i) w_p^(j).    (3.4)

The fourth step is true by Lemma 2. Now let us consider the matrix W where W_{p,i} = w_p^(i), i.e., the coefficient corresponding to the p-th member of the partition set, taken from the spectrum of the i-th tree. Equation 3.4 implies that the covariance matrices of D and W are identical. Note that W is a |P| × n dimensional matrix, where P is the set of partitions appearing in the ensemble's spectra; for most practical applications |P| ≪ N. Therefore, analyzing W using techniques like PCA is significantly easier. The following discourse outlines a PCA-based approach. PCA of the covariance matrix of W produces a set of eigenvectors. The eigenvalue decomposition constructs a new representation of the underlying domain. Note that since the eigenvectors are nothing but linear combinations of the original column vectors of W, each of them also forms a Fourier spectrum, and we can reconstruct a decision tree from this spectrum. Moreover, since they are orthogonal to


each other, the trees constructed from them also maintain the orthogonality condition. The following section defines orthogonal decision trees, which make use of these eigenvectors.

Orthogonal Decision Trees

The analysis presented in the previous sections offers a way to construct the Fourier spectra of a set of functions that are orthogonal to each other and therefore redundancy-free. These functions also define a basis and can be used to represent any given decision tree in the ensemble in the form of a linear combination. Orthogonal decision trees can be defined as an immediate extension of this framework. A pair of decision trees f_a and f_b are orthogonal to each other if and only if ⟨f_a, f_b⟩ = 0 when a ≠ b and ⟨f_a, f_a⟩ = 1 otherwise. The second condition is actually a slightly special case of orthogonal functions, the orthonormality condition. A set of trees is pairwise orthogonal if every possible pair of members of this set satisfies the orthogonality condition. The orthogonality condition guarantees that the representation is not redundant. These orthogonal trees form a basis set that spans the entire function space of the ensemble. The overall output of the ensemble is computed from the outputs of these orthogonal trees. The specific details of the ensemble-output computation depend on the technique adopted to compute the overall output of the original ensemble; however, for the most popular cases considered here, it boils down to computing the average output. If we choose to go for weighted averages, we may also compute the coefficient corresponding to each orthogonal tree by simply performing linear regression.
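The chain from the coefficient matrix W to mutually orthogonal spectra can be illustrated with a small dependency-free sketch. Power iteration on C = WᵀW (columns assumed mean-zero, non-degenerate starting vector assumed) stands in for the full eigendecomposition, and the orthogonality test relies on the fact, from Equation 3.4, that inner products of trees equal dot products of their spectra. All names here are hypothetical.

```python
def top_principal_spectrum(W, iters=200):
    """W: one row per partition, one column per base tree. Power iteration
    on C = W^T W yields the dominant eigenvector v; the first principal
    spectrum is the mixture W v of the column spectra."""
    n = len(W[0])
    C = [[sum(r[i] * r[j] for r in W) for j in range(n)] for i in range(n)]
    v = [1.0] * n  # assumes this start is not orthogonal to the top eigenvector
    for _ in range(iters):
        v = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return [sum(r[j] * v[j] for j in range(n)) for r in W]

def inner_product(s1, s2):
    """<f_a, f_b> computed directly on the (dense) spectra, per Equation 3.4."""
    return sum(a * b for a, b in zip(s1, s2))
```

With two identical column spectra, for instance, the first principal spectrum carries all of the variance, and any residual spectrum is orthogonal to it under inner_product.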

3.2.5 Experimental Results

This section reports the experimental performance of orthogonal decision trees on the following data sets: SPECT, NASDAQ, DNA, House of Votes, and Contraceptive Method Usage. For each data set, the following three experiments are performed using known classification techniques:

1. C4.5: The C4.5 classifier is built on training data and validated over test data.

2. Bagging: A popular ensemble classification technique, bagging, is used to test the classification accuracy on the data set.

3. Random Forest: Random forests are built on the training data, using approximately half the number of features in the original data set. The number of trees in the forest is identical to that used in the bagging experiment.5

We then perform another set of experiments for comparing the techniques described in the previous sections in terms of classification error and tree complexity.

5 We used the WEKA implementation (http://www.cs.waikato.ac.nz/ml/weka/) of Bagging and Random Forests.


1. Reconstructed Fourier Tree (RFT): The training set is uniformly sampled with replacement, and C4.5 trees are built on each sample. The Fourier representation of each individual tree is obtained, preserving a certain percentage (e.g., 90%) of the energy. This representation of a tree is used to reconstruct a decision tree using the TCFS algorithm described in Section 3.2.3. The performance of a reconstructed Fourier tree is compared with that of the original C4.5 tree, and the classification error and tree complexity of each of the reconstructed trees are reported. The purpose of this experiment is to study the effect of representing a tree by its Fourier spectrum and how much accuracy is lost in the entire cycle of summarizing a decision tree by its Fourier spectrum and then re-learning the tree from the spectrum.

2. Aggregated Fourier Tree (AFT): The training set is uniformly sampled with replacement, and C4.5 decision trees are built on each sample (this is identical to bagging). A Fourier representation of each tree is obtained (preserving a certain percentage of the total energy), and these are aggregated with uniform weighting to obtain the spectrum of an Aggregated Fourier Tree (AFT). The AFT is reconstructed using the TCFS algorithm described before, and the classification accuracy and tree complexity of this aggregated Fourier tree are reported.

3. Orthogonal Decision Trees: The matrix containing the Fourier coefficients of the decision trees is subjected to principal component analysis. Orthogonal trees are built corresponding to the principal components. In most cases it is found that the first principal component captures most of the variance, so the orthogonal decision tree constructed from this principal component is of particular interest; we report the classification error and tree complexity of the orthogonal decision tree obtained from the first principal component.
We also perform experiments where we keep the k6 most significant components. The trees are combined by weighting them according to coefficients obtained from least square regression7: each orthogonal decision tree is weighted using coefficients calculated from least square regression. For this, we allow all the orthogonal decision trees to individually produce their classifications on the test set, so that each ODT produces a column vector of its classification estimates. Since the class labels in the test set are already known, we use least square regression to obtain the weight to assign to each ODT. The accuracy of the orthogonal decision trees combined in this way is reported as ODT-LR (ODTs combined using Least Square Regression). In addition to reporting the classification error, we also report the tree complexity, i.e., the total number of nodes in a tree. Similarly, the term ensemble complexity reflects the total number of nodes over all the trees in the ensemble. A smaller ensemble complexity implies a more compact representation of an ensemble and is therefore desirable. Our experiments show that ODTs usually offer significantly reduced ensemble

6 We select the value of k in such a manner that the total variance captured is more than 90%. One could potentially use cross-validation to obtain a suitable value of k, as pointed out in [189], but this is beyond the current scope of the work and will be explored in the future.
7 Several other regression techniques, such as ridge regression and principal component regression, can also be tried. This is left as future work.


tree complexity without any reduction in accuracy. The following section presents the results for the SPECT data set.

SPECT Data set

This section illustrates the idea of orthogonal decision trees using a well-known binary data set. The data set, available from the University of California, Irvine Machine Learning Repository, describes the diagnosis of cardiac Single Photon Emission Computed Tomography (SPECT) images into two categories, normal or abnormal. The database of 267 SPECT image sets (patients) is processed to extract features that summarize the original SPECT images. As a result, 44 continuous feature patterns are obtained for each patient, which are further processed to obtain 22 binary feature patterns. The training data set consists of 80 instances and 22 attributes. All the features are binary, and the class label is also binary (depending on whether a patient is deemed normal or abnormal). The test data set consists of 187 instances and 22 attributes.

Method of classification           Error Percentage
C4.5                               24.5989%
Bagging                            20.85%
Random Forest                      22.99466%
Aggregated Fourier Tree (AFT)      19.78%
ODT from 1st PC                    8.02%
ODT-LR                             8.02%

Table 3.1: Classification error for SPECT data.

Method of classification                          Tree Complexity
C4.5                                              13
Bagging (average of 40 trees)                     5.06
Random Forest (average of 40 trees)               49.67
Aggregated Fourier Tree (AFT) (40 trees)          3
Orthogonal Decision Tree from 1st PC              17
Orthogonal Decision Trees (average of 15 trees)   4.3

Table 3.2: Tree complexity for SPECT data.

Table 3.1 shows the error percentage obtained with each of the different classification schemes. The root mean squared error for the 10-fold cross-validation in the C4.5 experiment is found to be 0.4803 and the standard deviation is 2.3862. For bagging, the number of trees in the ensemble is chosen to be forty; our experiments reveal that a further increase in the number of trees in the ensemble causes a decrease in the classification accuracy of the ensemble, possibly due to over-fitting of the data. For the experiments with random forests, a forest of 40 trees, each constructed while considering 12 random features, is built. The average out-of-bag error is reported to be 0.3245.


Figure 3.7: The accuracy and tree complexity of C4.5 and RFT for SPECT data.

Figure 3.7 (left) compares the accuracy of the original C4.5 ensemble with that of the Reconstructed Fourier Tree (RFT) ensemble preserving 90% of the energy of the spectrum. The results reveal that if all of the spectrum is preserved, the accuracies of the original C4.5 tree and the RFT are identical. When the higher-order Fourier coefficients are removed, this becomes equivalent to pruning a decision tree; this explains the higher accuracy of the reconstructed Fourier tree preserving 90% of the energy of the spectrum. Figure 3.7 (right) compares the tree complexity of the original C4.5 ensemble with that of the RFT ensemble. In order to construct the orthogonal decision trees, the coefficient matrix is projected onto the first fifteen most significant principal components. The most significant principal component captures 85.1048% of the variance, and the tree complexity of the ODT constructed from this component is 17, with an accuracy of 91.97%. Figure 3.8 shows the variance captured by all fifteen principal components. Table 3.2 illustrates the tree complexity for this data set. The orthogonal trees are found to be smaller in complexity, thus reducing the complexity of the ensemble.

NASDAQ Data set

The NASDAQ data set is a semi-synthetic data set with 1000 instances and 100 discrete attributes. The original data set has three years of NASDAQ stock-quote data. It is preprocessed and transformed to discrete data by encoding the percentage changes in stock quotes between consecutive days. For these experiments we assign 4 discrete values that denote the levels of change. The class labels predict whether the Yahoo stock is likely to increase or decrease based on the attribute values of the 99 other stocks. We randomly select 200 instances for training, and the remaining 800 instances form the test data set. Table 3.3 illustrates the classification accuracies of the different experiments performed



Figure 3.8: Percentage of variance captured by principal components for SPECT data.

Method of classification           Error Percentage
C4.5                               24.63%
Bagging                            32.75%
Random Forest                      25.75%
Aggregated Fourier Tree (AFT)      34.51%
ODT from 1st PC                    31.12%
ODT-LR                             31.12%

Table 3.3: Classification error for NASDAQ data.

Method of classification                          Tree Complexity
C4.5                                              29
Bagging (average of 60 trees)                     17
Random Forest (average of 60 trees)               45.71
Aggregated Fourier Tree (AFT) (60 trees)          15.2
Orthogonal Decision Tree from 1st PC              3
Orthogonal Decision Trees (average of 10 trees)   6.2

Table 3.4: Tree Complexity for NASDAQ data.

on this data set. The root mean squared error for the 10-fold cross-validation in the C4.5 experiment is found to be 0.4818 and the standard deviation is 2.2247. C4.5 has the best classification accuracy, though the tree it builds also has the highest tree complexity. For the bagging experiment, C4.5 trees are built on the data set such that the size of each bag (used to build a tree), as a percentage of the data set, is 40%. Also, a random forest of 60 trees, each constructed while considering 50 random features, is built on the


training data and tested with the test data set. The average out-of-bag error is reported to be 0.3165.

Figure 3.9: The accuracy and tree complexity of C4.5 and RFT for Nasdaq data

Figure 3.9 (Left) compares the accuracy of the original C4.5 ensemble with that of the Reconstructed Fourier Tree (RFT) ensemble preserving 90% of the energy of the spectrum. Figure 3.9 (Right) compares the tree complexity of the original C4.5 ensemble with that of the RFT ensemble. For the orthogonal trees, we project the data along the first 10 most significant principal components. Figure 3.10 illustrates the percentage of variance captured

Figure 3.10: Percentage of variance captured by principal components for Nasdaq Data.


by the ten most significant principal components. Table 3.4 presents the tree-complexity information for this set of experiments. Both the aggregated Fourier tree and the orthogonal trees performed better than the single C4.5 tree or bagging. The tree-complexity result appears to be quite interesting. While a single C4.5 tree had twenty-nine nodes, the orthogonal tree from the first principal component requires just three nodes, which is clearly a much more compact representation.

DNA Data Set

The DNA data set8 is a processed version of the corresponding data set available from the UC Irvine repository. The processed StatLog version replaces each symbolic attribute value representing a nucleotide (only A, C, T, G) by 3 binary indicator variables, so the original 60 symbolic attributes become 180 binary attributes. The data set has three class values 1, 2, and 3 corresponding to exon-intron boundaries (sometimes called acceptors), intron-exon boundaries (sometimes called donors), and the case when neither is true. We further process the data so that there are only two class labels: class 1 represents either donors or acceptors, while class 0 represents neither. The training set consists of 2000 instances and 180 attributes, of which 47.45% belong to class 1 while the remaining 52.55% belong to class 0. The test data set consists of 1186 instances and 180 attributes, of which 49.16% belong to class 0 while the remaining 50.84% belong to class 1. Table 3.5 reports the classification error. The root mean squared error for the 10-fold cross validation in the C4.5 experiment is found to be 0.2263 and the standard deviation is 0.6086. A Random Forest of 10 trees, each constructed by considering 8 random features, is built on the training data and tested with the test data set. The average out-of-bag error is reported to be 0.2196.
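The StatLog-style indicator expansion described above can be sketched as follows. The concrete 3-bit codes below are an assumption for illustration (the StatLog documentation defines the actual mapping); only the shape of the transformation, 60 symbols into 180 binary attributes, is taken from the text.

```python
# Illustrative 3-bit indicator codes for the nucleotides; the exact bit
# patterns are hypothetical, not the ones defined by StatLog.
CODES = {"A": (1, 0, 0), "C": (0, 1, 0), "G": (0, 0, 1), "T": (0, 0, 0)}

def encode(sequence):
    """Expand a 60-symbol DNA window into 180 binary indicator attributes."""
    bits = []
    for nucleotide in sequence:
        bits.extend(CODES[nucleotide])
    return bits

row = encode("ACGT" * 15)   # a 60-symbol sequence
print(len(row))             # → 180
```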
Method of classification                            Error Percentage
C4.5                                                6.4924%
Bagging                                             8.9376%
Random Forest                                       4.595275%
Aggregated Fourier Tree (AFT)                       8.347%
ODT from 1st PC                                     10.70%
ODT-LR                                              10.70%

Table 3.5: Classification error for DNA data.

It may be interesting to note that the first five eigenvectors are used in this experiment. Figure 3.11 shows the variance captured by these components. As before, the redundancy-free trees are combined by the weights obtained from Least Square Regression. Table 3.6 reports the tree complexity for this data set.
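The least-squares weighting used above to combine the redundancy-free trees can be sketched as solving for the weights that best reproduce the training labels from the individual tree outputs. The tiny matrix of tree outputs below is invented for illustration.

```python
import numpy as np

# Rows: training instances; columns: predictions of each redundancy-free tree.
tree_outputs = np.array([[1.0, 0.0, 1.0],
                         [1.0, 1.0, 0.0],
                         [0.0, 1.0, 1.0],
                         [0.0, 0.0, 1.0]])
labels = np.array([1.0, 1.0, 0.0, 0.0])

# Least Square Regression weights for the ensemble members.
weights, *_ = np.linalg.lstsq(tree_outputs, labels, rcond=None)
ensemble_score = tree_outputs @ weights        # weighted vote per instance
prediction = (ensemble_score > 0.5).astype(int)
print(prediction)                              # → [1 1 0 0]
```

Here the first tree alone reproduces the labels, so the regression assigns it all of the weight; with real ensembles the weights spread across the trees.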
8 Obtained from http://www.liacc.up.pt/ML/statlog/datasets/dna
Method of classification                            Tree Complexity
C4.5                                                131
Bagging (average of 10 trees)                       34
Random Forest (average of 10 trees)                 701.22
Aggregated Fourier Tree (AFT) (10 trees)            3
Orthogonal Decision Tree from 1st PC                25
Orthogonal Decision Trees (average of 5 trees)      7.4

Table 3.6: Tree Complexity for DNA data.



Figure 3.11: Percentage of variance captured by principal components for DNA Data.

Figure 3.12 (Left) compares the accuracy of the original C4.5 ensemble with that of the Reconstructed Fourier Tree (RFT) ensemble preserving 90% of the energy of the spectrum. Figure 3.12 (Right) compares the tree complexity of the original C4.5 ensemble with that of the RFT ensemble.

House of Votes Data Set

The 1984 United States Congressional Voting Records Database is obtained from the University of California Machine Learning Repository. This data set includes votes of each U.S. House of Representatives Congressman on the 16 key votes identified by the CQA, including water project cost sharing, adoption of the budget resolution, MX missile, immigration, etc. It has 435 instances, 16 boolean-valued attributes, and a binary class label (democrat or republican). Our experiments use the first 335 instances for training and the remaining 100 instances for testing; missing values in the data are replaced by one. The results of classification are shown in Table 3.7, while the tree complexity is shown in Table 3.8. The root mean squared error for the 10-fold cross validation in


the C4.5 experiment is found to be 0.2634 and the standard deviation is 0.3862. For Bagging, fifteen trees are constructed, since this produced the best classification results; the size of each bag is 20% of the training data set. A Random Forest of fifteen trees, each constructed by considering 8 random features, produces an average out-of-bag error of 0.05502. The accuracy of classification and the tree complexity of the original C4.5 and RFT ensembles are illustrated in the left and right hand sides of Figure 3.13, respectively. For the orthogonal trees, the coefficient matrix is projected onto the first five most significant principal components. Figure 3.14 (Left) illustrates the amount of variance captured by each of the principal components.

Figure 3.12: The accuracy and tree complexity of C4.5 and RFT for DNA data

Method of classification                            Error Percentage
C4.5                                                8.0%
Bagging                                             11.0%
Random Forest                                       5.6%
Aggregated Fourier Tree (AFT)                       11%
ODT from 1st PC                                     11%
ODT-LR                                              11%

Table 3.7: Classification error for House of Votes data.

Contraceptive Method Usage Data Set

This data set is obtained from the University of California Irvine Machine Learning Repository and is a subset of the 1987 National Indonesia Contraceptive Prevalence


Figure 3.13: The accuracy and tree complexity of C4.5 and RFT for House of Votes data

Survey. The samples are married women who are either not pregnant or do not know if they are at the time of the interview. The problem is to predict the current contraceptive method choice of a woman based on her demographic and socio-economic characteristics. There are 1473 instances and 10 attributes, including a binary class label. All attributes are processed so that they are binary. Our experiments use 1320 instances for the training set while the rest form the test data set. The results of classification are tabulated in Table 3.9, while Table 3.10 shows the tree complexity. The root mean squared error for the 10-fold cross validation in the C4.5 experiment is found to be 0.5111 and the standard deviation is 1.8943. A Random Forest built with 10 trees, considering 5 random features, produces an average classification error of about 45.88% and an average out-of-bag error of 0.42556. Figure 3.15 (Left) compares the accuracy of the original C4.5 ensemble with that of the Reconstructed Fourier Tree (RFT) ensemble preserving 90% of the energy of the spectrum. Figure 3.15 (Right) compares the tree complexity of the original C4.5 ensemble with that of the RFT ensemble. For ODTs, the data is projected along the first ten principal components. Figure 3.14 (Right) shows the amount of variance captured by each principal component. It is interesting to note that the first principal component captures only about 61.85% of the variance, and thus the corresponding ODT generated from the first principal component has a relatively high tree complexity.

Method of classification                            Tree Complexity
C4.5                                                9
Bagging (average of 15 trees)                       5.266
Random Forest (average of 15 trees)                 37.42
Aggregated Fourier Tree (AFT) (15 trees)            5
Orthogonal Decision Tree from 1st PC                5
Orthogonal Decision Trees (average of 5 trees)      3

Table 3.8: Tree Complexity for House of Votes data.

Figure 3.14: Percentage of variance captured by principal components for (Left) House of Votes Data and (Right) Contraceptive Method Usage data.

3.3 DDM on Data Streams


3.3.1 Introduction
Several challenging new applications demand the ability to do data mining on resource-constrained devices. One such application is monitoring physiological data streams obtained from wearable sensing devices. Such monitoring has applications

Method of classification                            Error Percentage
C4.5                                                49.6732%
Bagging                                             52.2876%
Random Forest                                       45.88234%
Aggregated Fourier Tree (AFT)                       33.98%
ODT from 1st PC                                     46.40%
ODT-LR                                              46.40%

Table 3.9: Classification error for Contraceptive Method Usage Data.

Method of classification                            Tree Complexity
C4.5                                                27
Bagging (average of 10 trees)                       24.8
Random Forest (average of 10 trees)                 298.11
Aggregated Fourier Tree (AFT) (10 trees)            55
Orthogonal Decision Tree from 1st PC                15
Orthogonal Decision Trees (average of 10 trees)     6.6

Table 3.10: Tree Complexity for Contraceptive Method Usage Data.



Figure 3.15: The accuracy and tree complexity of C4.5 and RFT for Contraceptive Method Usage data

for pervasive healthcare management, be it for seniors, emergency response personnel, soldiers in the battlefield, or athletes. A key requirement is that the monitoring system be able to run on resource-constrained handheld or wearable devices. Orthogonal decision trees (ODTs) (introduced in Section 2.3.2) offer an effective way to construct a redundancy-free, accurate, and meaningful representation of the large decision-tree ensembles often created by popular techniques such as Bagging, Boosting, Random Forests, and many distributed and data stream mining algorithms. This section discusses various properties of ODTs and their suitability for monitoring physiological data streams in a resource-constrained environment. It offers experimental results to document the performance of orthogonal trees on grounds of accuracy, model complexity, and other characteristics in a resource-constrained mobile environment. In closing, we argue that this application will have significant benefits if integrated with a grid infrastructure.

Physiological Data Stream Monitoring

We draw two scenarios to illustrate the potential uses of physiological data stream monitoring. Both cases involve a situation where a potentially complex decision space has to be examined, and yet the resources available on the devices that will run the decision process are not sufficient to maintain and use ensembles.

Consider a real-time environment to monitor the health effects of environmental toxins or disease pathogens on humans. Significant advances are being made today in biochemical engineering to create extremely low-cost sensors for various toxins [162] that could constantly monitor the environment and generate data streams over wireless networks. It is not unreasonable to assume that similar sensors could be developed to detect disease-causing pathogens. In addition, most state health/environmental agencies and federal government entities such as the CDC and EPA have mobile labs and response units that can test for the presence of pathogens or dangerous chemicals. The mobile units will have handheld devices with wireless connections on which to send the data and/or their analysis. In addition, each hospital today generates reports on admissions and discharges, and often reports them to various monitoring agencies.
Given these disparate data streams, one could analyze them to see if correlates can be found, alerting experts to potential cause-effect relations (Pfiesteria found in Chesapeake Bay and hospitals report many people with upset stomach who had seafood recently), potential epidemiological events (field units report dead infected birds and elderly patients check in with viral fever symptoms, indicating tests needed for West Nile virus and preventive spraying), and, more pertinent in present times, low-grade chemical and biological attacks (sensors detect particular toxins, mobile units find contaminated sites, hospitals show people who work at or near the sites being admitted with unexplained symptoms). At present, much of this analysis is done post facto: experts hypothesize on possible causes of ailments, then gather the data from disparate sources to confirm their hypotheses. Clearly, a more proactive environment that could mine these diverse data streams to detect emergent patterns would be extremely useful. This scenario, of course, has some futuristic elements. On a more present-day note, there are now several wearable sensors on the market, such as the SenseWear armband from BodyMedia [30], Wearable West [272], and the LifeShirt Garment from Vivometrics [266], that can be used to monitor vital signs such as temperature, heart rate, and heat flux.

Figure 3.16: The BodyMedia SenseWear armband and the Vivometrics LifeShirt Garment

Figure 3.16 (left) shows the SenseWear armband that was used to collect the data. The sensors in this band were capable of measuring the following:

1. Heat flux: the amount of heat dissipated by the body.
2. Accelerometer: motion of the body.
3. Galvanic skin response: electrical conductivity between two points on the wearer's arm.
4. Skin temperature: temperature of the skin, generally reflective of the body's core temperature.
5. Near-body temperature: air temperature immediately around the wearer's armband.

The subjects were expected to wear the armband as they went about their daily routine, and were required to timestamp the beginning and end of an activity. For example, before starting a jog, they could press the timestamp button, and when finished, they could press the button again to record the end of the activity. This body monitoring device can be worn continuously, and can store up to 5 days of physiological data before it has to be retrieved. The LifeShirt Garment is another example of an easy-to-wear shirt that allows measurement of pulmonary functions via sensors woven into the shirt. Figure 3.16 (right) shows the heart monitor. Subjects are capable of recording symptoms, moods, activities, and several other physiological characteristics.
9 The figures are obtained from http://www.cs.utexas.edu/users/sherstov/pdmc/ and http://www.vivometrics.com

Analyzing these vital signs in real time using small form factor wearable computers has several valuable near-term applications. For instance, one could monitor senior citizens living in assisted or independent housing, to alert physicians and support personnel if the signs point to distress. Similarly, one could monitor athletes during games or practice. Given the recent high-profile deaths of athletes at both the professional and high school levels during practice, the importance of such an application is fairly apparent. Other potential applications include battlefield monitoring of soldiers, or monitoring first responders such as firefighters.

3.3.2 Experimental Results


In order to perform online monitoring of physiological data using wearable or handheld (PDAs, cellphones) devices, data streams are sent to them from sensors using short-range wireless networks such as PANs. Precomputed (based on training data obtained previously) orthogonal decision trees and bagging ensembles are kept on these devices. The data streams are classified using these precomputed models, which are updated on a periodic basis. It must be noted that while the monitoring is in real time, the model computation is done off-line using stored data. This section documents the performance of orthogonal decision trees on a physiological data set. It makes use of a publicly available data set in order to offer benchmarked results. This dataset10 was obtained from the Physiological Data Modeling Contest11 held as part of the International Conference on Machine Learning, 2004. It comprises several months of data from more than a dozen subjects and was collected using BodyMedia12 wearable body monitors. In our experiments, the training set consisted of 50,000 instances and 11 continuous and discrete valued attributes13. The test set had 32,673 instances. The continuous valued attributes were discretized using the WEKA software14. The final training and test data sets had all discrete valued attributes. A binary classification problem was formulated, which monitored whether an individual was engaged in a particular activity (class label = 1) or not (class label = 0) depending on the physiological sensor readings. C4.5 decision trees were built on data blocks of 150 instances, and the classification accuracy and tree complexity were noted. These were then used to compute their Fourier spectra, and the matrix of Fourier coefficients was subjected to principal component analysis. Orthogonal trees were built corresponding to the significant components, and they were combined using a uniform aggregation scheme.
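The spectrum-PCA step described above can be sketched schematically as follows, assuming each tree's Fourier spectrum has already been flattened into one coefficient vector (the function name and the random stand-in data are illustrative):

```python
import numpy as np

def principal_spectra(coeff_matrix, n_components):
    """PCA of an ensemble's Fourier-coefficient matrix: rows are trees,
    columns are Fourier coefficients. Returns the projection of each tree
    onto the leading principal components and their variance ratios."""
    X = coeff_matrix - coeff_matrix.mean(axis=0)       # center the columns
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # PCA via SVD
    var_ratio = s ** 2 / np.sum(s ** 2)
    return X @ Vt[:n_components].T, var_ratio[:n_components]

rng = np.random.default_rng(0)
# Stand-in for the spectra of 20 trees, each described by 64 coefficients.
spectra = rng.normal(size=(20, 64))
proj, var = principal_spectra(spectra, n_components=3)
print(proj.shape, var.shape)
```

Each retained component is then inverted back into a single orthogonal decision tree, and the resulting trees are combined by uniform aggregation as stated above.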
The accuracy and size of the orthogonal trees are noted and compared with the corresponding results generated by Bagging using the same number of decision trees in the ensemble. Figure 3.17 illustrates four decision trees built on the uniformly sampled training data set (each sample of size 150). The first decision tree has a complexity of 7 and considers
10 Obtained from http://www.cs.utexas.edu/users/sherstov/pdmc/
11 http://www.cs.utexas.edu/users/sherstov/pdmc/
12 http://www.bodymedia.com/index.jsp
13 The attributes used for the classification experiments were gender, galvanic skin temperature, heat flux, near body temperature, pedometer, skin temperature, readings from the longitudinal and transverse accelerometers, and the time for recording an activity, called session time
14 http://www.cs.waikato.ac.nz/ml/weka/

Figure 3.17: Decision Trees built from four different samples of the physiological data set

Figure 3.18: An Orthogonal Decision Tree

the transverse accelerometer reading, session time, and near-body temperature attributes as ideal for splits. Before pruning, only two instances are misclassified, giving an error of 1.3%. After pruning, there is no change in the structure of the tree. The estimated error percentage is 4.9%. The second, third, and fourth decision trees have complexities 5, 7, and 3 respectively. An illustration of an orthogonal decision tree obtained from the first principal component is shown in Figure 3.18. Figure 3.19 illustrates the distribution of tree complexity and classification error for the original C4.5 trees used to construct an ODT ensemble. The total number of nodes in the original C4.5 trees varied between three and thirteen. The trees had



Figure 3.19: Histogram of tree complexity (left) and error (right) in classification for the original C4.5 trees.

Figure 3.20: Histogram of error in classification in the ODT ensemble.

an error of less than 25%. In comparison, the average complexity of the orthogonal decision trees was found to be 3 for all the different ensemble sizes. In fact, for this particular dataset, the sensor reading corresponding to the transverse accelerometer attribute was found to be the most interesting: all the orthogonal decision trees used this attribute as the root node. Figure 3.20 illustrates the distribution of classification error for an ODT ensemble of 75 trees.


Figure 3.21: Comparison of error in classification for trees in the ensemble for aggregated ODT versus Bagging.

Figure 3.22: Plot of Tree Complexity Ratio versus number of trees in the ensemble.

We compared the accuracy obtained from an aggregated orthogonal decision tree to that obtained from a bagging ensemble (using the same number of trees in each case). Figure 3.21 plots the classification error of the aggregated ODT and bagging versus the number of decision trees in the ensemble. We found that the classification from an aggregated orthogonal decision tree was better than bagging when the number of trees in the ensemble was smaller. As the number of trees in the ensemble increases, bagging provides a slightly better accuracy. It must be noted, however, that in constrained environments such as pocket PCs, personal digital assistants, and sensor network settings, arbitrarily increasing the number of trees in the ensemble may not be feasible due to memory constraints.


Figure 3.23: Variance captured by the first principal component versus number of trees in ensemble.

In resource-constrained environments it is often necessary to keep track of the amount of memory used to store the ensemble. In the current implementation, storing a node data structure in a tree requires approximately 1 KB of memory. Consider an ensemble of 20 trees. If the average number of nodes per tree in the ensemble is 7, then we are required to store 140 KB of data. Orthogonal decision trees, on the other hand, are smaller in size, with less redundancy. In the experiments we performed they typically have a complexity of 3 nodes, which means that we need to store only 3 KB of data. We define the Tree Complexity Ratio (TCR) as the total number of nodes in the ODT versus the total number of nodes in the bagging ensemble. Figure 3.22 plots the variation of the TCR as the number of trees in the ensemble increases. It may be noted that in resource-constrained environments one can opt for meaningful trees of smaller size and comparable accuracy as opposed to larger ensembles with a slightly better accuracy. An orthogonal decision tree also helps in the feature selection process and indicates which attributes are more important than others in the data set. Figure 3.23 indicates the variance captured by the first principal component as the number of trees in the ensemble is varied from 5 to 75. As expected, as the number of trees in the ensemble increases, the first principal component captures most of the variance, and the amounts captured by the second and third components gradually decrease. The following section illustrates the response time for classification on a pocket PC using a bagging ensemble and an equivalent orthogonal decision tree ensemble.
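The memory arithmetic and TCR definition above can be written out directly (node size and tree counts follow the example in the text; the function names are illustrative):

```python
NODE_BYTES = 1024  # ~1 KB per stored tree node, as in the implementation above

def ensemble_memory_kb(tree_sizes):
    """Memory needed to store an ensemble, given the node count of each tree."""
    return sum(tree_sizes) * NODE_BYTES / 1024

def tree_complexity_ratio(odt_sizes, bagging_sizes):
    """TCR: total ODT nodes divided by total bagging-ensemble nodes."""
    return sum(odt_sizes) / sum(bagging_sizes)

bagging = [7] * 20   # 20 bagged trees, 7 nodes each
odt = [3]            # a single 3-node orthogonal decision tree
print(ensemble_memory_kb(bagging),        # → 140.0
      ensemble_memory_kb(odt),            # → 3.0
      tree_complexity_ratio(odt, bagging))
```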


Figure 3.24: Plot of Response time for Bagging and equivalent ODT ensemble versus the number of trees in the ensemble.

3.3.3 Monitoring in Resource Constrained Environments


Resource-constrained environments such as personal digital assistants, pocket PCs, and cell phones are often used to monitor the physiological conditions of subjects. These devices present additional challenges in monitoring owing to their limited battery power, memory restrictions, and small displays. The previous section indicated that an aggregated orthogonal decision tree is small in size and captures an accuracy better than or comparable to that of bagging when the ensemble size is small. Although bagging was found to perform better in larger ensembles, the number of trees that needed to be stored was considerably larger, and is clearly not an option in resource-constrained environments. Therefore a tradeoff exists between memory usage and accuracy. In order to test the response time for monitoring, we performed classification experiments on an HP iPAQ Pocket PC. We assumed that physiological data blocks of 40 instances were sent to the handheld device. Using training data obtained previously, we precomputed C4.5 decision trees. The Fourier spectra of the trees were evaluated (preserving approximately 99% of the total energy) and the coefficient matrix was projected onto the most significant principal components. Since the time required for computation is of considerable importance in resource-constrained environments, we estimated the response time for the Bagging ensemble versus the equivalent ODT ensemble. We define response time as the time required to produce an accuracy estimate from all the instances available by the specified classification scheme. Figure 3.24 illustrates the response time for a bagging ensemble and an equivalent ODT ensemble. Clearly, the equivalent orthogonal decision tree produces classification results faster than a bagging ensemble; this may be attributed to the fact that much of the redundancy in the bagging ensemble has been removed in the ODT ensemble.
Our method thus offers a computationally efficient approach to classification on resource-constrained devices.
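A toy harness in the spirit of the response-time measurement above can be sketched as follows. The stump "trees", thresholds, and block contents are invented for illustration; the actual experiments used C4.5 trees on an iPAQ, not these stand-ins.

```python
import time

def make_stump(attr, threshold):
    """A toy decision stump standing in for a decision tree in the ensemble."""
    return lambda x: int(x[attr] > threshold)

def response_time(ensemble, block):
    """Time needed to classify a whole block of instances by majority vote,
    mirroring the definition of response time used in the text."""
    start = time.perf_counter()
    for x in block:
        votes = sum(tree(x) for tree in ensemble)
        _ = int(2 * votes > len(ensemble))    # majority vote
    return time.perf_counter() - start

block = [(i % 5, (i * 7) % 3) for i in range(40)]    # a 40-instance data block
bagging = [make_stump(0, t % 4) for t in range(20)]  # 20-tree bagging ensemble
odt = [make_stump(0, 2)]                             # smaller equivalent ODT ensemble
print(response_time(bagging, block), response_time(odt, block))
```

Since the per-block cost grows with the total node count of the ensemble, the smaller ODT ensemble answers faster, which is the effect reported in Figure 3.24.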


3.3.4 Grid-Based Physiological Data Stream Monitoring - A Dream or Reality?


In the previous sections, we described an application of a distributed data mining technique (Orthogonal Decision Tree ensembles) for monitoring physiological data streams in time-critical, resource-constrained environments. It was shown that ODTs offer an effective way to construct redundancy-free ensembles that are easier to understand and apply. They are particularly useful in monitoring data streams using resource-constrained platforms where storage and CPU computing power are limited but fast response is important. ODTs are constructed from the Fourier spectra of the decision trees in the ensemble. Redundancy is removed from the ensemble by performing a PCA of these Fourier spectra. This offers an efficient representation of the ensemble, often needed for fast response in many real-time data mining applications, and also allows a meaningful way to visualize the trees in a low-dimensional space. There is a rising trend toward analysis of physiological data obtained from wearable devices and sensors (such as by the Biomedical Informatics Group at Nottingham University15 and the Computer Assisted Reporting16 of Electrocardiograms (ECGs)). These applications need to facilitate the management of experimental scenarios for health monitoring and provide easy access to historical or real-time streaming data for analysis. Furthermore, the data is inherently distributed, since users of the medical devices and data miners are typically not located at the same place. In some cases it becomes necessary to make the data available through grid facilities for remote usage by doctors and other medical professionals. It is also important to allow users of the grid to directly interact with remote devices (e.g. vests or jackets having physiological data monitoring sensors, armbands recording galvanic skin temperature). We have shown that it is possible to do sophisticated distributed data analysis on resource-constrained devices.
However, porting such devices to the grid is an active area of research. The only related work known to us at this time has been done as part of the Equator project at the University of Nottingham17. Even this work makes assumptions, such as that the monitoring devices are available only through grid services, and thus provides a simulated environment. While it serves as an interesting proof of concept and motivates further research, it is apparent that physiological data stream monitoring on the grid still has a long road to go. In the following sections we examine the feasibility of doing distributed data mining on federated astronomy catalogs.

3.4 DDM on Federated Databases


3.4.1 The National Virtual Observatory
There are several instances in the astronomy and space sciences research communities where data mining is being applied to large data collections [76, 196]. Some dedicated
15 http://www.eee.nott.ac.uk/medical/
16 http://www.gla.ac.uk/care/
17 http://www.equator.ac.uk/index.php/articles/c70/


data mining projects include F-MASS [98], Class-X [67], the Auton Astrostatistics Project [16], and additional VO-related data mining activities (such as SDMIV [241]). In essentially none of these cases does the project involve truly distributed data mining (DDM) [187]. Through a past NASA-funded project, K. Borne applied some very basic DDM concepts to astronomical data mining [33]. However, the primary accomplishments focused only on centralized co-location of the data sources [32, 31]. One of the first large-scale attempts at grid data mining for astronomy is the U.S. National Science Foundation (NSF) funded GRIST [116] project. The GRIST goals include the application of grid computing and web services (service-oriented architectures) to mining large distributed data collections. GRIST is focused on one particular data modality: images. Hence, GRIST aims to deliver mining on the pixel planes within multiple distributed astronomical image collections. The project that we are proposing here is aimed at another data modality: catalogs (tables) of astronomical source attributes. GRIST and other projects also strive for exact results, which usually requires data centralization and co-location, which in turn requires significant computational and communications resources. DEMAC (our system) will produce approximate results without requiring data centralization (low communication overhead). Users can quickly get (generally quite accurate) results for their distributed queries at low communication cost. Armed with these results, users can focus in on a specific query or portion of the datasets and download the data for more intricate analysis. The U.S. National Virtual Observatory (NVO) [203] is a large-scale effort funded by the NSF to develop an information technology infrastructure enabling easy and robust access to distributed astronomical archives. It will provide services for users to search and gather data across multiple archives, as well as some basic statistical analysis and visualization functions.
It will also provide a framework for new services to be made available by outside parties. These services can provide, among other things, specialized data analysis capabilities. As such, we envision DEMAC fitting nicely into the NVO as a new service.

The Virtual Observatory can be seen as part of an ongoing trend toward the integration of information sources. The main paradigm used today for the integration of these data systems is that of a data grid [91, 115, 140, 225, 92, 263, 195, 28]. Among the desired functionalities of a data grid, data analysis takes a central place. As such, there are several projects [84, 159, 112, 116, 125, 129] which in the last few years have attempted to create a data mining grid. In addition, grid data mining has been the focus of several recent workshops [158, 85].

DDM is a relatively new technology that has been enjoying considerable interest in the recent past [214, 156]. DDM algorithms strive to analyze the data in a distributed manner without downloading all of it to a single site (which is usually necessary for a regular, centralized data mining system). DDM algorithms naturally fall into two categories according to whether the data is distributed horizontally (with each site having some of the tuples) or vertically (with each site having some of the attributes for all tuples). In the latter case, it is assumed that the sites have an associated unique id used for matching. In other words, consider a tuple x and assume site A has a part of this tuple, x_A, and site B has the remaining part, x_B. Then, the id associated with x_A equals the id associated with x_B.18 The NVO can be seen as a case of vertically distributed data, assuming ids have been generated by a cross-matching service. With this assumption, DDM algorithms for vertically partitioned data can be applied. These include algorithms for principal component analysis (PCA) [157, 155], clustering [145, 157], Bayesian network learning [62, 63], and supervised classification [49, 108, 123, 215, 262].
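The two-site id-matching convention above can be sketched as follows (a minimal illustration; the ids, column names, and values are hypothetical):

```python
# Sketch: vertical partitioning of a relation across two sites.
# Each site holds some attributes of every tuple; tuples are linked
# across sites by a shared unique id. Ids and columns are illustrative.
site_a = {101: {"ra": 150.1, "dec": 2.3},   # site A holds positional attributes
          102: {"ra": 151.7, "dec": 3.9}}
site_b = {101: {"redshift": 0.12},          # site B holds the remaining attributes
          102: {"redshift": 0.34}}

def reassemble(tid):
    """Join the two partial tuples that share the same id."""
    return {**site_a[tid], **site_b[tid]}

print(reassemble(101))  # {'ra': 150.1, 'dec': 2.3, 'redshift': 0.12}
```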

3.4.2 Data Analysis Problem: Analyzing Distributed Virtual Catalogs


We illustrate the problem with two archives: the Sloan Digital Sky Survey (SDSS) [242] and the 2-Micron All-Sky Survey (2MASS) [3]. Each of these has a simplified catalog containing records for a large number of astronomical point sources: upward of 100 million for SDSS and 470 million for 2MASS. Each record contains sky coordinates (ra, dec) identifying the source's position in the celestial sphere, as well as many other attributes (460+ for SDSS; 420+ for 2MASS). While each of these catalogs individually provides valuable data for scientific exploration, together their value increases significantly. In particular, efficient analysis of the virtual catalog formed by joining these catalogs would enhance their scientific value significantly. Henceforth, we use "virtual catalog" and "virtual table" interchangeably.

To form the virtual catalog, records in each catalog must first be matched based on their position in the celestial sphere. Consider record r from SDSS and record s from 2MASS with respective sky coordinates (ra_r, dec_r) and (ra_s, dec_s). Each record represents a set of observations about an astronomical object, e.g. a galaxy. The sky coordinates are used to determine if r and s match, i.e. are close enough that r and s represent the same astronomical object. The issue of how matching is done will be discussed later. For each match (r, s), the result is a record in the virtual catalog with all of the attributes of r and s. As described earlier, the virtual catalog provides valuable data that neither SDSS nor 2MASS alone can provide.

DEMAC addresses the data analysis problem of developing communication-efficient algorithms for analyzing user-defined subsets of virtual catalogs. The algorithms allow the user to specify a region R in the sky and a virtual catalog, then efficiently analyze the subset of tuples from that catalog with sky coordinates in R. Importantly, the algorithms we propose do not require that the base catalogs first be centralized and the virtual catalog explicitly realized.
Moreover, the algorithms are not intended to be a substitute for the exact, centralization-based methods currently being developed as part of the NVO. Rather, they are intended to complement these methods by providing quick, communication-efficient, approximate results to allow browsing. Such browsing will allow the user to better focus their exact, communication-expensive queries.

Example 3 The All-Sky data release of 2MASS contains the attribute K band mean surface brightness (Kmsb). Data release four of SDSS contains the galaxy attributes redshift (rs), Petrosian I band angular effective radius (Iaer), and velocity dispersion (vd).
18 Each id is unique to the site at which it resides; no two tuples at the same site have the same id. But ids can match across sites; a tuple at site A can have the same id as a tuple at site B.

To produce a physical variable, consider the composite attribute Petrosian I band effective radius (Ier), formed by the product of Iaer and rs. Note, since Iaer and rs are both at the same repository (SDSS), then, from the standpoint of distributed computation, we may assume Ier is contained in SDSS. A principal component analysis over a region of sky on the virtual table with columns log(Ier), log(vd), and Kmsb is interesting in that it can allow the identification of a fundamental plane (the logarithms are used to place all variables on the same scale). Indeed, if the first two principal components capture most of the variance, then these two components define a fundamental plane. The existence of such planes points to interesting astrophysical behaviors. We develop a communication-efficient distributed algorithm for approximating the principal components of a virtual table.
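The fundamental-plane test just described (run PCA and check whether the first two components capture most of the variance) can be sketched on synthetic stand-ins for log(Ier), log(vd), and Kmsb; the data below is illustrative only, not real survey data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Two free directions plus a near-linear dependence, so the points
# lie close to a plane (synthetic stand-ins for the three attributes).
u, v = rng.standard_normal(n), rng.standard_normal(n)
X = np.column_stack([u, v, 0.7 * u - 0.5 * v + 0.05 * rng.standard_normal(n)])

# Normalize each column to zero mean and unit sample variance.
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigenvalues of the sample covariance matrix, largest first.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
captured = eigvals[:2].sum() / eigvals.sum()  # proportion of variance in first 2 PCs
print(f"variance captured by first two PCs: {captured:.3f}")  # near 1 => a plane
```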

3.4.3 The DEMAC system


This section describes the high-level design of the proposed DEMAC system. DEMAC is designed as an additional web service which seamlessly integrates into the NVO. It consists of two basic services. The main one is a web service providing DDM capabilities for vertically distributed sky surveys (WS-DDM). The second one, which is used intensively by WS-DDM, is a web service providing cross-matching capabilities for vertically distributed sky surveys (WS-CM). Cross-matching of sky surveys is a complex topic which is dealt with, in itself, under other NASA-funded projects. Thus, our implementation of this web service would supply the bare minimum capabilities required in order to provide distributed data mining capabilities.

To provide a distributed data mining service, DEMAC would rely on other services of the NVO, such as the ability to select and download from a sky survey in an SQL-like fashion. Key to our approach is that these services be used not over the web, through the NVO, but rather by local agents which are co-located with the respective sky survey. In this way, the DDM service avoids bandwidth and storage bottlenecks, and overcomes restrictions which are due to data ownership concerns. Agents, in turn, take part in executing distributed data mining algorithms which are highly communication-efficient. It is the outcome of the data mining algorithm, rather than the selected data table, that is provided to the end user. With the removal of the network bandwidth bottleneck, the main factor limiting the scalability of the distributed data mining service would be database access. For database access we intend to rely on the SQL-like interface provided by the different sky surveys to the NVO. We outline here the architecture we propose for the two web services we will develop.

3.4.4 WS-DDM: DDM for Heterogeneously Distributed Sky-Surveys


This web service will allow running a DDM algorithm (three will be discussed later) on a selection of sky surveys. The user would use existing NVO services to locate sky surveys and define the portion of the sky to be data mined. The user would then use WS-CM to select a cross-matching scheme for those sky surveys. This specifies how the tuples are matched across surveys to define the virtual table to be analyzed. Following these two preliminary phases, the user would submit the data mining task.

Execution of the data mining task would be scheduled according to resource availability. Specifically, the size of the virtual table selected by the user would dictate scheduling. Having allocated the required resources, the data mining algorithm would be carried out by agents which are co-located with the selected sky surveys. Those agents will access each sky survey through the SQL-like interface it exposes to the NVO and will communicate with each other directly, over the Internet. When the algorithm has terminated, results would be provided to the user through a web interface.

3.4.5 WS-CM: Cross-Matching for Heterogeneously Distributed Sky-Surveys

Central to the DDM algorithms we develop is that the virtual table can be treated as vertically partitioned (see Section ?? for the definition). To achieve this, match indices are created and co-located with each sky survey. Specifically, for each pair of surveys (tables) A and B, a distinct pair of match indices must be kept, one at each survey. Each index is a list of pointers, and both indices have the same number of entries. The i-th entry in list I_A points to a tuple a in A and the i-th entry in list I_B points to a tuple b in B such that a and b match. Tuples in A and B which do not have a match do not have a corresponding entry in either index. Clearly, algorithms assuming a vertically partitioned virtual table can be implemented on top of these indices.

Creating these indices is not an easy job. Indeed, cross-matching sources is a complex problem for which no single best solution exists. The WS-CM web service is not intended to address this problem. Instead it will use already existing solutions (e.g., the cross-matching service already provided by the NVO), and it will be designed to allow other solutions to be plugged in easily. Moreover, cross-matching the entirety of two large surveys is a very time-consuming job and would require centralizing (at least) the coordinates of all tuples from both. Importantly, the indices do not need to be created each time a data mining task is run. Instead, provided the sky survey data are static (they generally are), each pair of indices need only be created once. Then any data mining task, in particular the DDM tasks we develop, can use them. The net result is the ability to mine virtual tables at low communication cost.
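A minimal sketch of such a pair of match indices (a brute-force positional cross-match with a hypothetical one-arcsecond tolerance; a real service would use a proper sky-indexing scheme and a rigorous spherical separation formula):

```python
import math

def ang_sep_deg(p, q):
    """Approximate angular separation in degrees, valid for small separations."""
    dra = (p[0] - q[0]) * math.cos(math.radians((p[1] + q[1]) / 2))
    return math.hypot(dra, p[1] - q[1])

def build_match_indices(coords_a, coords_b, tol_deg=1.0 / 3600):
    """Return the pair of pointer lists (I_A kept at survey A, I_B at survey B).
    The i-th entries point at a matching pair of tuples; unmatched tuples
    get no entry in either list. Brute force, for illustration only."""
    idx_a, idx_b = [], []
    used_b = set()
    for i, p in enumerate(coords_a):
        for j, q in enumerate(coords_b):
            if j not in used_b and ang_sep_deg(p, q) <= tol_deg:
                idx_a.append(i)
                idx_b.append(j)
                used_b.add(j)
                break
    return idx_a, idx_b

# Toy (ra, dec) coordinates in degrees; only the second pair matches.
a = [(150.0, 2.0), (151.0, 3.0)]
b = [(151.00001, 3.00001), (170.0, 5.0)]
print(build_match_indices(a, b))  # ([1], [0])
```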

3.4.6 DDM Algorithms: Definitions and Notation


In the next two sections we describe some DDM algorithms to be used as part of the WS-DDM web service. All of these algorithms assume that the participating sites have the appropriate alignment indices. Hence, for simplicity, we describe the algorithms under the assumption that the data at each site is perfectly aligned: the i-th tuples of the sites match, and the sites have exactly the same number of tuples. This assumption can be emulated without problems using the matching indices.

Let X denote an n × ℓ matrix with real-valued entries. This matrix represents a dataset of n tuples from R^ℓ. Let X_i denote the i-th column of X and X_i[j] denote the j-th entry of this column. Let

μ_i = (1/n) Σ_{j=1..n} X_i[j]    (3.5)

denote the sample mean of the i-th column, let

Var(X_i) = (1/(n−1)) Σ_{j=1..n} (X_i[j] − μ_i)²    (3.6)

denote the sample variance of the i-th column, and let

Cov(X_i, X_{i′}) = (1/(n−1)) Σ_{j=1..n} (X_i[j] − μ_i)(X_{i′}[j] − μ_{i′})    (3.7)

denote the sample covariance of the i-th and i′-th columns. Note, Cov(X_i, X_i) = Var(X_i). Finally, let Cov(X) denote the covariance matrix of X, i.e. the ℓ × ℓ matrix whose (i, i′) entry is Cov(X_i, X_{i′}).

Assume this dataset has been vertically distributed over two sites A and B. Since we are assuming that the data at the sites is perfectly aligned, A has the first ℓ_A attributes and B has the last ℓ_B attributes (ℓ_A + ℓ_B = ℓ). Let X^A denote the n × ℓ_A matrix representing the dataset held by A, and X^B denote the n × ℓ_B matrix representing the dataset held by B. Let X denote the concatenation of the datasets, i.e. X = [X^A X^B]. The i-th column of X^A is denoted X^A_i.

In the next two sections, we describe communication-efficient algorithms for PCA and outlier detection on X vertically distributed over two sites. Both algorithms easily extend to more than two sites, but, for simplicity, we only discuss the two-site scenario. We have also developed a distributed algorithm for decision tree induction (supervised classification); we will not discuss it further and refer the reader to [108] for details. Later, we examine the effectiveness of the distributed PCA algorithm through a case study on real astronomical data. We leave for future work the job of testing the decision tree and outlier detection algorithms with case studies.

Following standard practice in applied statistics, we pre-process X by normalizing it so that each column has sample mean zero and sample variance one. This is achieved by replacing each entry X_i[j] with (X_i[j] − μ_i)/√Var(X_i). Since both μ_i and Var(X_i) can be computed without any communication, normalizing can be performed without any communication. Henceforth, we assume μ_i = 0 and Var(X_i) = 1 for all i.

Let λ_1 ≥ λ_2 ≥ … ≥ λ_ℓ denote the eigenvalues of Cov(X) and v_1, …, v_ℓ the associated eigenvectors (we assume the eigenvectors are column vectors, i.e. ℓ × 1 matrices, pairwise orthonormal). The i-th principal direction of X is v_i. The i-th principal component is denoted Y_i and equals X v_i (the projection of X along the i-th direction).
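The claim that normalization requires no communication can be illustrated per site (a minimal sketch on synthetic data; each site z-scores only the columns it holds):

```python
import numpy as np

def normalize_local(X_site):
    """Z-score each column using only locally available data.
    Column means and standard deviations need no communication,
    since every attribute resides wholly at one site."""
    mu = X_site.mean(axis=0)
    sigma = X_site.std(axis=0, ddof=1)
    return (X_site - mu) / sigma

rng = np.random.default_rng(1)
X_A = rng.standard_normal((500, 2)) * 3.0 + 7.0  # site A's two attributes
Z_A = normalize_local(X_A)
print(np.allclose(Z_A.mean(axis=0), 0.0),
      np.allclose(Z_A.std(axis=0, ddof=1), 1.0))  # True True
```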

3.4.7 Virtual Catalog Principal Component Analysis


PCA is a well-established data analysis technique used in a large number of disciplines: astronomy, computer science, biology, chemistry, climatology, geology, etc. Quoting [146] page 1: "The central idea of PCA is to reduce the dimensionality of a data set


consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the dataset." Next we provide a very brief overview of PCA; for a more detailed treatment, the reader is referred to [146].

The i-th principal component, Y_i, is, by definition, a linear combination of the columns of X in which the j-th column has coefficient v_i[j]. The sample variance of Y_i equals λ_i. The principal components are all uncorrelated, i.e. have zero pairwise sample covariances. Let Y^(k) (k ≤ ℓ) denote the n × k matrix with columns Y_1, …, Y_k. This is the dataset projected onto the subspace defined by the first k principal directions. If k = ℓ, then Y^(ℓ) is simply a different way of representing exactly the same dataset, because X can be recovered completely as X = Y^(ℓ) V^T, where V is the ℓ × ℓ matrix with columns v_1, …, v_ℓ and V^T denotes matrix transpose (V is a square matrix with orthonormal columns, hence V V^T equals the identity matrix). However, if k < ℓ, then Y^(k) is a lossy lower dimensional representation of X. The amount of loss is typically quantified as

(λ_1 + … + λ_k) / (λ_1 + … + λ_ℓ),    (3.8)

the proportion of variance captured by the lower dimensional representation. The larger the proportion captured, the better Y^(k) represents the "information" contained in the original dataset X. If k is chosen so that a large amount of the variance is captured, then, intuitively, Y^(k) captures many of the important features of X. So, subsequent analysis on Y^(k) can be quite fruitful at revealing structure not easily found by examination of X directly. Our case study will employ this idea.

To our knowledge, the problem of vertically distributed PCA computation was first addressed by Kargupta et al. [157], based on sampling and communication of dominant eigenvectors. Later, Kargupta and Puttagunta [155] developed a technique based on random projections. Our method is a slightly revised version of this work. We describe a distributed algorithm for approximating Cov(X). Clearly, PCA can be performed from Cov(X) without any further communication. Recall that X is normalized to have zero column sample mean and unit column sample variance. As a result, the (i, i′) entry of Cov(X) is (1/(n−1)) times the inner product between the columns X_i and X_{i′}. Clearly this inner product can be computed without communication when X_i and X_{i′} are at the same site (i.e. both in X^A or both in X^B). It suffices to show how the inner product can be approximated across different sites; in effect, how (X^A)^T X^B can be approximated. The key idea is based on the following fact, echoing the observation made in [199] that high-dimensional random vectors are nearly orthogonal. A similar result was proved elsewhere [12].

Fact 1 Let R be an n × k matrix each of whose entries is drawn independently from a distribution with variance one and mean zero. It follows that E[(1/k) R R^T] = I, where I is the n × n identity matrix.

We will use Algorithm 3.4.7.1 for computing (X^A)^T X^B. The result is obtained at both sites (in the communication cost calculations, we assume a message requires 4 bytes of transmission).
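Fact 1 can be checked numerically (a quick sketch; the sizes are arbitrary and chosen so the average concentrates):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 100_000                 # large k so (1/k) R R^T concentrates
R = rng.standard_normal((n, k))    # iid entries with mean 0, variance 1

approx_identity = (R @ R.T) / k    # Fact 1: expectation is the n x n identity
err = np.abs(approx_identity - np.eye(n)).max()
print(f"max entrywise deviation from I: {err:.4f}")  # shrinks like 1/sqrt(k)
```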

Algorithm 3.4.7.1 Distributed Covariance Matrix Algorithm
1. A sends B a random number generator seed. [1 message]
2. A and B each generate the same random n × k matrix R, where k ≪ n. Each entry is generated independently and identically from any distribution with mean zero and variance one.
3. A sends (X^A)^T R to B; B sends (X^B)^T R to A. [k(ℓ_A + ℓ_B) messages]
4. A and B compute D = (1/k) (X^A)^T R R^T X^B. Note that,

E[D] = (X^A)^T E[(1/k) R R^T] X^B,    (3.9)

which, by Fact 1, equals

(X^A)^T I X^B    (3.10)
= (X^A)^T X^B.    (3.11)

Hence, on expectation, the algorithm is correct. However, its communication cost in bytes, divided by the cost of the centralization-based algorithm, is on the order of k/n, which is small if k ≪ n. Indeed, k provides a "knob" for tuning the trade-off between communication efficiency and accuracy. Later, in our case study, we present experiments measuring this trade-off.
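Algorithm 3.4.7.1 can be sketched in NumPy as follows (a single-process simulation of the two sites; the function name and data are illustrative, and the shared seed stands in for the seed message of step 1):

```python
import numpy as np

def distributed_cross_covariance(X_A, X_B, k, seed=0):
    """Sketch of Algorithm 3.4.7.1 for two sites.
    Both sites build the same n x k random matrix R from a shared seed,
    exchange the small l_A x k and l_B x k projections, and estimate
    (X_A^T X_B) as (X_A^T R)(X_B^T R)^T / k."""
    n = X_A.shape[0]
    R = np.random.default_rng(seed).standard_normal((n, k))  # same R at both sites
    P_A = X_A.T @ R   # computed at site A, sent to B (l_A * k numbers)
    P_B = X_B.T @ R   # computed at site B, sent to A (l_B * k numbers)
    return (P_A @ P_B.T) / k

rng = np.random.default_rng(42)
n = 10_000
X_A = rng.standard_normal((n, 2))                     # site A: two attributes
X_B = 0.5 * X_A[:, :1] + rng.standard_normal((n, 1))  # site B: one correlated attribute

exact = X_A.T @ X_B
approx = distributed_cross_covariance(X_A, X_B, k=int(0.15 * n))  # k = 15% of n
print(f"max error per tuple: {np.abs(approx - exact).max() / n:.4f}")  # small
```

The relative error shrinks as k grows, while the bytes exchanged grow linearly in k, which is exactly the trade-off knob described above.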

3.4.8 Case Study: Finding Galactic Fundamental Planes


The identification of certain correlations among parameters has led to important discoveries in astronomy. For example, the classes of elliptical and spiral galaxies (including dwarfs) have been found to occupy a two-dimensional space inside a three-dimensional space of observed parameters: radius, mean surface brightness, and velocity dispersion. This two-dimensional plane has been referred to as the Fundamental Plane [83, 147]. This section presents a case study involving the detection of a fundamental plane among galaxy parameters distributed across two catalogs: 2MASS and SDSS. We consider several different combinations of parameters and explore for a fundamental plane in each (one such combination was mentioned above and described earlier in Example 3). Our goal is to demonstrate that, using our distributed covariance matrix algorithm to approximate the principal components, we can find a very similar fundamental plane to that obtained by applying a centralized PCA. Note that our ultimate goal is to enable new discoveries in astronomy through our DDM algorithms and DEMAC system. However, for now, our primary goal is to demonstrate that our distributed covariance algorithm could have found very similar results to the centralized approach at a fraction of the communication cost. Therefore, we argue that DEMAC could provide a valuable tool for astronomers wishing to explore many parameter spaces across different catalogs for fundamental planes.


In our study we measure the accuracy of our distributed algorithm in terms of the similarity between its results and those of a centralized approach. We examine accuracy at various amounts of communication allowed the distributed algorithm, in order to assess the trade-off described at the end of Section 3.4.7. For each amount of communication allowed, we ran the distributed algorithm 100 times with a different random matrix and report the average result (except where otherwise noted). For the purposes of our study, a real distributed environment is not necessary. Thus, for simplicity, we used a single machine and simulated a distributed environment. We carried out two sets of experiments. The first involves three parameters already examined in the astronomy literature [83, 147] (Example 3) and the fundamental plane observed there. The second involves several previously unexplored combinations of parameters.

Experiment Set One

We prepared our test data as follows. Using the web interfaces of 2MASS and SDSS, http://irsa.ipac.caltech.edu/applications/Gator/ and http://cas.sdss.org/astro/en/tools/crossid/upload.asp, and the SDSS object cross-id tool, we obtained an aggregate dataset involving attributes from 2MASS and SDSS lying in the sky region between right ascension (ra) 150 and 200, declination (dec) 0 and 15. The aggregated dataset had the following attributes from SDSS: Petrosian I band angular effective radius (Iaer), redshift (rs), and velocity dispersion (vd);19 and the following attribute from 2MASS: K band mean surface brightness (Kmsb).20 After removing tuples with missing attributes, we had a 1307-tuple dataset with four attributes. We produced a new attribute, logarithm Petrosian I band effective radius (log(Ier)), as log(Iaer*rs), and a new attribute, logarithm velocity dispersion (log(vd)), by applying the logarithm to vd. We dropped all other attributes to obtain the three-attribute dataset: log(Ier), log(vd), Kmsb.

Finally, we normalized each column by subtracting its mean from each entry and dividing by its sample standard deviation (as described in Section 3.4.6). We applied PCA directly to this dataset to obtain the centralization-based results. Then we treated this dataset as if it were distributed (assuming cross-match indices have been created as described earlier). This data can be thought of as a virtual table with attributes log(Ier) and log(vd) located at one site and attribute Kmsb at another. Finally, we applied our distributed covariance matrix algorithm and computed the principal components from the resulting matrix. Note, our dataset is somewhat small and not necessarily indicative of a scenario where DEMAC would be used in practice. However, for the purposes of our study (accuracy with respect to communication) it suffices. Figure 3.25 shows the percentage of variance captured as a function of communication percentage (i.e., at 15%, the distributed algorithm uses 15% of the bytes that full centralization would require).
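The accuracy-versus-communication measurement just described can be sketched end to end on synthetic stand-ins for the three attributes (the data below is illustrative, not the real SDSS/2MASS values; only the inter-site covariance block is approximated):

```python
import numpy as np

def variance_captured(cov, top=2):
    """Percent of variance captured by the `top` principal components."""
    w = np.linalg.eigvalsh(cov)[::-1]
    return 100.0 * w[:top].sum() / w.sum()

rng = np.random.default_rng(0)
n = 1307  # same size as the experiment-one dataset (values are synthetic)
u, v = rng.standard_normal(n), rng.standard_normal(n)
X = np.column_stack([u, v, 0.6 * u + 0.4 * v + 0.3 * rng.standard_normal(n)])
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
X_A, X_B = X[:, :2], X[:, 2:]  # log(Ier), log(vd) at one site; Kmsb at the other

central = variance_captured(X.T @ X / (n - 1))

k = int(0.15 * n)  # 15% communication
trials = []
for _ in range(100):  # average over 100 random matrices, as in the case study
    R = rng.standard_normal((n, k))
    cross = (X_A.T @ R) @ (X_B.T @ R).T / k  # approximate inter-site block only
    cov = np.block([[X_A.T @ X_A, cross], [cross.T, X_B.T @ X_B]]) / (n - 1)
    trials.append(variance_captured(cov))

print(f"centralized: {central:.1f}%  distributed: "
      f"{np.mean(trials):.1f}% +/- {np.std(trials):.1f}%")
```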
19 _ (galaxy view), z (SpecObj view), and velDisp (SpecObj view) in SDSS DR4.

20 _ in the extended source catalog in the All Sky Data Release, http://www.ipac.caltech.edu/2mass/releases/allsky/index.html

Figure 3.25: Communication percentage vs. percent of variance captured, (log(Ier), log(vd), Kmsb) dataset.

Error bars indicate standard deviation; recall that the percentage-of-variance-captured numbers are averages over 100 trials. First observe that the percentage captured by the centralized approach, 90.5%, replicates the known result that a fundamental plane exists among these parameters. Indeed, the dataset fits fairly nicely on the plane formed by the first two PCs. Also observe that the percentage of variance captured by the distributed algorithm (including one standard deviation) using as little as 10% communication never strays more than 5 percent from 90.5%. This is a reasonably accurate result, indicating that the distributed algorithm identifies the existence of a plane using 90% less communication. As such, this provides evidence that the distributed algorithm would serve as a good browser, allowing the user to get decent approximate results at a sharply reduced communication cost. If this piques the user's interest, she can go through the trouble of centralizing the data and carrying out an exact analysis. Interestingly, the average percentage captured by the distributed algorithm appears to approach the true percentage captured, 90.5%, very slowly (as the communication percentage approaches infinity, the average percentage captured must approach 90.5%). At present we don't have an explanation for the slow approach. However, as the communication increases, the standard deviation decreases substantially (as expected).

To analyze the accuracy of the actual principal components computed by the distributed algorithm, we consider the data projected onto each pair of PCs. The projection onto the true first and second PCs ought to appear with much scatter in both directions, as it represents the view of the data perpendicular to the plane. And, the projections onto the first, third and second, third PCs ought to appear more flattened, as they represent the view of the data perpendicular to the edge of the plane. Figures 3.26 and 3.27 display the results. The left column depicts the projections onto the PCs computed by the centralized analysis (true PCs). Here we see the fundamental plane. The right column depicts the projections onto the PCs computed by our distributed algorithm at 15% communication (for one random matrix, not the average over 100 trials). We see a similar pattern, indicating that the PCs computed by the distributed algorithm are quite accurate in the sense that they produce very similar projections to those produced by the true PCs. In closing, it is important to stress that we are not claiming that the actual projections can be computed in a communication-efficient fashion (they can't). Rather, the PCs computed in a distributed fashion are accurate as measured by projection similarity with the true PCs.

Experiment Set Two

We prepared our test data as follows. Using the web interfaces of 2MASS and SDSS and the SDSS object cross-id tool, we obtained an aggregate dataset involving attributes from 2MASS and SDSS lying in the sky region between right ascension (ra) 150 and 200, declination (dec) 0 and 15. The aggregated dataset had the following attributes from SDSS: Petrosian flux in the U band (petroU), G band (petroG), R band (petroR), I band (petroI), and Z band (petroZ);21 and the following attributes from 2MASS: K band mean surface brightness (Kmsb) and K concentration index (KconInd).22 We had a 29638-tuple dataset with seven attributes. We produced new attributes petroU-I = petroU - petroI, petroU-R = petroU - petroR, petroU-G = petroU - petroG, petroG-R = petroG - petroR, petroG-I = petroG - petroI, and petroR-I = petroR - petroI. We also produced a new attribute, logarithm K concentration index (log(KconInd)), by applying the logarithm to KconInd.
In each experiment, we dropped all attributes except the following to obtain our test dataset: petroU-I, petroU-R, petroU-G, petroG-R, petroG-I, petroR-I, log(KconInd), and Kmsb. Finally, we normalized each column by subtracting its mean from each entry and dividing by its sample standard deviation (as described in Section 3.4.6). We considered the following six combinations of three attributes: (petroU-I, log(KconInd), Kmsb), (petroU-R, log(KconInd), Kmsb), (petroU-G, log(KconInd), Kmsb), (petroG-R, log(KconInd), Kmsb), (petroG-I, log(KconInd), Kmsb), and (petroR-I, log(KconInd), Kmsb). In each case we carried out a distributed PCA experiment just as in experiment set one. Table 3.11 depicts the results at 15% communication. The Band column indicates the combination of attributes used, e.g. U-I indicates (petroU-I, log(KconInd), Kmsb). The Centralized Variance column contains the sum of the variances (times 100) of the first two principal components found by the centralized algorithm; in effect, it contains the percent of variance captured by the first two PCs. The Distributed Variance column contains the sum of the variances of the first two PCs found by the distributed algorithm (averaged over 100 trials). The STD Distributed Variance column contains the standard deviation of this average over 100 trials.
21 _, _, _, _, _ (PhotoObjAll).

22 _ and _ in the extended source catalog in the All Sky Data Release.

[Figure 3.26 panels: (a), (c) PCs from centralized analysis; (b), (d) PCs from distributed algorithm.]

Figure 3.26: Projections, PC1 vs. PC2 and PC1 vs. PC3; communication percentage 15%, (log(Ier), log(vd), Kmsb) dataset.

We see that in all cases the centralized experiments yield a relatively weak fundamental plane (72% of the variance captured by the first two PCs) relative to the fundamental plane from experiment set one (90.5% captured). The distributed algorithm in all cases does a decent job at replicating this result: 67.8% to 69.0% of the variance captured (all with standard deviation less than 0.99). These results come at an 85% communication savings over centralizing the data. To further illuminate the trade-off between communication savings and accuracy,

[Figure 3.27 panels: (a) PCs from centralized analysis; (b) PCs from distributed algorithm.]

Figure 3.27: Projections, PC2 vs. PC3; communication percentage 15%, (log(Ier), log(vd), Kmsb) dataset.

Band   Centralized Variance   Distributed Variance   STD Distributed Variance
U-I    72.6186                67.8766                0.717
U-R    72.5375                68.0817                0.8191
U-G    72.6392                69.0102                0.9781
G-R    72.3501                68.5664                0.9879
G-I    72.4842                68.6199                0.9842
R-I    72.7451                68.0521                0.8034

Table 3.11: The centralized and distributed variances captured by the first and second PCs (15% communication).

Figure 3.28 shows the percentage of variance captured on the petroR-I dataset as a function of communication percentage (error bars indicate standard deviation; recall that the percentage-of-variance-captured numbers are averages over 100 trials). Interestingly, the average percentage captured by the distributed algorithm appears to move away from the true percentage (72%) for communication up to 40%, then move very slightly toward 72%. This is surprising since the distributed algorithm error must approach zero as its communication percentage approaches infinity. The results appear to indicate that this approach is quite slow. At present we don't have an explanation for this phenomenon. However, as the communication increases, the standard deviation decreases (as expected). Moreover, despite the slow approach, the percentage of variance captured by the distributed algorithm (including one standard deviation) never strays more than 6 percent from the true percent captured, a reasonably accurate figure.

Figure 3.28: Communication percentage vs. percent of variance captured, (petroR-I, log(KconInd), Kmsb) dataset.

As in experiment set one, we analyze the accuracy of the actual principal components computed by the distributed algorithm by considering their data projections. We do so for the petroR-I dataset. Figures 3.29 and 3.30 display the results. The left column depicts the projections onto the PCs computed by the centralized analysis (true PCs). Note that 10 projected data points (out of 29638) are omitted from the centralized PC1 vs. PC2 and PC2 vs. PC3 plots due to scaling; the y-coordinate for these points (x in the case of PC2 vs. PC3) does not lie in the range [-20, 20] (in both cases a few points have coordinate nearly 60 or -60). Unlike the case of experiment set one, we do not see a very pronounced plane (as expected). The right column depicts the projections onto the PCs computed by our distributed algorithm at 15% communication (for one random matrix, not the average over 100 trials). The distributed projections appear to indicate a fundamental plane somewhat more strongly than the centralized ones. This indicates some inaccuracy in the distributed PCs. However, since the distributed variances correctly did not indicate a strong fundamental plane, the inaccuracies in the PCs would likely not play an important role for users browsing for fundamental planes.

3.4.9 Summary
We proposed a system, DEMAC, for the distributed exploration of massive astronomical catalogs. DEMAC is to be built on top of the existing U.S. National Virtual Observatory environment and will provide tools for data mining (as web services) without requiring datasets to be downloaded to a centralized server. Instead, users will download only the output of the data mining process (a data mining model); the actual data mining from multiple data servers will be performed using communication-efficient DDM algorithms. The distributed algorithms we have developed sacrifice perfect accuracy for communication savings: they offer approximate results at a considerably lower communication cost than that of exact results obtained through centralization. As such, we see DEMAC serving the role of an exploratory browser. Users can quickly get (generally quite accurate) results for their distributed queries at low communication cost. Armed with these results, users can focus in on a specific query or portion of the datasets and download it for more intricate analysis.

[Figure 3.29: Projections, PC1 vs. PC2 and PC1 vs. PC3; communication percentage 15%, (petroR-I, log(KconInd), Kmsb) dataset. Four panels: (a) PC1 vs. PC2 from centralized analysis; (b) PC1 vs. PC2 from the distributed algorithm; (c) PC1 vs. PC3 from centralized analysis; (d) PC1 vs. PC3 from the distributed algorithm.]

[Figure 3.30: Projections, PC2 vs. PC3; communication percentage 15%, (petroR-I, log(KconInd), Kmsb) dataset. Two panels: (a) PCs from centralized analysis; (b) PCs from the distributed algorithm.]

To illustrate the potential effectiveness of our system, we developed communication-efficient distributed algorithms for principal component analysis (PCA). We then carried out a case study using distributed PCA for detecting fundamental planes of astronomical parameters. We observed that our distributed algorithm replicated fairly closely the fundamental-plane results observed through centralized analysis, but at significantly reduced communication cost.

In closing, we envision our system increasing the ease with which large, geographically distributed astronomy catalogs can be explored by providing quick, low-communication solutions. Such a benefit will allow astronomers to better tap the riches of distributed virtual tables formed from joined and integrated sky survey catalogs.
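The communication savings come from sites exchanging low-dimensional random projections instead of raw data columns. The toy sketch below illustrates only that underlying principle (inner products of randomly projected vectors approximate true inner products, and hence covariance entries); it is not the thesis algorithm, and the sizes and variable names are made up for illustration.

```python
# Toy illustration of the random-projection idea behind communication-
# efficient distributed PCA. Two sites hold different columns of a
# virtual table; instead of shipping a full n-entry column, each ships
# a k-dimensional projection (k << n) computed with a shared random
# matrix, and the inner product of the projections approximates the
# true covariance entry. All sizes here are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
n, k = 5000, 750                              # k/n = 15% communication

x = rng.normal(size=n)                        # column held at site 1
y = 0.8 * x + rng.normal(scale=0.5, size=n)   # column held at site 2

R = rng.normal(size=(k, n)) / np.sqrt(k)      # shared random projection
px, py = R @ x, R @ y                         # each site ships k numbers

exact = x @ y / n                             # true covariance entry (~0.8)
approx = px @ py / n                          # estimate from projections only
# approx deviates from exact by O(1/sqrt(k)), here a few percent
```

The same trick extends to every entry of a covariance matrix, which is why an approximate PCA can be computed at a fraction of the cost of centralizing the catalogs.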


Chapter 4

Future Work
4.1 The DEMAC system - further explorations
The DEMAC system proposed in section 3.4.3 is an ongoing research project. There are several directions for future work, including the development of a simulated grid environment for the DEMAC system using the OGSA-DAI infrastructure, analysis of the DDM algorithms taking into account the overhead due to web services, and the development of other distributed data mining algorithms, such as outlier detection, that can be incorporated into the system. The following sections delve deeper into each of these directions for future work.

4.1.1 Grid-enabling DEMAC


A distributed system on the grid may contain several data repositories. These repositories may have different schemas, access mechanisms, storage models, and replication facilities, and can be stored locally or remotely. The Open Grid Services Architecture (OGSA) data services provide mechanisms for virtualization and transparency over the disparate data repositories on the grid [137]. The specifications for enabling unified data access and integration [182, 175, 183] using the service-based architecture advocated in OGSA have been laid down by the Data Access and Integration Services Working Group of the Global Grid Forum (GGF). This family of specifications defines web service interfaces to data resources, such as relational or XML databases; includes properties that can be used to describe a data service or the resource to which access is being provided; and defines message patterns that support access to (query and update of) data resources. In section 2.2.3 we provide a brief description of the architecture of the OGSA-DAI infrastructure. In this section we propose a grid-enabled version of the DEMAC system using the OGSA-DAI services as a starting point. We intend to research the following aspects of DEMAC: 1. Distributed Schema Integration: The current simulation of DEMAC uses two different astronomy catalogs, 2MASS [3] and SDSS [242]. Upon selection of the portion of the sky to be mined from these catalogs, the web interfaces of the

two sky surveys are used to download the data. The pre-processing of the data (including estimation of derived attributes), cross-matching, and index maintenance at distributed sites are all performed off-line. This procedure can be time consuming and has to be repeated every time a different portion of the sky is chosen for analysis. Hence it appears unrealistic, and motivates the need for the development of better data access and integration schemes. Since the federated astronomy databases are envisioned to be grid-enabled, an interesting possibility is to build a service-based schema integration module. This module should allow integration of heterogeneous data sources (including flat files, relational, and XML databases) and should also provide novel methods for indexing the integrated virtual databases. 2. Distributed Query Processing: The Open Grid Services Architecture - Distributed Query Processing (OGSA-DQP) [193] provides a service-based distributed query processor and is built on top of the OGSA-DAI infrastructure introduced in section 2.2.3. It supports queries over OGSA-DAI data services and uses grid data services to provide consistent access to metadata and to interact with databases on the grid. The service-based DQP framework consists of two services: (1) the Grid Distributed Query Service (Coordinator) and (2) the Query Evaluation Service (Evaluator). The role of the Coordinator is to obtain metadata to compile, partition, and schedule distributed query execution plans over nodes in the grid. The Evaluator is used by the Coordinator to execute query plans generated by the query compiler, optimiser, and scheduler. While the distributed query processor is an interesting contribution, work still needs to be done on the coordinated use of query processing services on the grid for scientific applications. 3. Distributed Workflow Management: In order to enable composition of web services, there is a need to develop workflow management schemes for distributed, scientific data mining.
In section 2.4 we discussed some of the related work and the challenges that need to be overcome.
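The Coordinator/Evaluator split of the service-based distributed query processor can be caricatured as follows. This is only an illustrative sketch: OGSA-DQP's real interfaces are WSDL-defined grid services, and every name here (Coordinator, Evaluator, the "plan fragment" as a predicate plus projection list) is hypothetical.

```python
# Illustrative sketch of the Coordinator/Evaluator split in a
# service-based distributed query processor. All class and method
# names are hypothetical, not OGSA-DQP's actual API.

class Evaluator:
    """Plays the Query Evaluation Service: runs one plan fragment locally."""
    def __init__(self, table):
        self.table = table                     # rows held at this grid node

    def execute(self, predicate, columns):
        # A select/project fragment evaluated against the local partition.
        return [{c: row[c] for c in columns}
                for row in self.table if predicate(row)]

class Coordinator:
    """Plays the Grid Distributed Query Service: partitions and merges."""
    def __init__(self, evaluators):
        self.evaluators = evaluators

    def run(self, predicate, columns):
        # "Schedule" the same fragment on every node, then union the results.
        out = []
        for ev in self.evaluators:
            out.extend(ev.execute(predicate, columns))
        return out

# Two nodes, each holding a horizontal partition of a sky-object table.
node1 = Evaluator([{"objid": 1, "z": 0.02}, {"objid": 2, "z": 0.31}])
node2 = Evaluator([{"objid": 3, "z": 0.12}])
coord = Coordinator([node1, node2])
low_z = coord.run(lambda r: r["z"] < 0.2, ["objid"])
# low_z == [{'objid': 1}, {'objid': 3}]
```

The real framework adds what this sketch omits: metadata-driven plan compilation, optimization, scheduling across heterogeneous nodes, and fault handling, which is exactly where the open research questions lie.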

4.1.2 PCA-based Outlier Detection on DEMAC


In the previous section 3.4.7 we used Principal Component Analysis for dimension reduction of astronomy catalogs. However, PCA can also be used for outlier detection. While the first principal components carry most of the variance in the data, the last components also carry valuable information. Some techniques for outlier detection have been developed based on the last components [109, 121, 120, 146, 245]. These techniques look to identify data points which deviate sharply from the correlation structure of the data.

Recall from Chapter 3, section 3.4.6: let $A$ be the $m \times n$ data matrix and let $v_1, \ldots, v_n$ be the principal directions, ordered by decreasing sample variance $\lambda_1 \geq \cdots \geq \lambda_n$. The $i$-th principal component $y_i = A v_i$ is a linear combination of the columns of $A$ (the $k$-th column has coefficient $v_{ik}$) with sample variance $\lambda_i$, i.e. the variance over the entries of $y_i$ is $\lambda_i$. Thus, if $\lambda_i$ is very small and there were no outlier data points, one would expect the entries of $y_i$ to be nearly constant. In this case, $v_i$ expresses a nearly linear relationship between the columns of $A$. A data point which deviates sharply from the correlation structure of the data will likely have its entry in $y_i$ deviate sharply from the rest of the entries (assuming no other outliers). Since the last components have the smallest $\lambda_i$, an outlier's entries in these components will likely stand out. This motivates examination of the following statistic for the $j$-th data point (the $j$-th row in $A$) and some $1 \leq q \leq n$:

$$d^{(1)}_j = \sum_{i=n-q+1}^{n} y_{ij}^2 \qquad (4.1)$$

where $y_{ij}$ denotes the $j$-th entry of $y_i$ and $q$ is a user-defined parameter. A possible criticism of this approach is pointed out in [146], page 237: "it [$d^{(1)}_j$] still gives insufficient weight to the last few PCs, [...] Because the PCs have decreasing variance with increasing index, the values of [$y_{ij}^2$] will typically become smaller as [$i$] increases, and therefore [$d^{(1)}_j$] implicitly gives the PCs decreasing weights as [$i$] increases. This effect can be severe if some of the PCs have very small variances, and this is unsatisfactory as it is precisely the low-variance PCs which may be most effective ..." To address this criticism, the components are normalized to give equal weight. Let $\tilde{v}_i$ denote the normalized principal direction: the vector whose $k$-th entry is $v_{ik}/\sqrt{\lambda_i}$. The normalized principal component is $\tilde{y}_i = A \tilde{v}_i$. The sample variance of $\tilde{y}_i$ equals one, so the weights of the normalized components are equal. The statistic we use for the $j$-th data point is (following the notation in [146])

$$d^{(2)}_j = \sum_{i=n-q+1}^{n} \tilde{y}_{ij}^2 = \sum_{i=n-q+1}^{n} \frac{y_{ij}^2}{\lambda_i} \qquad (4.2)$$

Using the above technique, we plan to develop a distributed top-k outlier detection algorithm for astronomy catalogs and provide experimental results to compare the performance of the distributed algorithm to existing centralized outlier detection algorithms. It would also be interesting to track the performance of the distributed algorithm in the simulated grid environment described above.
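As a concrete, centralized illustration of the statistic in equation (4.2), the NumPy sketch below scores each row of a data matrix by its squared entries in the q lowest-variance principal components, each divided by that component's variance. The distributed version remains future work; the function name and the toy dataset are made up.

```python
# Centralized NumPy sketch of the outlier statistic in equation (4.2):
# score each data point by its squared entries in the q lowest-variance
# principal components, normalized by the component variances.
import numpy as np

def last_pc_outlier_scores(A, q):
    X = A - A.mean(axis=0)                # center the data matrix
    lam, V = np.linalg.eigh(np.cov(X, rowvar=False))
    # eigh returns eigenvalues in ascending order, so the first q
    # columns of V are the q lowest-variance principal directions
    Y = X @ V[:, :q]                      # entries of the last q components
    return np.sum(Y**2 / lam[:q], axis=1)

# Toy data: points near the line y = 2x, plus one point off the line.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
A = np.column_stack([x, 2 * x + rng.normal(scale=0.01, size=200)])
A[0] = [0.0, 3.0]                         # violate the correlation structure
scores = last_pc_outlier_scores(A, q=1)
top_outlier = int(np.argmax(scores))      # index of the planted outlier
```

A distributed top-k version would need to compute (or approximate) the last principal directions and the per-point scores without centralizing the rows, which is precisely the algorithmic question posed above.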

4.2 Proposed Plan of Research


Table 4.1 illustrates the plan of research.


Table 4.1: Research Plan. Tasks scheduled between August 2006 and April 2007:
1. Develop an architecture for DDM using the OGSA-DAI framework
2. Grid-enabling DEMAC
3. Develop an architecture for distributed stream mining on the grid
4. Writing thesis


Bibliography
[1] S. AlSairafi, F. S. Emmanouil, M. Ghanem, N. Giannadakis, Y. Guo, D. Kalaitzopoulos, M. Osmond, A. Rowe, J. Syed, and P. Wendel. The Design of Discovery Net: Towards Open Grid Services for Knowledge Discovery. International Journal of High Performance Computing Applications, 17(3):297-315, 2003.
[2] S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, K. Liu, and S. Datta. Clustering Distributed Data Streams in Peer-to-Peer Environments. Information Science, 2005. In press.
[3] 2-Micron All Sky Survey. http://pegasus.phast.umass.edu.
[4] Alexander Reinefeld and Florian Schintke. Concepts and Technologies for a Worldwide Grid Infrastructure. In Lecture Notes in Computer Science, volume 2400, pages 62-71. Springer-Verlag, 2002.
[5] Peter Brezany, Alexander Woehrer, and A. Min Tjoa. Novel mediator architectures for Grid information systems. Future Generation Computer Systems, 21(1):107-114, January 2005.
[6] Peter Brezany, Alexander Woehrer, and Ivan Janciak. Virtualization of Heterogeneous Data Sources for Grid Information Systems. In MIPRO 2004, Opatija, Croatia, May 24-28, 2004.
[7] A. S. Ali, O. F. Rana, and I. J. Taylor. Web Services Composition for Distributed Data Mining. In ICPP 2005 Workshops, International Conference Workshops on Parallel Processing, pages 11-18. IEEE, June 2005.
[8] Ali Shaikh Ali, Omer F. Rana, and Ian J. Taylor. Web Services Composition for Distributed Data Mining. In Workshop on Web and Grid Services for Scientific Data Analysis (WAGSSDA), Oslo, Norway, 2005.
[9] W. Allcock, I. Foster, S. Tuecke, A. Chervenak, and C. Kesselman. Protocols and services for distributed data-intensive science. In Proceedings of Advanced Computing and Analysis Techniques in Physics Research (ACAT2000), 2000.
[10] Alon Y. Halevy, Zachary G. Ives, Dan Suciu, and Igor Tatarinov. Schema Mediation in Peer Data Management Systems. In 19th International Conference on Data Engineering (ICDE 2003), pages 505-516, 2003.
[11] Amol Ghoting and Srinivasan Parthasarathy. Facilitating Interactive Distributed Data Stream Processing and Mining. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing Systems (IPDPS), April 2004.
[12] Arriaga R. and Vempala S. An Algorithmic Theory of Learning: Robust Concepts and Random Projection. In Proceedings of the 40th Foundations of Computer Science, 1999.
[13] Szalay A.S., Gray J., Kunszt P., Thakar A., and Slutz D. Large Databases in Astronomy. In Mining the Sky, Proceedings of the MPA/ESO/MPE workshop, pages 99-118. Springer, 2001.
[14] A.S. Szalay, P.Z. Kunszt, and J. Gray. The Sloan Digital Sky Survey Science Archive: Migrating a Multi-Terabyte Astronomical Archive from Object to Relational DBMS. Computing in Science and Engineering, IEEE Press, 5(5):20-27, June 2003.
[15] Astro Grid. www.astrogrid.org.
[16] The AUTON Project. http://www.autonlab.org/autonweb/showProject/3/.
[17] P. Avery. Data Grids: A New Computational Infrastructure for Data Intensive Science. Technical Report GriPhyN Report 2002-24, GriPhyN, 2002.
[18] P. Avery and Ian Foster. The GriPhyN Project: Towards Petascale Virtual-Data Grids. Technical Report GriPhyN Report 2000-1, GriPhyN, December 2001. Submitted to the 2000 NSF Information and Technology Research Program, NSF award ITR-0086044.
[19] Ron Avnur and Joseph M. Hellerstein. Eddies: continuously adaptive query processing. In SIGMOD, pages 261-272, 2000.
[20] B. Babcock and C. Olston. Distributed top-k monitoring. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 28-39, San Diego, California, June 2003.
[21] B. Gilburd, A. Schuster, and R. Wolff. Privacy Preserving Data Mining on Data Grids in the Presence of Malicious Participants. In Proceedings of HPDC 2004, Honolulu, Hawaii, June 2004.
[22] Babcock B., Babu S., Datar M., Motwani R., and Widom J. Models and Issues in Data Stream Systems. In Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 1-16, 2002.
[23] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105-139, 1999.

[24] Boualem Benatallah, Quan Z. Sheng, and Marlon Dumas. The Self-Serv Environment for Web Services Composition. IEEE Internet Computing, 7(1):40-48, 2003.
[25] Steve Benford, Neil Crout, John Crowe, Stefan Egglestone, Malcom Foster, Alastair Hampshire, Barrie Hayes-Gill, Alex Irune, Ben Palethorpe, Timothy Reid, and Mark Sumner. e-Science from the Antarctic to the GRID, August 2003.
[26] Fran Berman. Viewpoint: From TeraGrid to Knowledge Grid. Commun. ACM, 44(11):27-28, 2001.
[27] R. Bhargava, H. Kargupta, and M. Powers. Energy Consumption in Data Analysis for On-board and Distributed Applications. In Proceedings of the 2003 International Conference on Machine Learning Workshop on Machine Learning Technologies for Autonomous Space Applications, 2003.
[28] bioGrid: Biotechnology Information and Knowledge Grid. http://www.biogrid.net/.
[29] BIRN: Biomedical Informatics Research Network. http://www.nbirn.net/AU/index.htm.
[30] BodyMedia SenseWear armband. http://www.bodymedia.com/index.jsp.
[31] Borne K. Distributed Data Mining in the National Virtual Observatory. In Proceedings of the SPIE Conference on Data Mining and Knowledge Discovery: Theory, Tools, and Technology V, volume 5098, page 211, 2003.
[32] Borne K., Arribas S., Bushouse H., Colina L., and Lucas R. A National Virtual Observatory (NVO) Science Case. In Proceedings of the Emergence of Cosmic Structure, New York: AIP, page 307, 2003.
[33] Distributed Data Mining Techniques for Object Discovery in the National Virtual Observatory. http://is.arc.nasa.gov/IDU/tasks/NVODDM.html.
[34] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[35] Leo Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[36] C. L. Bridges and D. E. Goldberg. The nonuniform Walsh-schema transform. In G. J. E. Rawlins, editor, Foundations of Genetic Algorithms, pages 13-22. Morgan Kaufmann, San Mateo, CA, 1991.
[37] Giuseppe Bueti, Antonio Congiusta, and Domenico Talia. Developing distributed data mining applications in the Knowledge Grid framework. In Proc. of the 6th Int. Conf. on High Performance Computing for Computational Science (VECPAR 2004), volume 3402 of LNCS, pages 156-169, Valencia, Spain, June 2004. ISBN 3-540-25424-2.

[38] C. Giannella, R. Bhargava, and H. Kargupta. Multi-Agent Systems and Distributed Data Mining. In Proceedings of the 8th International Workshop on Cooperative Information Agents (CIA 2004), Lecture Notes in Artificial Intelligence, volume 3191, Erfurt, Germany, September 27-29, 2004.
[39] Mario Cannataro, Antonio Congiusta, Carlo Mastroianni, Andrea Pugliese, Domenico Talia, and Paolo Trunfio. Grid-based data mining and knowledge discovery. In N. Zhong and J. Liu, editors, Intelligent Technologies for Information Analysis, pages 19-45. Springer-Verlag, 2004. ISBN 3-540-40677-8.
[40] Mario Cannataro, Antonio Congiusta, Andrea Pugliese, Domenico Talia, and Paolo Trunfio. Distributed data mining on grids: services, tools, and applications. IEEE Transactions on Systems, Man, and Cybernetics: Part B (TSMC-B), 34(6):2451-2465, 2004.
[41] Mario Cannataro, Antonio Congiusta, Domenico Talia, and Paolo Trunfio. A data mining toolset for distributed high-performance platforms. In Proc. of the 3rd International Conference on Data Mining Methods and Databases for Engineering, Finance and Other Fields (Data Mining 2002), pages 41-50, Southampton, UK, September 2002. WIT Press. ISBN 1-85312-925-9.
[42] Mario Cannataro and Domenico Talia. Parallel and distributed knowledge discovery on the grid: A reference architecture. In Proc. of the 4th International Conference on Algorithms and Architectures for Parallel Computing (ICA3PP), pages 662-673, Hong Kong, December 2000. World Scientific.
[43] Mario Cannataro and Domenico Talia. The Knowledge Grid. Communications of the ACM, 46(1):89-93, 2003.
[44] Mario Cannataro and Domenico Talia. Semantics and knowledge grids: Building the next-generation grid. IEEE Intelligent Systems, 19(1):56-63, 2004.
[45] Mario Cannataro, Domenico Talia, and Paolo Trunfio. Design of distributed data mining applications on the Knowledge Grid. In Proc. of the National Science Foundation Workshop on Next Generation Data Mining (NGDM'02), pages 191-195, Baltimore, Maryland, November 2002.
[46] Mario Cannataro, Domenico Talia, and Paolo Trunfio. Design of distributed data mining applications on the Knowledge Grid. In H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha, editors, Data Mining: Next Generation Challenges and Future Directions, pages 67-88. AAAI/MIT Press, 2004. ISBN 0-262-61203-8.
[47] D. Caragea, A. Silvescu, and V. Honavar. A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees. International Journal of Hybrid Intelligent Systems, 2003.
[48] D. Caragea, A. Silvescu, and V. Honavar. Decision Tree Induction from Distributed, Heterogeneous, Autonomous Data Sources. In Proceedings of the Conference on Intelligent Systems Design and Applications (ISDA 03), Tulsa, Oklahoma, 2003.
[49] Caragea D., Silvescu A., and Honavar V. Decision Tree Induction from Distributed Data Sources. In Proceedings of the Conference on Intelligent Systems Design and Applications, 2003.
[50] Carl Barratt, Andrea Brogni, Matthew Chalmers, William R. Cobern, John Crowe, Don Cruickshank, Nigel Davies, Dave De Roure, Adrian Friday, Alastair Hampshire, Oliver J. Gibson, Chris Greenhalgh, Barrie Hayes-Gill, Jan Humble, Henk Muller, Ben Palethorpe, Tom Rodden, Chris Setchell, Mark Sumner, Oliver Storz, and Lionel Tarassenko. Extending the Grid to Support Remote Medical Monitoring. In UK e-Science All Hands Meeting, 2003.
[51] The CfA Redshift Catalogue, Version June 1995. http://vizier.cfa.harvard.edu/viz-bin/Cat?VII/193.
[52] Dipanjan Chakraborty, Anupam Joshi, Yelena Yesha, and Tim Finin. Toward distributed service discovery in pervasive computing environments. IEEE Transactions on Mobile Computing, 5(2):97-112, 2006.
[53] P. Chan and S. Stolfo. Experiments on multistrategy learning by meta-learning. In Proceedings of the Second International Conference on Information and Knowledge Management, pages 314-323, 1993.
[54] P. Chan and S. Stolfo. Meta-learning for multistrategy and parallel learning. In Proceedings of the Second International Workshop on Multistrategy Learning, pages 150-165, 1993.
[55] P. Chan and S. Stolfo. Toward parallel and distributed learning by meta-learning. In Working Notes of the AAAI Workshop on Knowledge Discovery in Databases, pages 227-240. AAAI, 1993.
[56] P. Chan and S. Stolfo. On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Information Systems, 8:5-28, 1996.
[57] P. Chan and S. Stolfo. Toward scalable learning with non-uniform class and cost distribution: A case study in credit card fraud detection. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. AAAI Press, September 1998.
[58] P. K. Chan and S. J. Stolfo. A comparative evaluation of voting and meta-learning on partitioned data. In Proceedings of the Twelfth International Conference on Machine Learning, pages 90-98, 1995.
[59] P. K. Chan and S. J. Stolfo. Sharing learned models among remote database partitions by local meta-learning. In E. Simoudis, J. Han, and U. Fayyad, editors, The Second International Conference on Knowledge Discovery and Data Mining, pages 2-7. AAAI Press, 1996.
[60] J. Chen, D. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: a scalable continuous query system for Internet databases. In ACM SIGMOD'00 International Conference on Management of Data, 2000.

[61] Liang Chen, K. Reddy, and G. Agrawal. GATES: a grid-based middleware for processing distributed data streams. In Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing, pages 192-201, Honolulu, Hawaii, USA, 4-6 June 2004.
[62] Chen R. and Sivakumar K. A New Algorithm for Learning Parameters of a Bayesian Network from Distributed Data. In Proceedings of the Second IEEE International Conference on Data Mining (ICDM), pages 585-588, 2002.
[63] Chen R., Sivakumar K., and Kargupta H. Learning Bayesian Network Structure from Distributed Data. In Proceedings of the Third SIAM International Conference on Data Mining, pages 284-288, 2003.
[64] Ann Chervenak, Ian Foster, Carl Kesselman, Charles Salisbury, and Steven Tuecke. The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications, 23:187-200, 2001. (Based on a conference publication in Proceedings of the NetStore Conference, 1999.)
[65] F. Chung. Spectral Graph Theory. American Mathematical Society, Providence, Rhode Island, USA, 1994.
[66] D. Churches, G. Gombas, A. Harrison, J. Maassen, C. Robinson, M. Shields, I. Taylor, and I. Wang. Programming Scientific and Distributed Workflow with Triana Services. Grid Workflow 2004 Special Issue of Concurrency and Computation: Practice and Experience, to be published, 2005.
[67] The ClassX Project: Classifying the High-Energy Universe. http://heasarc.gsfc.nasa.gov/classx/.
[68] Carmela Comito, Anastasios Gounaris, Rizos Sakellariou, and Domenico Talia. Data integration and query reformulation in service-based grids. In Proc. of the 1st CoreGRID Integration Workshop, Pisa, Italy, November 2005.
[69] Carmela Comito and Domenico Talia. GDIS: A service-based architecture for data integration on grids. In Proc. of OTM 2004 Workshops, volume 3292 of LNCS, pages 88-98, Agia Napa, Cyprus, October 2004. Springer-Verlag. ISBN 3-540-23664-3.
[70] Carmela Comito and Domenico Talia. GDIS: A service-based architecture for data integration on grids. In Proc. of OTM 2004 Workshops, volume 3292 of LNCS, pages 88-98, Agia Napa, Cyprus, October 2004. Springer-Verlag. ISBN 3-540-23664-3.
[71] Antonio Congiusta, Carlo Mastroianni, Andrea Pugliese, Domenico Talia, and Paolo Trunfio. Enabling knowledge discovery services on grids. In Proc. of the 2nd European AcrossGrids Conference (AxGrids 2004), volume 3165 of LNCS, pages 250-259, Nicosia, Cyprus, January 2004. Springer-Verlag. ISBN 3-540-22888-8.
[72] Antonio Congiusta, Andrea Pugliese, Domenico Talia, and Paolo Trunfio. Designing grid services for distributed knowledge discovery. Web Intelligence and Agent Systems (WIAS), 1(2):91-104, 2003.
[73] Antonio Congiusta, Domenico Talia, and Paolo Trunfio. VEGA: A visual environment for developing complex grid applications. In Proc. of the First International Workshop on Knowledge Grid and Grid Intelligence (KGGI 2003), pages 56-66, Halifax, Canada, October 2003. Department of Mathematics and Computing Science, Saint Mary's University. ISBN 0-9734039-0-X.
[74] C. Cortes, K. Fisher, D. Pregibon, A. Rogers, and F. Smith. Hancock: A language for extracting signatures from data streams. In Proceedings of ACM SIGKDD-2000, 2000.
[75] D. Bosio, J. Casey, A. Frohner, L. Guy, et al. Next Generation EU DataGrid Data Management Services. In Computing in High Energy Physics (CHEP 2003), March 2003.
[76] Digital Dig - Data Mining in Astronomy. http://www.astrosociety.org/pubs/ezine/datamining.html.
[77] Deep Wide-Field Survey. http://www.noao.edu/noao/noaodeep/.
[78] Dhillon I. and Modha D. A Data-clustering Algorithm on Distributed Memory Multiprocessors. In Proceedings of the KDD'99 Workshop on High Performance Knowledge Discovery, pages 245-260, 1999.
[79] Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting and randomization. Machine Learning, 40(2):139-158, 2000.
[80] Dipanjan Chakraborty and Anupam Joshi. Anamika: Distributed Service Composition Architecture for Pervasive Environments. SIGMOBILE Mob. Comput. Commun. Rev., 7(1):38-40, 2003.
[81] Dipanjan Chakraborty, Suraj Kumar Jaiswal, Archan Misra, and Amit A. Nanavati. Middleware architecture for evaluation and selection of 3rd-party web services for service providers. In ICWS 2005, pages 647-654, 2005.
[82] Dipanjan Chakraborty, Anupam Joshi, Tim Finin, and Yelena Yesha. Service Composition for Mobile Environments. Mobile Networks and Applications, 10:435-451, 2005.
[83] Elliptical Galaxies: Merger Simulations and the Fundamental Plane. http://irs.ub.rug.nl/ppn/244277443.
[84] Data Mining Grid. http://www.datamininggrid.org/.
[85] Workshop on Data Mining and the Grid (DM-Grid 2004). http://www.cs.technion.ac.il/~ranw/dmgrid/program.html.

[86] Domenico Talia, Paolo Trunfio, and Oreste Verta. WEKA4WS: a WSRF-enabled Weka Toolkit for Distributed Data Mining on Grids. In Proc. of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2005), volume 3721 of LNAI, pages 309-320, Porto, Portugal, October 2005. Springer-Verlag.
[87] Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, August 2000.
[88] Harris Drucker and Corinna Cortes. Boosting decision trees. Advances in Neural Information Processing Systems, 8:479-485, 1996.
[89] H. Dutta, H. Kargupta, and A. Joshi. Orthogonal Decision Trees for Resource-Constrained Physiological Data Stream Monitoring using Mobile Devices. In David A. Bader, Manish Parashar, Varadarajan Sridhar, and Viktor K. Prasanna, editors, Lecture Notes in Computer Science (3769), High Performance Computing - HiPC 2005, Goa, India, December 18-21, 2005. Springer Science and Business Media.
[90] M. Eisenhardt, W. Muller, and A. Henrich. Classifying Documents by Distributed P2P Clustering. In Proceedings of Informatik 2003, GI Lecture Notes in Informatics, Frankfurt, Germany, September 2003.
[91] EU Data Grid. http://web.datagrid.cnr.it/.
[92] Astrophysical Virtual Observatory. http://www.euro-vo.org/.
[93] European Data Grid Project. http://edg-wp2.web.cern.ch/edg-wp2/index.html.
[94] The FAEHIM project. http://users.cs.cf.ac.uk/Ali.Shaikhali/faehim/.
[95] Wei Fan, Sal Stolfo, and Phillip Chan. Using conflicts among multiple base classifiers to measure the performance of stacking. In ICML-99 Workshop on Recent Advances in Meta-learning and Future Work, pages 10-17, 1999.
[96] Wei Fan, Sal Stolfo, and Junxin Zhang. The application of AdaBoost for distributed, scalable and on-line learning. In Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, California, 1999.
[97] J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. Testing and spot checking of data streams. In Proceedings of the 11th ACM/SIAM Symposium on Discrete Algorithms, pages 165-174, New York/Philadelphia, 2000.
[98] Framework for Mining and Analysis of Space Science Data. http://www.itsc.uah.edu/f-mass/.
[99] Ian Foster. The Grid: A New Infrastructure for 21st Century Science. Physics Today, February 2002.
[100] Ian Foster and C. Kesselman. Computational Grids. Chapter 2 of The Grid: Blueprint for a New Computing Infrastructure, 1999.
[101] Ian Foster and Carl Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2004.
[102] Fred A. and Jain A. Data Clustering Using Evidence Accumulation. In Proceedings of the International Conference on Pattern Recognition 2002, pages 276-280, 2002.
[103] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995.
[104] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th International Conference on Machine Learning, pages 148-156, Murray Hill, NJ, 1996. Morgan Kaufmann.
[105] Forman G. and Zhang B. Distributed Data Clustering Can Be Efficient and Exact. SIGKDD Explorations, 2(2):34-38, 2000.
[106] J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams. In SIGMOD 2001, Santa Barbara, CA, 2001.
[107] Chris Giannella, Kun Liu, Todd Olsen, and Hillol Kargupta. Communication Efficient Construction of Decision Trees Over Heterogeneously Distributed Data. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), Brighton, UK, November 2004.
[108] Giannella C., Liu K., Olsen T., and Kargupta H. Communication Efficient Construction of Decision Trees Over Heterogeneously Distributed Data. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM), 2004.
[109] Gnanadesikan R. and Kettenring J. Robust Estimates, Residuals, and Outlier Detection with Multiresponse Data. Biometrics, 28:81-124, 1972.
[110] D. Goldberg. Genetic algorithms and Walsh functions: Part I, a gentle introduction. Complex Systems, 3(2):129-152, 1989.
[111] V. Gorodetski, V. Skormin, L. Popyack, and O. Karsaev. Distributed learning in data fusion systems. In Proceedings of the Conference of the World Computer Congress (WCC-2000), Intelligent Information Processing (IIP 2000), Beijing, China, 2000.
[112] GridMiner. www.gridminer.org.
[113] GridPP. www.gridpp.ac.uk.
[114] Grid Weka. http://smi.ucd.ie/~rinat/weka/.
[115] Grid Physics Network. http://www.griphyn.org.

[116] GRIST: Grid Data Mining for Astronomy. http://grist.caltech.edu. [117] S. Guha, N. Mishra, R. Motwani, and L. OCallaghan. Clustering data streams. In Proceedings of the Annual Symposium on Foundations of Computer Science, pages 359366, November,2000. [118] Y. Guo and J. Sutiwaraphun. Distributed learning with knowledge probing: A new framework for distributed data mining. In Advances in Distributed and Parallel Knowledge Discovery,Eds: Hillol Kargupa and Phillip Chan. MIT Press, 2000. [119] H. Dutta H. Kargupta, B. Park. Orthogonal Decision Trees. In Accepted for publication in Transactions on Knowledge and Data Engineering (In press), 2005. [120] Hawkins D. The Detection of Errors in Multivariate Data Using Principal Components. Journal of the American Statistical Association, 69(346):340344, 1974. [121] Hawkins D. and Fatti P. Exploring Multivariate Data Using the Minor Principal Components. The Statistician, 33:325338, 1984. [122] Helen Conover, Sara J. Graves, Rahul Ramachandran, Sandi Redman, John Rushing, Steve Tanner, Robert Wilhelmson. Data Mining on the TeraGrid. In Supercomputing Conference, Phoenix, AZ, Nov. 15 2003. [123] Hershberger, D. and Kargupta, H. Distributed Multivariate Regression Using Wavelet-based Collective Data Mining. Journal of Parallel and Distributed Computing, 61:372400, 1999. [124] Tony Hey and Anne Trefethen. The Data Deluge. Grid Computing - Making the Global Infrastructure a Reality, January 2003. [125] Hinke T. and Novotny J. Data Mining on NASAs Information Power Grid. In Proceedings of the 9th IEEE International Symposium on High Performance Distributed Computing (HPDC00), page 292, 2000. [126] Hinneburg A. and Keim D. An Efcient Approach to Clustering in Large Multimedia Databases with Noise. In Proceedings of the 1998 International Confernece on Knowledge Discovery and Data Mining (KDD), pages 5865, 1998. [127] J. Hofer and P. Brezany. 
Distributed Decision Tree Induction within the grid data mining framework gridminer-core. Technical Report TR 2004-04, University of Vienna, Vienna, Austria, March 2004. [128] Juergen Hofer and Peter Brezany. DIGIDT: Distributed Classier Construction in the Grid Data Mining Framework GridMiner-Core. In Proceedings of the Workshop on Data Mining and the Grid (GM-Grid 2004) held in conjunction with the 4th IEEE International Conference on Data Mining (ICDM04), November 1-4 2004. 104

[129] Hofer J. and Brezany P. Distributed Decision Tree Induction Within the Grid Data Mining Framework. Technical Report TR 2004-04, Institute for Software Science, University of Vienna, 2004. [130] H. Stockinger, F. Donno, E. Laure, S. Muzaffar et al. Grid Data Management in Action. In Computing in High Energy Physics (CHEP 2003), March 2003. [131] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, 2001. ACM Press. [132] The Human Genome Project. http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml. [133] I. Foster. Globus Toolkit Version 4: Software for Service-Oriented Systems. In IFIP International Conference on Network and Parallel Computing, Springer-Verlag, LNCS 3779, pages 2–13, 2005. [134] I. Foster, C. Kesselman, J. Nick, S. Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. Open Grid Service Infrastructure WG, Global Grid Forum, June 2002. [135] I. Foster, C. Kesselman, S. Tuecke. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International Journal of Supercomputer Applications, 15(3), 2001. [136] I. Foster, E. Alpert, A. Chervenak, B. Drach, C. Kesselman, V. Nefedova, D. Middleton, A. Shoshani, A. Sim, D. Williams. The Earth System Grid II: Turning Climate Datasets Into Community Resources. In Proceedings of the American Meteorological Society Conference, 2001. [137] I. Foster, H. Kishimoto, A. Savva, D. Berry, A. Djaoui, A. Grimshaw, B. Horn, F. Maciel, F. Siebenlist, R. Subramaniam, J. Treadwell, J. Von Reich. The Open Grid Services Architecture, Version 1.0. Informational Document, Global Grid Forum (GGF), January 2005. [138] Adriana Iamnitchi and Domenico Talia. P2P computing and interaction with grids. Future Generation Computer Systems (FGCS), 21(3):331–332, 2005. [139] Informing business and regional policy. 
http://www.epcc.ed.ac.uk/projects/inwa/. [140] International Virtual Data Grid Laboratory. http://www.ivdgl.org/. [141] J. da Silva, C. Giannella, R. Bhargava, H. Kargupta, M. Klusch. Distributed Data Mining and Agents. Engineering Applications of Artificial Intelligence, volume 18, pages 791–807, 2005. [142] J. Kim, Y. Gil, M. Spraragen. A Knowledge-Based Approach to Interactive Workflow Composition. In Workshop on Planning and Scheduling for Web and Grid Services, at the 14th International Conference on Automated Planning and Scheduling (ICAPS'04), Whistler, Canada, 2004.

[143] J. Syed, M. Ghanem, Y. Guo. Discovery Processes: Representation and Re-use. In UK e-Science All Hands Conference, Sheffield, UK, September 2002. [144] Januzaj E., Kriegel H.-P., and Pfeifle M. DBDC: Density Based Distributed Clustering. In Proceedings of EDBT, Lecture Notes in Computer Science 2992, pages 88–105, 2004. [145] Johnson E. and Kargupta H. Collective, Hierarchical Clustering From Distributed, Heterogeneous Data. In M. Zaki and C. Ho, editors, Lecture Notes in Computer Science, volume 1759, pages 221–244. Springer-Verlag, 1999. [146] Jolliffe I. Principal Component Analysis. Springer-Verlag, 2002. [147] Lewis A. Jones and Warrick J. Couch. A statistical comparison of line strength variations in Coma and cluster galaxies at z ~ 0.3. Publications of the Astronomical Society of Australia, 15:309–317, 1998. [148] H. Kargupta, R. Bhargava, K. Liu, M. Powers, P. Blair, S. Bushra, J. Dull, K. Sarkar, M. Klein, M. Vasa, and D. Handy. VEDAS: A Mobile and Distributed Data Stream Mining System for Real-time Vehicle Monitoring. In Proceedings of the 2004 SIAM International Conference on Data Mining (SDM'04), Lake Buena Vista, FL, April 2004. [149] H. Kargupta and B. Park. Mining time-critical data streams using the Fourier spectrum of decision trees. In Proceedings of the IEEE International Conference on Data Mining, pages 281–288. IEEE Press, 2001. [150] H. Kargupta, B. Park, D. Hershberger, and E. Johnson. Collective data mining: A new perspective towards distributed data mining. In Advances in Distributed and Parallel Knowledge Discovery, Eds: Kargupta, Hillol and Chan, Philip. AAAI/MIT Press, 2000. [151] H. Kargupta and B.H. Park. A Fourier spectrum-based approach to represent decision trees for mining data streams in mobile environments. IEEE Transactions on Knowledge and Data Engineering, 16(2):216–229, 2002. [152] H. Kargupta, B.H. Park, S. Pittie, L. Liu, D. Kushraj, and K. Sarkar. MobiMine: Monitoring the stock market from a PDA. ACM SIGKDD Explorations, 3(2):37–46, January 2002. 
[153] Hillol Kargupta and Haimonti Dutta. Orthogonal Decision Trees. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), Brighton, UK, November 2004. [154] Kargupta H. and Chan P. (editors). Advances in Distributed and Parallel Knowledge Discovery. AAAI Press, Menlo Park, CA, 2000. [155] Kargupta H. and Puttagunta V. An Efficient Randomized Algorithm for Distributed Principal Component Analysis from Heterogeneous Data. In Proceedings of the Workshop on High Performance Data Mining, in conjunction with the Fourth SIAM International Conference on Data Mining, 2004.

[156] Kargupta H. and Sivakumar K. Existential Pleasures of Distributed Data Mining. In Data Mining: Next Generation Challenges and Future Directions, edited by H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha, MIT/AAAI Press, pages 3–26, 2004. [157] Kargupta H., Huang W., Sivakumar K., and Johnson E. Distributed clustering using collective principal component analysis. Knowledge and Information Systems Journal, 3:422–448, 2001. [158] First International Workshop on Knowledge and Data Mining Grid (KDMG'05). http://laurel.datsi.fi.upm.es/KDMG05/. [159] Knowledge Grid Lab. http://dns2.icar.cnr.it/kgrid/. [160] M. Klusch, S. Lodi, and G. L. Moro. Distributed Clustering Based on Sampling Local Density Estimates. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2003), pages 485–490, Mexico, August 2003. [161] Konstantinos Karasavvas, Mario Antonioletti, Malcolm Atkinson, Neil Chue Hong, Tom Sugden, Alastair Hume, Mike Jackson, Amrey Krause, Charaka Palansuriya. Introduction to OGSA-DAI Services. Lecture Notes in Computer Science, 3458:1–12, June 2005. [162] Yordan Kostov and Govind Rao. Low-cost optical instrumentation for biomedical measurements. Review of Scientific Instruments, 71(12):4361–4373, December 2000. [163] J. Kotecha, V. Ramachandran, and A. Sayeed. Distributed multi-target classification in wireless sensor networks. IEEE Journal of Selected Areas in Communications (Special Issue on Self-Organizing Distributed Collaborative Sensor Networks), 2003. [164] Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, Hoboken, NJ, July 2004. [165] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum. SIAM Journal on Computing, 22(6):1331–1348, 1993. [166] S. W. Kwok and C. Carter. Multiple decision trees. Uncertainty in Artificial Intelligence 4, pages 327–335, 1990. [167] Lazarevic A., Pokrajac D., and Obradovic Z. 
Distributed Clustering and Local Regression for Knowledge Discovery in Multiple Spatial Databases. In Proceedings of the 8th European Symposium on Artificial Neural Networks, pages 129–134, 2000. [168] M. Lenzerini. Data integration: a theoretical perspective. In Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2002), pages 233–246, Madison, Wisconsin, June 3–5, 2002. ACM Press, New York, NY, USA.

[169] F. Leymann and D. Roller. Workflow-Based Applications. IBM Systems Journal, 36(1):102–123, 1997. [170] Large Hadron Collider (LHC). http://lcg.web.cern.ch/LCG/. [171] Kuncheva L.I. A Theoretical Study on Six Classifier Fusion Strategies. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):281–286, 2002. [172] Kuncheva L.I. Fuzzy vs Non-fuzzy in combining classifiers designed by boosting. IEEE Transactions on Fuzzy Systems, 11(6):729–741, 2003. [173] Laser Interferometer Gravitational Wave Observatory. http://www.ligo.caltech.edu/.

[174] N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, Fourier transform, and learnability. Journal of the ACM, 40:607–620, 1993. [175] M. Antonioletti, B. Collins, A. Krause, S. Laws, S. Malaika, J. Magowan, N.W. Paton. Web Services Data Access and Integration - The Relational Realization (WS-DAIR), Version 1.0. Global Grid Forum (GGF), 2005. [176] M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, Y. Xing, S. Zdonik. Scalable Distributed Stream Processing. In First Biennial Conference on Innovative Database Systems (CIDR'03), Asilomar, CA, January 2003. [177] Sam Madden and Michael J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In ICDE, 2002. [178] Sam Madden, Mehul A. Shah, Joseph M. Hellerstein, and Vijayshankar Raman. Continuously adaptive continuous queries over streams. In SIGMOD Conference, 2002. [179] S. Majithia, M. S. Shields, I. J. Taylor, and I. Wang. Triana: A Graphical Web Service Composition and Execution Toolkit. In Proceedings of the IEEE International Conference on Web Services (ICWS'04), pages 514–524. IEEE Computer Society, 2004. [180] S. Majithia, I. Taylor, M. Shields, and I. Wang. Triana as a Graphical Web Services Composition Toolkit. In Simon J. Cox, editor, Proceedings of UK e-Science All Hands Meeting, pages 494–500. EPSRC, September 2003. [181] A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. Finding (recently) frequent items in distributed data streams. In International Conference on Data Engineering (ICDE'05), 2005. [182] M. Antonioletti, A. Krause, S. Hastings, S. Langella, S. Laws, S. Malaika, N.W. Paton. Web Services Data Access and Integration - The XML Realization (WS-DAIX), Version 1.0. Global Grid Forum (GGF), 2005. 

[183] M. Antonioletti, M. Atkinson, A. Krause, S. Laws, S. Malaika, N.W. Paton, D. Pearson, G. Riccardi. Web Services Data Access and Integration - The Core (WS-DAI) Specification, Version 1.0. Global Grid Forum (GGF), 2005. [184] María S. Pérez, Alberto Sánchez, Pilar Herrero, Víctor Robles, José M. Peña. Adapting the Weka Data Mining Toolkit to a Grid Based Environment. In Lecture Notes in Computer Science, volume 3528, pages 492–497, January 2005. [185] Mario Antonioletti, Malcolm Atkinson, Rob Baxter, Andrew Borley, Neil P. Chue Hong, Brian Collins, Neil Hardman, Alastair C. Hume, Alan Knox, Mike Jackson, Amy Krause, Simon Laws, James Magowan, Norman W. Paton, Dave Pearson, Tom Sugden, Paul Watson, Martin Westhead. The design and implementation of Grid database services in OGSA-DAI. Concurrency and Computation: Practice and Experience, 17(2–4):357–376, 2005. [186] McClean S., Scotney B., and Greer K. Conceptual Clustering of Heterogeneous Distributed Databases. In Workshop on Distributed and Parallel Knowledge Discovery, Boston, MA, 2000. [187] Distributed Data Mining in Astrophysics and Astronomy. http://www.cs.queensu.ca/home/mcconell/DDMAstro.html.

[188] Merugu S. and Ghosh J. Privacy-Preserving Distributed Clustering Using Generative Models. In Proceedings of the IEEE Conference on Data Mining (ICDM), pages 211–218, November 2003. [189] C. Merz and M. Pazzani. A principal components approach to combining regression estimates. Machine Learning, 36:9–32, 1999. [190] Christopher J. Merz and Michael J. Pazzani. A principal components approach to combining regression estimates. Machine Learning, 36(1–2):9–32, 1999. [191] Michael Beynon, Renato Ferreira, Tahsin M. Kurc, Alan Sussman, and Joel H. Saltz. DataCutter: Middleware for filtering very large scientific datasets on archival storage systems. In IEEE Symposium on Mass Storage Systems, pages 119–134, 2000. [192] Michael Russell, Gabrielle Allen, Ian Foster, Ed Seidel, Jason Novotny, John Shalf, Gregor von Laszewski and Greg Daues. The Astrophysics Simulation Collaboratory: A Science Portal Enabling Community Software Development. In Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, pages 207–215, San Francisco, CA, 7–9 August 2001. [193] M. Nedim Alpdemir, Arijit Mukherjee, Norman W. Paton, Paul Watson, Alvaro A.A. Fernandes, Anastasios Gounaris, and Jim Smith. OGSA-DQP: A service-based distributed query processor for the Grid. In Simon J. Cox, editor, Proceedings of UK e-Science All Hands Meeting, Nottingham. EPSRC, September 2003. [194] Mobile medical monitoring. http://www.equator.ac.uk/index.php/articles/634.

[195] NASA's Information Power Grid. http://www.ipg.nasa.gov/. [196] NASA's Data Mining Resources for Space Science. http://rings.gsfc.nasa.gov/~borne/nvo_datamining.html. [197] World Data Center for Meteorology. http://www.ncdc.noaa.gov/oa/wmo/wdcamet.html. [198] F. Neubauer, A. Hoheisel, and J. Geiler. Workflow-based Grid applications. Future Generation Computer Systems, 22(1–2):6–15, 2006. [199] Nielsen R. Context Vectors: General Purpose Approximate Meaning Representations Self-organized From Raw Data. In Computational Intelligence: Imitating Life, pages 43–56. IEEE Press, 1994. [200] Nikolaos Giannadakis, Anthony Rowe, Moustafa Ghanem and Yike Guo. InfoGrid: Providing Information Integration for Knowledge Discovery. Information Sciences, 155:199–226, 2003. [201] Nithya Vijayakumar, Ying Liu, and Beth Plale. Calder: Enabling Grid Access to Data Streams. In IEEE High Performance Distributed Computing (HPDC), Raleigh, North Carolina, July 2005. [202] Nithya Vijayakumar, Ying Liu, and Beth Plale. Calder Query Grid Service: Insights and Experimental Evaluation. In IEEE Cluster Computing and Grid (CCGrid), May 2006. [203] US National Virtual Observatory. http://www.us-vo.org/. [204] O. Wolfson, S. Chamberlain, P. Sistla, B. Xu, J. Zhou. DOMINO: Databases fOr MovINg Objects tracking. In Proceedings of ACM-SIGMOD 1999, International Conference on Management of Data, Philadelphia, PA, June 1999. [205] OGSA-DAI Project. http://www.ogsadai.org.uk/. [206] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous queries over distributed data streams. In ACM SIGMOD'03 International Conference on Management of Data, 2003. [207] David Opitz and Richard Maclin. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11:169–198, 1999. [208] P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A Model for Sequence Databases. In Proceedings of the IEEE International Conference on Data Engineering (ICDE), pages 232–239, 1995. [209] B. Park. 
Knowledge Discovery from Heterogeneous Data Streams Using Fourier Spectrum of Decision Trees. PhD thesis, Washington State University, 2001. [210] B. Park, H. Kargupta, E. Johnson, E. Sanseverino, D. Hershberger, and L. Silvestre. Distributed, collaborative data analysis from heterogeneous sites using a scalable evolutionary technique. Applied Intelligence, 16(1):19–42, 2002.

[211] B. Park, R. Ayyagari, and H. Kargupta. A Fourier analysis-based approach to learning decision trees in a distributed environment. In Proceedings of the First SIAM International Conference on Data Mining, Chicago, IL, 2001. [212] B. H. Park and H. Kargupta. Constructing simpler decision trees from ensemble models using Fourier analysis. In Proceedings of the 7th Workshop on Research Issues in Data Mining and Knowledge Discovery, ACM SIGMOD, pages 18–23, 2002. [213] B. H. Park and H. Kargupta. Distributed data mining: Algorithms, systems, and applications. In Data Mining Handbook. To be published, 2002. [214] Park B. and Kargupta H. Distributed Data Mining: Algorithms, Systems, and Applications. In The Handbook of Data Mining, edited by N. Ye, Lawrence Erlbaum Associates, pages 341–358, 2003. [215] Park B., Kargupta H., Johnson E., Sanseverino E., Hershberger D., and Silvestre L. Distributed, Collaborative Data Analysis From Heterogeneous Sites Using a Scalable Evolutionary Technique. Applied Intelligence, 16(1), 2002. [216] The Protein Data Bank (PDB). http://www.rcsb.org/pdb/Welcome.do. [217] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble method for neural networks. In R. J. Mammone, editor, Neural Networks for Speech and Image Processing. Chapman-Hall, 1993. [218] Peter Brezany, Ivan Janciak, Andrzej Goscinski, and A Min Tjoa. The Development of a Wisdom Autonomic Grid. In Knowledge Grid and Grid Intelligence, Beijing, September 26, 2004. [219] Peter Brezany, A Min Tjoa, Helmut Wanek, Alexander Woehrer. Mediators in the Architecture of Grid Information Systems. In Proceedings of the Conference on Parallel Processing and Applied Mathematics, Czestochowa, Poland, September 7–10, 2003. [220] Peter Brezany, Juergen Hofer, A Min Tjoa, Alexander Woehrer. GridMiner: An Infrastructure for Data Mining on Computational Grids. 
Accepted for the APAC Conference and Exhibition on Advanced Computing, Grid Applications and eResearch, Queensland, Australia, 29 September – 2 October 2003. [221] Beth Plale. Framework for Bringing Data Streams to the Grid. Scientific Programming, volume 12, pages 213–223. IOS Press, Amsterdam, 2004. [222] Beth Plale. Using Global Snapshots to Access Data Streams on the Grid. In 2nd European Across Grids Conference (AxGrids 2004), Lecture Notes in Computer Science, volume 3165. Springer-Verlag, 2004. [223] Beth Plale and Karsten Schwan. Dynamic Querying of Streaming Data with the dQUOB System. IEEE Transactions on Parallel and Distributed Systems, volume 14, pages 422–432, April 2003.

[224] Palomar Observatory Sky Survey. http://www.astro.caltech.edu/observatories/palomar/public/index.html. [225] Particle Physics Data Grid. http://www.ppdg.net/. [226] Provost F. Distributed Data Mining: Scaling Up and Beyond. In Advances in Distributed and Parallel Knowledge Discovery, edited by H. Kargupta and P. Chan, MIT/AAAI Press, pages 3–27, 2000. [227] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986. [228] R. Jacob, C. Schafer, I. Foster, M. Tobis, J. Anderson. Computational Design and Performance of the Fast Ocean Atmosphere Model, Version One. In International Conference on Computational Science, 2001. [229] Rahul Ramachandran, John Rushing, Helen Conover, Sara J. Graves, Ken Keiser. Flexible Framework for Mining Meteorological Data. In American Meteorological Society's (AMS) 19th International Conference on Interactive Information Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology, Long Beach, CA, February 9–13, 2003. [230] Richard Kuntschke, Bernhard Stegmaier, Alfons Kemper, and Angelika Reiser. StreamGlobe: Processing and Sharing Data Streams in Grid-based P2P Infrastructures. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB 2005), pages 1259–1262, Trondheim, Norway, August 30 – September 2, 2005. [231] Rinat Khoussainov, Xin Zuo and Nicholas Kushmerick. Grid-enabled Weka: A Toolkit for Machine Learning on the Grid. ERCIM News, No. 59, October 2004. [232] J.J. Rodríguez, L.I. Kuncheva, and C.J. Alonso. Rotation Forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006. [233] J. Rushing, R. Ramachandran, U. Nair, S. Graves, R. Welch, and H. Lin. ADaM: a data mining toolkit for scientists and engineers. Computers and Geosciences, 31:607–618, June 2005. [234] Parthasarathy S. and Ogihara M. Clustering Distributed Homogeneous Datasets. 
In Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery, volume 1910 of Springer-Verlag Lecture Notes in Computer Science, pages 566–574, 2000. [235] S. Babu, L. Subramanian, and J. Widom. A Data Stream Management System for Network Traffic Management. In Proceedings of the Workshop on Network-Related Data Management (NRDM 2001), May 2001. [236] S. Datta, C. Giannella, H. Kargupta. K-Means Clustering over a Large, Dynamic Network. In Proceedings of the SIAM International Data Mining Conference, 2006. Accepted for publication.

[237] Samatova N., Ostrouchov G., Geist A., and Melechko A. RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets. Distributed and Parallel Databases, 11(2):157–180, 2002. [238] Sandi Redman. Mining on the Grid with ADaM. In Southeastern Universities Research Association (SURA) Targeted Communities Workshop Focus Study, Atlanta, GA, January 2005. [239] Sara J. Graves, Helen Conover, Ken Keiser, Rahul Ramachandran, Sandi Redman, John Rushing, Steve Tanner. Mining and Modeling in the Linked Environments for Atmospheric Discovery (LEAD). In Huntsville Simulation Conference, Huntsville, AL, October 19, 2004. [240] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999. [241] Scientific Data Mining, Integration and Visualization Workshop. http://www.anc.ed.ac.uk/sdmiv/.

[242] Sloan Digital Sky Survey. http://www.sdss.org. [243] Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. The design and implementation of a sequence database system. In The VLDB Journal, pages 99–110, 1996. [244] Shafer J., Agrawal R., and Mehta M. SPRINT: A Scalable Parallel Classifier for Data Mining. In Proceedings of the 22nd International Conference on Very Large Databases (VLDB), pages 544–555, 1996. [245] Shyu M.-L., Chen S.-C., Sarinnapakorn K., and Chang L. A Novel Anomaly Detection Scheme Based on a Principal Component Classifier. In Proceedings of the Foundations and New Directions of Data Mining Workshop, in Conjunction with the Third IEEE International Conference on Data Mining (ICDM), pages 172–179, 2003. [246] The Spitfire Project. http://edg-wp2.web.cern.ch/edg-wp2/spitfire/index.html. [247] S. Stolfo et al. JAM: Java agents for meta-learning over distributed databases. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 74–81, Menlo Park, CA, 1997. AAAI Press. [248] S. J. Stolfo, A. L. Prodromidis, S. Tselepis, W. Lee, D. W. Fan, and P. K. Chan. JAM: Java agents for meta-learning over distributed databases. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 74–81, Newport Beach, CA, August 1997. AAAI Press. [249] W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, 2001.


[250] Strehl A. and Ghosh J. Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research, 3:583–617, 2002. [251] The Swiss-Prot protein knowledge base. http://www.expasy.org/sprot/. [252] Domenico Talia. Knowledge discovery services and tools on grids. In Foundations of Intelligent Systems - ISMIS 2003, volume 2871 of LNCS, pages 14–23. Springer-Verlag, October 2003. [253] Domenico Talia and Paolo Trunfio. Toward a synergy between P2P and grids. IEEE Internet Computing, 7(4):94–96, 2003. [254] Domenico Talia and Paolo Trunfio. Adapting a pure decentralized peer-to-peer protocol for grid services invocation. Parallel Processing Letters (PPL), 15(1–2):67–84, 2005. [255] Ian Taylor, Ian Wang, Matthew Shields, and Shalil Majithia. Distributed computing with Triana on the Grid. Concurrency and Computation: Practice and Experience, 17(118), 2005. [256] Telemedicine. www.escience.cam.ac.uk/projects/telemed. [257] TeraGrid. http://www.teragrid.org/about/. [258] The TeraGrid Primer, September 2002. http://www.teragrid.org/about/. [259] Tho Manh Nguyen, A Min Tjoa, Guenter Kickinger, Peter Brezany. Towards Service Collaboration Model in Grid-based Zero Latency Data Stream Warehouse (GZLDSWH). In 2004 IEEE International Conference on Services Computing (SCC'04), pages 357–365, 2004. [260] The Triana project. http://www.trianacode.org/. [261] K. Tumer and J. Ghosh. Robust order statistics based ensembles for distributed data mining. In H. Kargupta and P. K. Chan, editors, Advances in Distributed and Parallel Knowledge Discovery, pages 185–210. MIT Press, 2000. [262] Tumer K. and Ghosh J. Robust Order Statistics Based Ensembles for Distributed Data Mining. In Kargupta H. and Chan P., editors, Advances in Distributed and Parallel Knowledge Discovery, pages 185–210. MIT/AAAI Press, 2000. [263] US National Virtual Observatory. http://us-vo.org/. [264] V. Curcin, M. Ghanem, Y. Guo, M. Kohler, A. Rowe, J. Syed, P. Wendel. 
Discovery Net: Towards a Grid of Knowledge Discovery. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Alberta, Canada, July 23–26, 2002. [265] Alexander S. Szalay, Jim Gray, and Jan Vandenberg. Petabyte Scale Data Mining: Dream or Reality? In SPIE Astronomical Telescopes and Instrumentation, 22–28 August 2002.

[266] VivoMetrics LifeShirt garment. http://www.vivometrics.com. [267] Gregor von Laszewski and Ian Foster. Grid Infrastructure to Support Science Portals for Large Scale Instruments. In Proceedings of the Workshop on Distributed Computing on the Web (DCW), pages 1–16. University of Rostock, Germany, 21–23 June 1999. [268] W. Allcock. GridFTP Protocol Specification. Global Grid Forum Recommendation GFD.20, March 2003. [269] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu, I. Raicu, I. Foster. The Globus Striped GridFTP Framework and Server. In Proceedings of Super Computing 2005 (SC'05), November 2005. [270] W. Hoschek and G. McCance. Grid Enabled Relational Database Middleware. In Global Grid Forum (GGF), 2001. [271] P. Watson. Databases and the Grid. Technical Report UKeS-2002-01, UK e-Science Programme Technical Report Series, December 2001. [272] Wearable vest. http://www.smartextiles.info. [273] Weka Toolkit. http://www.cs.waikato.ac.nz/ml/. [274] Weka4WS. http://grid.deis.unical.it/weka4ws. [275] Wolfgang Hoschek, Javier Jaen-Martinez, Asad Samar, Heinz Stockinger, Kurt Stockinger. Data Management in an International Data Grid Project. In IEEE/ACM International Workshop on Grid Computing (Grid 2000), December 2000. [276] D. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992. [277] Ying Liu, Beth Plale, and Nithya Vijayakumar. Distributed Streaming Query Planner in Calder System. In IEEE High Performance Distributed Computing (HPDC), Raleigh, North Carolina, July 2005. [278] Zaki M. Parallel and Distributed Association Mining: A Survey. IEEE Concurrency, 7(4):14–25, 1999. [279] Zaki M. Parallel and Distributed Data Mining: An Introduction. In Large-Scale Parallel Data Mining (Lecture Notes in Artificial Intelligence 1759), edited by Zaki M. and Ho C.-T., Springer-Verlag, Berlin, pages 1–23, 2000.

