Ieee 2011 Java & .Net Projects at SBGC (Chennai, Trichy, India, Tamilnadu

SEABIRDS GROUP & CO
HEAD OFFICE ADDRESS
4th FLOOR SURYA COMPLEX,
SINGARATHOPE BUS STOP,,OLD MADURAI ROAD,
TRICHY- 620002
0431-4012303, contact@sbgc.in
09003012150
www.sbgc.in/projects
http://ieee2009projects.sbgc.in/
BRANCH OFFICE
NEAR TO AMBAL MATRICULATION SCHOOL
No:23,2nd Main Road, Jagannatha Nagar
Arumbakkam, CHENNAI-106
09944361169
1. Multirelational k-Anonymity --- dotnet/java
Copyright 2009 Sea Birds Group & Co. All rights reserved,
4th Floor Surya complex, Old Madurai Road, Trichy-620002. mailto:-contact@sbgc.in
Nergiz, Mehmet Ercan Clifton, Christopher Nergiz, Ahmet Erhan
Sabanci University, Istanbul, Purdue University, West Lafayette;
This paper appears in: Knowledge and Data Engineering, IEEE Transactions on
Publication Date: Aug. 2009
Volume: 21, Issue: 8
On page(s): 1104-1117
ISSN: 1041-4347
Digital Object Identifier: 10.1109/TKDE.2008.210
First Published: 2008-10-17
Current Version Published: 2009-06-23
Abstract
k-Anonymity protects privacy by ensuring that data cannot be linked to a single
individual. In a k-anonymous data set, any identifying information occurs in at least k
tuples. Much research has been done to modify a single-table data set to satisfy
anonymity constraints. This paper extends the definitions of k-anonymity to multiple
relations and shows that previously proposed methodologies either fail to protect privacy
or overly reduce the utility of the data in a multiple relation setting. We also propose two
new clustering algorithms to achieve multirelational anonymity. Experiments show the
effectiveness of the approach in terms of utility and efficiency.
2.Olex: Effective Rule Learning for Text Categorization – java /dotnet

Rullo, Pasquale Policicchio, Veronica Lucia Cumbo, Chiara Iiritano, Salvatore
University of Calabria, Rende;
On page(s): 1118-1132
ISSN: 1041-4347
Abstract
This paper describes Olex, a novel method for the automatic induction of rule-based text
classifiers. Olex supports a hypothesis language of the form "if T_{1} or cdots or T_{n}
occurs in document d, and none of T_{n + 1}, ldots T_{n + m} occurs in d, then classify
d under category c,” where each T_{i} is a conjunction of terms. The proposed method is
simple and elegant. Despite this, the results of a systematic experimentation performed on
the Reuters-21578, the Ohsumed, and the ODP data collections show that Olex provides
classifiers that are accurate, compact, and comprehensible. A comparative analysis
conducted against some of the most well-known learning algorithms (namely, Naive
Bayes, Ripper, C4.5, SVM, and Linear Logistic Regression) demonstrates that it is more
than competitive in terms of both predictive accuracy and efficiency.
3..Ranking and Suggesting Popular Items -java /dotnet

Vojnović , Milan Cruise, James Gunawardena, Dinan Marbach, Peter
Microsoft Research Ltd., Cambridge;
On page(s): 1133-1146
ISSN: 1041-4347
Abstract
We consider the problem of ranking the popularity of items and suggesting popular items
based on user feedback. User feedback is obtained by iteratively presenting a set of
suggested items, and users selecting items based on their own preferences either from this
suggestion set or from the set of all possible items. The goal is to quickly learn the true
popularity ranking of items (unbiased by the made suggestions), and suggest true popular
items. The difficulty is that making suggestions to users can reinforce popularity of some
items and distort the resulting item ranking. The described problem of ranking and
suggesting items arises in diverse applications including search query suggestions and tag
suggestions for social tagging systems. We propose and study several algorithms for
ranking and suggesting popular items, provide analytical results on their performance,
and present numerical results obtained using the inferred popularity of tags from a
month-long crawl of a popular social bookmarking service. Our results suggest that
lightweight, randomized update rules that require no special configuration parameters
provide good performance.
4.Similarity-Profiled Temporal Association Mining –java/dotnet

Yoo, Jin Soung Shekhar, Shashi
Indiana University-Purdue University, Fort Wayne;
On page(s): 1147-1161
ISSN: 1041-4347
Abstract
Given a time stamped transaction database and a user-defined reference sequence of
interest over time, similarity-profiled temporal association mining discovers all
associated item sets whose prevalence variations over time are similar to the reference
sequence. The similar temporal association patterns can reveal interesting relationships of
data items which co-occur with a particular event over time. Most works in temporal
association mining have focused on capturing special temporal regulation patterns such
as cyclic patterns and calendar scheme-based patterns. However, our model is flexible in
representing interesting temporal patterns using a user-defined reference sequence. The
dissimilarity degree of the sequence of support values of an item set to the reference
sequence is used to capture how well its temporal prevalence variation matches the
reference pattern. By exploiting interesting properties such as an envelope of support
time sequence and a lower bounding distance for early pruning candidate item sets, we
develop an algorithm for effectively mining similarity-profiled temporal association
patterns. We prove the algorithm is correct and complete in the mining results and
provide the computational analysis. Experimental results on real data as well as synthetic
data show that the proposed algorithm is more efficient than a sequential method using a
traditional support-pruning scheme.
5.RiMOM: A Dynamic Multistrategy Ontology Alignment Framework –java /dotnet
Li, Juanzi Tang, Jie Li, Yi Luo, Qiong

Tsinghua University, Beijing;
On page(s): 1218-1232
ISSN: 1041-4347
Abstract
Ontology alignment identifies semantically matching entities in different ontologies.
Various ontology alignment strategies have been proposed; however, few systems have
explored how to automatically combine multiple strategies to improve the matching
effectiveness. This paper presents a dynamic multistrategy ontology alignment
framework, named RiMOM. The key insight in this framework is that similarity
characteristics between ontologies may vary widely. We propose a systematic approach
to quantitatively estimate the similarity characteristics for each alignment task and
propose a strategy selection method to automatically combine the matching strategies
based on two estimated factors. In the approach, we consider both textual and structural
characteristics of ontologies. With RiMOM, we participated in the 2006 and 2007
campaigns of the Ontology Alignment Evaluation Initiative (OAEI). Our system is
among the top three performers in benchmark data sets.
6.IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL.

X, NO. X, XXXX 2009
Role Engineering via Prioritized -java
Subset Enumeration
Jaideep Vaidya, Member, IEEE, Vijayalakshmi Atluri, Senior Member, IEEE,
Janice Warner, Student Member, IEEE, and Qi Guo, Student Member, IEEE
Abstract—Today, role-based access control (RBAC) has become a well-accepted

paradigm for implementing access control because of its convenience and ease of
administration. However, in order to realize the full benefits of the RBAC paradigm, one
must first define the roles accurately. This task of defining roles and associating
permissions with them, also known as role engineering, is typically accomplished either
in a top-down or in a bottom-up manner. Under the top-down approach, a careful analysis
of the business processes is done to first define job functions and then to specify
appropriate roles from them. While this approach can help in defining roles more
accurately, it is tedious and time consuming since it requires that the semantics of the
business processes be well understood. Moreover, it ignores existing permissions within
an organization and does not utilize them. On the other hand, under the bottom-up
approach, existing permissions are used to derive roles from them. As a result, it may
help automate the process of role definition. In this paper, we present an unsupervised
approach, called RoleMiner, for mining roles from existing user-permission assignments.
Since a role, when semantics are unavailable, is nothing but a set of permissions, the task
of role mining is essentially that of clustering users having the same (or similar)
permissions. However, unlike the traditional applications of data mining that ideally
require identification of nonoverlapping clusters, roles will have overlapping permissions
and thus permission sets that define roles should be allowed to overlap. It is this
distinction from traditional clustering that makes the problem of role mining nontrivial.
Our experiments with real and simulated data sets indicate that our role mining process is
quite accurate and efficient. Since our role mining approach is based on subset
enumeration, it is fairly robust to reasonable levels of noise.
7. On the Effect of Location Uncertainty in Spatial Querying –java

Elias Frentzos, Kostas Gratsias, and Yannis Theodoridis, Member, IEEE
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21,

NO. 3, MARCH 2009
Abstract—An emerging topic in the field of spatial data management is the handling of
location uncertainty of spatial objects, mainly due to inaccurate measurements. The
literature on location uncertainty so far has focused on modifying traditional spatial
search algorithms in order to handle the impact of objects’ location uncertainty on the
query results. In this paper, we present the first, to the best of our knowledge, theoretical
analysis that estimates the average number of false hits introduced in the results of
rectangular range queries in the case of data points uniformly distributed in 2D space.
Then, we relax the original distribution assumptions showing how to deal with arbitrarily
distributed data points and more realistic location uncertainty distributions. The accuracy
of the results of our analytical approach is demonstrated through an extensive
experimental study using various synthetic and real data sets. Our proposal can be
directly employed in spatial database systems in order to provide users with the accuracy
of spatial query results based only on known data set and query parameters.
Index Terms—Spatial databases, GIS
8. Design and Evaluation of the iMed Intelligent Medical Search Engine –dotnet
Gang LuoIBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532,
USA
IEEE International Conference on Data Engineering-
1084-4627/09 $25.00 © 2009 IEEE
Abstract — Searching for medical information on the Web is popular and important.
However, medical search has its own unique requirements that are poorly handled by
existing medical Web search engines. This paper presents iMed, the first intelligent
medical Web search engine that extensively uses medical knowledge and questionnaire to
facilitate ordinary Internet users to search for medical information. iMed introduces and
extends expert system technology into the search engine domain. It uses several key
techniques to improve its usability and search result quality. First, since ordinary users
often cannot clearly describe their situations due to lack of medical background, iMed
uses a questionnaire-based query interface to guide searchers to provide the most
important information about their situations. Second, iMed uses medical knowledge to
automatically form multiple queries from a searcher’ answers to the questions. Using
these queries to perform search can significantly improve the quality of search results.
Third, iMed structures all the search results into a multilevel hierarchy with explicitly
marked medical meanings to facilitate searchers’ viewing. Lastly, iMed suggests
diversified,related medical phrases at each level of the search result hierarchy. These
medical phrases are extracted from the MeSH ontology and can help searchers quickly
digest search results and refine their inputs. We evaluated iMed under a wide range of
medical scenarios. The results show that iMed is effective and efficient for medical
search.
9. SPOT Databases: Efficient Consistency Checking and Optimistic Selection in

Probabilistic Spatial Databases –java /dotnet
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21,
NO. 1, JANUARY 2009
ABSTRACT
Spatial PrObabilistic Temporal (SPOT) databases are a paradigm for reasoning
with probabilistic statements about where a vehicle may be now or in the future. They
express statements of the form “Object O is in spatial region R at some time t with some
probability in the interval [L, U].” Past work on SPOT databases has developed selection
operators based on selecting SPOT atoms that are entailed by the SPOT database—we
call this “cautious” selection. In this paper, we study several problems. First, we note that
the runtime of consistency checking and cautious selection algorithms in past work is
influenced greatly by the granularity of the underlying Cartesian space. In this paper, we
first introduce the notion of “optimistic” selection, where we are interested in returning
all SPOT atoms in a database that are consistent with respect to a query, rather than
having an entailment relationship. We then develop an approach to scaling SPOT
databases that has three main contributions: 1) We develop methods to eliminate
variables from the linear programs used in past work, thus greatly reducing the size of the
linear programs used—the resulting advances apply to consistency checking, optimistic
selection, and cautious selection. 2) We develop a host of theorems to show how we can
prune the search space when we are interested in optimistic selection. 3) We use the
above contributions to build an efficient index to execute optimistic selection queries
over SPOT databases. Our approach is superior to past work in two major respects: First,
it makes fewer assumptions than all past works. Second, our experiments, which are
based on real-world data about ship movements, show that our algorithms are much more
efficient.
10 .Robust blind watermarking mechanism for point sampled geometry –java
Robust Blind Watermarking of Point-Sampled Geometry

Agarwal, P.; Prabhakaran, B.
Information Forensics and Security, IEEE Transactions on
Volume 4, Issue 1, Date: March 2009, Pages: 36-48
Digital Object Identifier 10.1109/TIFS.2008.2011081
ABSTRACT
Watermarking schemes for copyright protection of point cloud representation of 3D
models operate only on the geometric data, and are also applicable to mesh based
representations of 3D models, defined using geometry and topological information. For
building such generic copyright schemes for 3D models, this paper presents a robust
spatial blind watermarking mechanism for 3D point sampled geometry. To find the order
in which points are to be encoded/decoded, a clustering approach is proposed. The points
are divided into clusters, and ordering is achieved using inter-cluster and intra-cluster
ordering. Inter-cluster ordering achieves local ordering of points, whereas intra-cluster
ordering does it globally. Once ordered, a sequence of clusters is chosen based on nearest
neighbor heuristic. An extension of quantization index of bit encoding scheme is
proposed, and used to encode and decode inside the clusters. The encoding mechanism
makes the technique robust against uniform affine transformations (rotation, scaling, and
transformation), reordering attack and topology altering (e.g. retriangulation) attack when
applied to 3D meshes as well. Replication of watermark provides robustness against
localized noise addition, cropping, simplification and global noise addition attacks.
Security of the scheme is analyzed, and the time complexity is estimated as O (n log n),
where n is the number of 3D points. Theoretical bounds on hiding capacity are estimated,
and experiments show that a high hiding capacity is high, with embedding rate greater
than 3 bits/point. The bit encoding method reduces the distortions and makes the
watermark imperceptible, indicated by a signal to noise ratio greater than 100 dB.
11. A Hypothesis Testing Approach to Semifragile Watermark-Based

-Authentication –java
Chuhong Fei; Kwong, R.H.; Kundur, D.
Volume 4, Issue 2, Date: June 2009, Pages: 179-192
This paper studies the problem of achieving watermark semi-fragility in multimedia

authentication through a composite hypothesis testing approach. The embedding of a
semi-fragile watermark serves to distinguish legitimate distortions caused by signal
processing manipulations from illegitimate ones caused by malicious tampering. This
leads us to consider authentication verification as a composite hypothesis testing problem
with the watermark as a priori information. Based on the hypothesis testing model, we
investigate the best embedding strategy which assists the watermark verifier to make
correct decisions. Our results show that the quantization; based watermarking method is
more appropriate than the spread spectrum method to achieve the best tradeoff between
two error probabilities. This observation is confirmed by a case study of additive
Gaussian white noise channel with Gaussian source using two figures of merit: relative
entropy of the two hypothesis distributions and the receiver operating characteristic.
Finally, we focus on certain common signal processing distortions such as JPEG
compression and image filtering, and investigate the best test statistic and optimal
decision regions to distinguish legitimate and illegitimate distortions. The results of the
paper show that our approach provides insights for authentication watermarking and
allows better control of semi-fragility in specific applications.
12 Spread-Spectrum Watermarking Security –java./dotnet

Perez-Freire, L.; Perez-Gonzalez, F.
Abstract
This paper presents both theoretical and practical analyses of the security offered by
watermarking and data hiding methods based on spread spectrum. In this context,
security is understood as the difficulty of estimating the secret parameters of the
embedding function based on the observation of watermarked signals. On the theoretical
side, the security is quantified from an information-theoretic point of view by means of
the equivocation about the secret parameters. The main results reveal fundamental limits
and bounds on security and provide insight into other properties, such as the impact of the
embedding parameters, and the tradeoff between robustness and security. On the practical
side, workable estimators of the secret parameters are proposed and theoretically
analyzed for a variety of scenarios, providing a comparison with previous approaches,
and showing that the security of many schemes used in practice can be fairly low.
13.A Game Theoretical Framework on Intrusion Detection in Heterogeneous

Networks –java
Lin Chen; Leneutre, J.
Abstract
Due to the dynamic, distributed, and heterogeneous nature of today's networks, intrusion
detection systems (IDSs) have become a necessary addition to the security infrastructure
and are widely deployed as a complementary line of defense to classical security
approaches. In this paper, we address the intrusion detection problem in heterogeneous
networks consisting of nodes with different noncorrelated security assets. In our study,
two crucial questions are: What are the expected behaviors of rational attackers? What is
the optimal strategy of the defenders (IDSs)? We answer the questions by formulating the
network intrusion detection as a noncooperative game and performing an in-depth
analysis on the Nash equilibrium and the engineering implications behind. Based on our
game theoretical analysis, we derive the expected behaviors of rational attackers, the
minimum monitor resource requirement, and the optimal strategy of the defenders. We
then provide guidelines for IDS design and deployment. We also show how our game
theoretical framework can be applied to configure the intrusion detection strategies in
realistic scenarios via a case study. Finally, we evaluate the proposed game theoretical
framework via simulations. The simulation results show both the correctness of the
analytical results and the effectiveness of the proposed guidelines.
14 Unseen Visible Watermarking: A Novel Methodology for Auxiliary Information

Delivery via Visual Contents –java /dotnet
Chun-Hsiang Huang; Shang-Chih Chuang; Yen-Lin Huang; Ja-Ling Wu
Abstract
A novel data hiding scheme, denoted as unseen visible watermarking (UVW), is
proposed. In UVW schemes, hidden information can be embedded covertly and then
directly extracted using the human visual system as long as appropriate operations (e.g.,
gamma correction provided by almost all display devices or changes in viewing angles
relative to LCD monitors) are performed. UVW eliminates the requirement of invisible
watermarking that specific watermark extractors must be deployed to the receiving end in
advance, and it can be integrated with 2-D barcodes to transmit machine-readable
information that conventional visible watermarking schemes fail to deliver. We also
adopt visual cryptographic techniques to guard the security of hidden information and, at
the same time, increase the practical value of visual cryptography. Since UVW can be
alternatively viewed as a mechanism for visualizing patterns hidden with least-
significant-bit embedding, its security against statistical steganalysis is proved by
empirical tests. Limitations and other potential extensions of UVW are also addressed.
15 A Fast Search Algorithm for a Large Fuzzy Database –java/dotnet

Feng Hao; Daugman, J.; Zielinski, P.
Abstract
In this paper, we propose a fast search algorithm for a large fuzzy database that stores iris
codes or data with a similar binary structure. The fuzzy nature of iris codes and their high
dimensionality render many modern search algorithms, mainly relying on sorting and
hashing, inadequate. The algorithm that is used in all current public deployments of iris
recognition is based on a brute force exhaustive search through a database of iris codes,
looking for a match that is close enough. Our new technique, Beacon Guided Search
(BGS), tackles this problem by dispersing a multitude of ldquobeaconsrdquo in the
search space. Despite random bit errors, iris codes from the same eye are more likely to
collide with the same beacons than those from different eyes. By counting the number of
collisions, BGS shrinks the search range dramatically with a negligible loss of precision.
We evaluate this technique using 632,500 iris codes enrolled in the United Arab Emirates
(UAE) border control system, showing a substantial improvement in search speed with a
negligible loss of accuracy. In addition, we demonstrate that the empirical results match
theoretical predictions.
16 Digital Image Forensics via Intrinsic Fingerprints-java
Swaminathan, A.; Min Wu; Liu, K.J.R.
Abstract
Digital imaging has experienced tremendous growth in recent decades, and digital camera
images have been used in a growing number of applications. With such increasing
popularity and the availability of low-cost image editing software, the integrity of digital
image content can no longer be taken for granted. This paper introduces a new
methodology for the forensic analysis of digital camera images. The proposed method is
based on the observation that many processing operations, both inside and outside
acquisition devices, leave distinct intrinsic traces on digital images, and these intrinsic
fingerprints can be identified and employed to verify the integrity of digital data. The
intrinsic fingerprints of the various in-camera processing operations can be estimated
through a detailed imaging model and its component analysis. Further processing applied
to the camera captured image is modelled as a manipulation filter, for which a blind
deconvolution technique is applied to obtain a linear time-invariant approximation and to
estimate the intrinsic fingerprints associated with these postcamera operations. The
absence of camera-imposed fingerprints from a test image indicates that the test image is
not a camera output and is possibly generated by other image production processes. Any
change or inconsistencies among the estimated camera-imposed fingerprints, or the
presence of new types of fingerprints suggest that the image has undergone some kind of
processing after the initial capture, such as tampering or steganographic embedding.
Through analysis and extensive experimental studies, this paper demonstrates the
effectiveness of the proposed framework for nonintrusive digital image forensics.
17 Watermarking Robustness Evaluation Based on Perceptual Quality via Genetic

Algorithms
Boato, G.; Conotter, V.; De Natale, F.G.B.; Fontanari, C.
Abstract
This paper presents a novel and flexible benchmarking tool based on genetic algorithms
(GA) and designed to assess the robustness of any digital image watermarking system.
The main idea is to evaluate robustness in terms of perceptual quality, measured by
weighted peak signal-to-noise ratio. Through a stochastic approach, we optimize this
quality metric, by finding the minimal degradation that needs to be introduced in a
marked image in order to remove the embedded watermark. Given a set of attacks,
chosen according to the considered application scenario, GA support the optimization of
the parameters to be assigned to each processing operation, in order to obtain an
unmarked image with perceptual quality as high as possible. Extensive experimental
results demonstrate the effectiveness of the proposed evaluation tool.
18 Low-Complexity Iris Coding and Recognition Based on Directionlets –java

Velisavljevic, V
Volume PP, Issue 99, Date: 0, Pages: 1-1
Abstract
A novel iris recognition method is presented. In the method, the iris features are extracted
using the oriented separable wavelet transforms (directionlets) and they are compared in
terms of a weighted Hamming distance. The feature extraction and comparison are shift-,
size- and rotation-invariant to the location of iris in the acquired image. The generated
iris code is binary, whose length is fixed (and therefore commensurable), independent of
the iris image, and comparatively short. The novel method shows a good performance
when applied to a large database of irises and provides reliable identification and
verification. At the same time, it preserves conceptual and computational simplicity and
allows for a quick analysis and comparison of iris samples.
19 Lanczos Vectors versus Singular Vectors for Effective Dimension Reduction –

java/dotnet
Chen, Jie Saad, Yousef
University of Minnesota, Minneapolis;
On page(s): 1091-1103
ISSN: 1041-4347
Abstract
This paper takes an in-depth look at a technique for computing filtered matrix-vector
(mat-vec) products which are required in many data analysis applications. In these
applications, the data matrix is multiplied by a vector and we wish to perform this
product accurately in the space spanned by a few of the major singular vectors of the
matrix. We examine the use of the Lanczos algorithm for this purpose. The goal of the
method is identical with that of the truncated singular value decomposition (SVD),
namely to preserve the quality of the resulting mat-vec product in the major singular
directions of the matrix. The Lanczos-based approach achieves this goal by using a small
number of Lanczos vectors, but it does not explicitly compute singular values/vectors of
the matrix. The main advantage of the Lanczos-based technique is its low cost when
compared with that of the truncated SVD. This advantage comes without sacrificing
accuracy. The effectiveness of this approach is demonstrated on a few sample
applications requiring dimension reduction, including information retrieval and face
recognition. The proposed technique can be applied as a replacement to the truncated
SVD technique whenever the problem can be formulated as a filtered mat-vec
multiplication.
20 Optimal-Location-Selection Query Processing in Spatial Databases –java

Gao, Yunjun Zheng, Baihua Chen, Gencai Li, Qing
Singapore Management University, Singapore;
On page(s): 1162-1177
ISSN: 1041-4347
Abstract
This paper introduces and solves a novel type of spatial queries, namely, Optimal-
Location-Selection (OLS) search, which has many applications in real life. Given a data
object set D_A, a target object set D_B, a spatial region R, and a critical distance d_c in a
multidimensional space, an OLS query retrieves those target objects in D_B that are
outside R but have maximal optimality. Here, the optimality of a target object b in D_B
located outside R is defined as the number of the data objects from D_A that are inside R
and meanwhile have their distances to b not exceeding d_c. When there is a tie, the
accumulated distance from the data objects to b serves as the tie breaker, and the one with
smaller distance has the better optimality. In this paper, we present the optimality metric,
formalize the OLS query, and propose several algorithms for processing OLS queries
efficiently. A comprehensive experimental evaluation has been conducted using both real
and synthetic data sets to demonstrate the efficiency and effectiveness of the proposed
algorithms.
21 Evaluating the Effectiveness of Personalized Web Search –java/dotnet

Dou, Zhicheng Song, Ruihua Wen, Ji-Rong Yuan, Xiaojie
Microsoft Research Asia, Beijing;
On page(s): 1178-1190
ISSN: 1041-4347
Abstract
Although personalized search has been under way for many years and many
personalization algorithms have been investigated, it is still unclear whether
personalization is consistently effective on different queries for different users and under
different search contexts. In this paper, we study this problem and provide some findings.
We present a large-scale evaluation framework for personalized search based on query
logs and then evaluate five personalized search algorithms (including two click-based
ones and three topical-interest-based ones) using 12-day query logs of Windows Live
Search. By analyzing the results, we reveal that personalized Web search does not work
equally well under various situations. It represents a significant improvement over
generic Web search for some queries, while it has little effect and even harms query
performance under some situations. We propose click entropy as a simple measurement
on whether a query should be personalized. We further propose several features to
automatically predict when a query will benefit from a specific personalization algorithm.
Experimental results show that using a personalization algorithm for queries selected by
our prediction model is better than using it simply for all queries.
22 A Divide-and-Conquer Approach for Minimum Spanning Tree-Based

Clustering –java
Xiaochun Wang Xiali Wang Wilkes, D.M.
Dept. of Electr. Eng. & Comput. Sci., Vanderbilt Univ., Nashville, TN;
Publication Date: July 2009
On page(s): 945-958
ISSN: 1041-4347
INSPEC Accession Number: 10666309
Abstract
Due to their ability to detect clusters with irregular boundaries, minimum spanning tree-
based clustering algorithms have been widely used in practice. However, in such
clustering algorithms, the search for nearest neighbor in the construction of minimum
spanning trees is the main source of computation and the standard solutions take O(N2)
time. In this paper, we present a fast minimum spanning tree-inspired clustering
algorithm, which, by using an efficient implementation of the cut and the cycle property
of the minimum spanning trees, can have much better performance than O(N2).
23 Determining Attributes to Maximize Visibility of Objects –java

Miah, M. Das, G. Hristidis, V. Mannila, H.
Dept. of Comput. Sci. & Eng., Univ. of Texas at Arlington, Arlington, TX;
On page(s): 959-973
ISSN: 1041-4347
Abstract
In recent years, there has been significant interest in the development of ranking
functions and efficient top-k retrieval algorithms to help users in ad hoc search and
retrieval in databases (e.g., buyers searching for products in a catalog). We introduce a
complementary problem: How to guide a seller in selecting the best attributes of a new
tuple (e.g., a new product) to highlight so that it stands out in the crowd of existing
competitive products and is widely visible to the pool of potential buyers. We develop
several formulations of this problem. Although the problems are NP-complete, we give
several exact and approximation algorithms that work well in practice. One type of exact
algorithms is based on integer programming (IP) formulations of the problems. Another
class of exact methods is based on maximal frequent item set mining algorithms. The
approximation algorithms are based on greedy heuristics. A detailed performance study
illustrates the benefits of our methods on real and synthetic data.
24 Efficient Skyline Computation in Structured Peer-to-Peer Systems –java

Bin Cui Lijiang Chen Linhao Xu Hua Lu Guojie Song Quanqing Xu
Dept. of Comput. Sci., Peking Univ., Beijing;
On page(s): 1059-1072
ISSN: 1041-4347
Abstract
An increasing number of large-scale applications exploit peer-to-peer network
architecture to provide highly scalable and flexible services. Among these applications,
data management in peer-to-peer systems is one of the interesting domains. In this paper,
we investigate the multidimensional skyline computation problem on a structured peer-
to-peer network. In order to achieve low communication cost and quick response time,
we utilize the iMinMax(theta ) method to transform high-dimensional data to one-
dimensional value and distribute the data in a structured peer-to-peer network called
BATON. Thereafter, we propose a progressive algorithm with adaptive filter technique
for efficient skyline computation in this environment. We further discuss some
optimization techniques for the algorithm, and summarize the key principles of our
algorithm into a query routing protocol with detailed analysis. Finally, we conduct an
extensive experimental evaluation to demonstrate the efficiency of our approach.
25 A Communication Perspective on Automatic Text Categorization –java/dotnet

Capdevila, M. Florez, O.W.M.
Signal & Commun. Process. Dept., Univ. of Vigo, Vigo;
On page(s): 1027-1041
ISSN: 1041-4347
Abstract
The basic concern of a Communication System is to transfer information from its source
to a destination some distance away. Textual documents also deal with the transmission
of information. Particularly, from a text categorization system point of view, the
information encoded by a document is the topic or category it belongs to. Following this
initial intuition, a theoretical framework is developed where Automatic Text
Categorization(ATC) is studied under a Communication System perspective. Under this
approach, the problematic indexing feature space dimensionality reduction has been
tackled by a two-level supervised scheme, implemented by a noisy terms filtering and a
subsequent redundant terms compression. Gaussian probabilistic categorizers have been
revisited and adapted to the concomitance of sparsity in ATC. Experimental results
pertaining to 20 Newsgroups and Reuters-21578 collections validate the theoretical
approaches. The noise filter and redundancy compressor allows an aggressive term
vocabulary reduction (reduction factor greater than 0.99) with a minimum loss (lower
than 3 percent) and, in some cases, gain (greater than 4 percent) of final classification
accuracy. The adapted Gaussian Naive Bayes classifier reaches classification results
similar to those obtained by state-of-the-art Multinomial Naive Bayes (MNB) and
Support Vector Machines (SVMs).
26 Predicting Missing Items in Shopping Carts –j2ee/dotnet

Wickramaratna, K. Kubat, M. Premaratne, K.
Dept. of Electr. & Comput. Eng., Univ. of Miami, Coral Gables, FL;
On page(s): 985-998
ISSN: 1041-4347
Abstract
Existing research in association mining has focused mainly on how to expedite the search
for frequently co-occurring groups of items in ldquoshopping cartrdquo type of
transactions; less attention has been paid to methods that exploit these ldquofrequent
itemsetsrdquo for prediction purposes. This paper contributes to the latter task by
proposing a technique that uses partial information about the contents of a shopping cart
for the prediction of what else the customer is likely to buy. Using the recently proposed
data structure of itemset trees (IT-trees), we obtain, in a computationally efficient
manner, all rules whose antecedents contain at least one item from the incomplete
shopping cart. Then, we combine these rules by uncertainty processing techniques,
including the classical Bayesian decision theory and a new algorithm based on the
Dempster-Shafer (DS) theory of evidence combination.
27 Discovery of Structural and Functional Features in RNA Pseudoknots-

java/dotnet
Chen, Q. Chen, Y.-P.P.
Fac. of Sci. & Technol., Deakin Univ., VIC;
On page(s): 974-984
ISSN: 1041-4347
Abstract
An RNA pseudoknot consists of nonnested double-stranded stems connected by single-
stranded loops. There is increasing recognition that RNA pseudoknots are one of the most
prevalent RNA structures and fulfill a diverse set of biological roles within cells, and
there is an expanding rate of studies into RNA pseudoknotted structures as well as
increasing allocation of function. These not only produce valuable structural data but also
facilitate an understanding of structural and functional characteristics in RNA molecules.
PseudoBase is a database providing structural, functional, and sequence data related to
RNA pseudoknots. To capture the features of RNA pseudoknots, we present a novel
framework using quantitative association rule mining to analyze the pseudoknot data. The
derived rules are classified into specified association groups regarding structure, function,
and category of RNA pseudoknots. The discovered association rules assist biologists in
filtering out significant knowledge of structure-function and structure-category
relationships. A brief biological interpretation to the relationships is presented, and their
potential correlations with each other are highlighted.
28 Predictive Ensemble Pruning by Expectation Propagation –java

Huanhuan Chen Tiho, P. Xin Yao
Sch. of Comput. Sci., Univ. of Birmingham, Birmingham;
On page(s): 999-1013
ISSN: 1041-4347
Abstract
An ensemble is a group of learners that work together as a committee to solve a problem.
The existing ensemble learning algorithms often generate unnecessarily large ensembles,
which consume extra computational resource and may degrade the generalization
performance. Ensemble pruning algorithms aim to find a good subset of ensemble
members to constitute a small ensemble, which saves the computational resource and
performs as well as, or better than, the unpruned ensemble. This paper introduces a
probabilistic ensemble pruning algorithm by choosing a set of ldquosparserdquo
combination weights, most of which are zeros, to prune the ensemble. In order to obtain
the set of sparse combination weights and satisfy the nonnegative constraint of the
combination weights, a left-truncated, nonnegative, Gaussian prior is adopted over every
combination weight. Expectation propagation (EP) algorithm is employed to approximate
the posterior estimation of the weight vector. The leave-one-out (LOO) error can be
obtained as a by-product in the training of EP without extra computation and is a good
indication for the generalization error. Therefore, the LOO error is used together with the
Bayesian evidence for model selection in this algorithm. An empirical study on several
regression and classification benchmark data sets shows that our algorithm utilizes far
less component learners but performs as well as, or better than, the unpruned ensemble.
Our results are very competitive compared with other ensemble pruning algorithms.
29 Rough Cluster Quality Index Based on Decision Theory-java/dotnet

Lingras, P. Min Chen Duoqian Miao
Dept. of Math. & Comput. Sci., St. Mary's Univ., Halifax, NS;
On page(s): 1014-1026
ISSN: 1041-4347
Abstract
Quality of clustering is an important issue in application of clustering techniques. Most
traditional cluster validity indices are geometry-based cluster quality measures. This
paper proposes a cluster validity index based on the decision-theoretic rough set model
by considering various loss functions. Experiments with synthetic, standard, and real-
world retail data show the usefulness of the proposed validity index for the evaluation of
rough and crisp clustering. The measure is shown to help determine optimal number of
clusters, as well as an important parameter called threshold in rough clustering. The
experiments with a promotional campaign for the retail data illustrate the ability of the
proposed measure to incorporate financial considerations in evaluating quality of a
clustering scheme. This ability to deal with monetary values distinguishes the proposed
decision-theoretic measure from other distance-based measures. The proposed validity
index can also be extended for evaluating other clustering algorithms such as fuzzy
clustering.
30 ANGEL: Enhancing the Utility of Generalization for Privacy Preserving

Publication –java/dotnet
Yufei Tao Hekang Chen Xiaokui Xiao Shuigeng Zhou Donghui Zhang
Dept. of Comput. Sci. & Eng., Chinese Univ. of Hong Kong, Hong Kong;
On page(s): 1073-1087
ISSN: 1041-4347
Abstract
Generalization is a well-known method for privacy preserving data publication. Despite
its vast popularity, it has several drawbacks such as heavy information loss, difficulty of
supporting marginal publication, and so on. To overcome these drawbacks, we develop
ANGEL,1 a new anonymization technique that is as effective as generalization in privacy
protection, but is able to retain significantly more information in the microdata. ANGEL
is applicable to any monotonic principles (e.g., l-diversity, t-closeness, etc.), with its
superiority (in correlation preservation) especially obvious when tight privacy control
must be enforced. We show that ANGEL lends itself elegantly to the hard problem of
marginal publication. In particular, unlike generalization that can release only restricted
marginals, our technique can be easily used to publish any marginals with strong privacy
guarantees.
31 Clustering and Sequential Pattern Mining of Online Collaborative Learning

Data –j2ee /dotnet
Perera, D. Kay, J. Koprinska, I. Yacef, K. Zaiane, O.R.
Sch. of Inf. Technol., Univ. of Sydney, Sydney, NSW;
Publication Date: June 2009
On page(s): 759-772
ISSN: 1041-4347
Abstract
Group work is widespread in education. The growing use of online tools supporting
group work generates huge amounts of data. We aim to exploit this data to support
mirroring: presenting useful high-level views of information about the group, together
with desired patterns characterizing the behavior of strong groups. The goal is to enable
the groups and their facilitators to see relevant aspects of the group's operation and
provide feedback if these are more likely to be associated with positive or negative
outcomes and indicate where the problems are. We explore how useful mirror
information can be extracted via a theory-driven approach and a range of clustering and
sequential pattern mining. The context is a senior software development project where
students use the collaboration tool TRAC. We extract patterns distinguishing the better
from the weaker groups and get insights in the success factors. The results point to the
importance of leadership and group interaction, and give promising indications if they are
occurring. Patterns indicating good individual practices were also identified. We found
that some key measures can be mined from early data. The results are promising for
advising groups at the start and early identification of effective and poor practices, in
time for remediation.
32 Monitoring Online Tests through Data Visualization –j2ee/dotnet

Costagliola, G. Fuccella, V. Giordano, M. Polese, G.
Dipt. di Mat. ed Inf., Univ. degli Studi di Salerno, Fisciano;
On page(s): 773-784
ISSN: 1041-4347
Abstract
We present an approach and a system to let tutors monitor several important aspects
related to online tests, such as learner behavior and test quality. The approach includes
the logging of important data related to learner interaction with the system during the
execution of online tests and exploits data visualization to highlight information useful to
let tutors review and improve the whole assessment process. We have focused on the
discovery of behavioral patterns of learners and conceptual relationships among test
items. Furthermore, we have led several experiments in our faculty in order to assess the
whole approach. In particular, by analyzing the data visualization charts, we have
detected several previously unknown test strategies used by the learners. Last, we have
detected several correlations among questions, which gave us useful feedbacks on the test
quality.
33.
Communities and Emerging Semantics in Semantic Link Network: Discovery and

Learning
Hai Zhuge
Inst. of Comput. Technol., Chinese Acad. of Sci., Beijing;
On page(s): 785-799
ISSN: 1041-4347
Abstract
The World Wide Web provides plentiful contents for Web-based learning, but its
hyperlink-based architecture connects Web resources for browsing freely rather than for
effective learning. To support effective learning, an e-learning system should be able to
discover and make use of the semantic communities and the emerging semantic relations
in a dynamic complex network of learning resources. Previous graph-based community
discovery approaches are limited in ability to discover semantic communities. This paper
first suggests the semantic link network (SLN), a loosely coupled semantic data model
that can semantically link resources and derive out implicit semantic links according to a
set of relational reasoning rules. By studying the intrinsic relationship between semantic
communities and the semantic space of SLN, approaches to discovering reasoning-
constraint, rule-constraint, and classification-constraint semantic communities are
proposed. Further, the approaches, principles, and strategies for discovering emerging
semantics in dynamic SLNs are studied. The basic laws of the semantic link network
motion are revealed for the first time. An e-learning environment incorporating the
proposed approaches, principles, and strategies to support effective discovery and
learning is suggested.
34 Toward a Fuzzy Domain Ontology Extraction Method for Adaptive e-Learning

Lau, R.Y.K. Dawei Song Yuefeng Li Cheung, T.C.H. Jin-Xing Hao
Dept. of Inf. Syst., City Univ. of Hong Kong, Kowloon;
On page(s): 800-813
ISSN: 1041-4347
Abstract
With the widespread applications of electronic learning (e-Learning) technologies to
education at all levels, increasing number of online educational resources and messages
are generated from the corresponding e-Learning environments. Nevertheless, it is quite
difficult, if not totally impossible, for instructors to read through and analyze the online
messages to predict the progress of their students on the fly. The main contribution of this
paper is the illustration of a novel concept map generation mechanism which is
underpinned by a fuzzy domain ontology extraction algorithm. The proposed mechanism
can automatically construct concept maps based on the messages posted to online
discussion forums. By browsing the concept maps, instructors can quickly identify the
progress of their students and adjust the pedagogical sequence on the fly. Our initial
experimental results reveal that the accuracy and the quality of the automatically
generated concept maps are promising. Our research work opens the door to the
development and application of intelligent software tools to enhance e-Learning.
35
Open Smart Classroom: Extensible and Scalable Learning System in Smart Space Using
Web Service Technology
Yue Suo Miyata, N. Morikawa, H. Ishida, T. Yuanchun Shi
Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing;
On page(s): 814-828
ISSN: 1041-4347
Abstract
Real-time interactive virtual classroom with teleeducation experience is an important
approach in distance learning. However, most current systems fail to meet new
challenges in extensibility and scalability, which mainly lie with three issues. First, an
open system architecture is required to better support the integration of increasing
human-computer interfaces and personal mobile devices in the classroom. Second, the
learning system should facilitate opening its interfaces, which will help easy deployment
that copes with different circumstances and allows other learning systems to talk to each
other. Third, problems emerge on binding existing systems of classrooms together in
different places or even different countries such as tackling systems intercommunication
and distant intercultural learning in different languages. To address these issues, we build
a prototype application called Open Smart Classroom built on our software infrastructure
based on the multiagent system architecture using Web Service technology in Smart
Space. Besides the evaluation of the extensibility and scalability of the system, an
experiment connecting two Open Smart Classrooms deployed in different countries is
also undertaken, which demonstrates the influence of these new features on the
educational effect. Interesting and optimistic results obtained show a significant research
prospect for developing future distant learning systems.
36 NNexus: An Automatic Linker for Collaborative Web-Based Corpora

Gardner, J. Krowne, A. Xiong, L.
Dept. of Math. & Comput. Sci., Emory Univ., Atlanta, GA;
On page(s): 829-839
ISSN: 1041-4347
Abstract
In this paper, we introduce Noosphere Networked Entry eXtension and Unification
System (NNexus), a generalization of the automatic linking engine of Noosphere (at
PlanetMath.org) and the first system that automates the process of linking disparate
"encyclopediardquo entries into a fully connected conceptual network. The main
challenges of this problem space include: 1) linking quality (correctly identifying which
terms to link and which entry to link to with minimal effort on the part of users), 2)
efficiency and scalability, and 3) generalization to multiple knowledge bases and web-
based information environment. We present the NNexus approach that utilizes subject
classification and other metadata to address these challenges. We also present evaluation
results demonstrating the effectiveness and efficiency of the approach and discuss
ongoing and future directions of research.
37 Effective Collaboration with Information Sharing in Virtual Universities

Hua Wang Yanchun Zhang Jinli Cao
Dept. of Math. & Comput., Univ. of Southern Queensland, Toowoomba, QLD;
On page(s): 840-853
ISSN: 1041-4347
Abstract
A global education system, as a key area in future IT, has fostered developers to provide
various learning systems with low cost. While a variety of e-learning advantages has been
recognized for a long time and many advances in e-learning systems have been
implemented, the needs for effective information sharing in a secure manner have to date
been largely ignored, especially for virtual university collaborative environments.
Information sharing of virtual universities usually occurs in broad, highly dynamic
network-based environments, and formally accessing the resources in a secure manner
poses a difficult and vital challenge. This paper aims to build a new rule-based
framework to identify and address issues of sharing in virtual university environments
through role-based access control (RBAC) management. The framework includes a role-
based group delegation granting model, group delegation revocation model, authorization
granting, and authorization revocation. We analyze various revocations and the impact of
revocations on role hierarchies. The implementation with XML-based tools demonstrates
the feasibility of the framework and authorization methods. Finally, the current proposal
is compared with other related work.
38 Learning in an Ambient Intelligent World: Enabling Technologies and Practices

Xiang Li Ling Feng Lizhu Zhou Yuanchun Shi
Tsinghua Univ., Beijing;
On page(s): 910-924
ISSN: 1041-4347
Abstract
The rapid evolution of information and communication technology opens a wide
spectrum of opportunities to change our surroundings into an Ambient Intelligent (AmI)
world. AmI is a vision of future information society, where people are surrounded by a
digital environment that is sensitive to their needs, personalized to their requirements,
anticipatory of their behavior, and responsive to their presence. It emphasizes on greater
user friendliness, user empowerment, and more effective service support, with an aim to
bring information and communication technology to everyone, every home, every
business, and every school, thus improving the quality of human life. AmI
unprecedentedly enhances learning experiences by endowing the users with the
opportunities of learning in context, a breakthrough from the traditional education
settings. In this survey paper, we examine some major characteristics of an AmI learning
environment. To deliver a feasible and effective solution to ambient learning, we
overview a few latest developed enabling technologies in context awareness and
interactive learning. Associated practices are meanwhile reported. We also describe our
experience in designing and implementing a smart class prototype, which allows teachers
to simultaneously instruct both local and remote students in a context-aware and natural
way.
39Interactive Correction and Recommendation for Computer Language Learning and

Training
Pahl, C. Kenny, C.
Sch. of Comput., Dublin City Univ., Dublin;
On page(s): 854-866
ISSN: 1041-4347
Abstract
Active learning and training is a particularly effective form of education. In various
domains, skills are equally important to knowledge. We present an automated learning
and skills training system for a database programming environment that promotes
procedural knowledge acquisition and skills training. The system provides meaningful
knowledge-level feedback such as correction of student solutions and personalized
guidance through recommendations. Specifically, we address automated synchronous
feedback and recommendations based on personalized performance assessment. At the
core of the tutoring system is a pattern-based error classification and correction
component that analyzes student input in order to provide immediate feedback and in
order to diagnose student weaknesses and suggest further study material. A syntax-driven
approach based on grammars and syntax trees provides the solution for a semantic
analysis technique. Syntax tree abstractions and comparison techniques based on
equivalence rules and pattern matching are specific approaches.
40 Subontology-Based Resource Management for Web-Based e-Learning

Zhaohui Wu Yuxin Mao Huajun Chen
Sch. of Comput. Sci., Zhejiang Univ., Hangzhou;
On page(s): 867-880
ISSN: 1041-4347
Abstract
Recent advances in Web and information technologies have resulted in many e-learning
resources. There is an emerging requirement to manage and reuse relevant resources
together to achieve on-demand e-learning in the Web. Ontologies have become a key
technology for enabling semantic-driven resource management. We argue that to meet
the requirements of semantic-based resource management for Web-based e-learning, one
should go beyond using domain ontologies statically. In this paper, we provide a
semantic mapping mechanism to integrate e-learning databases by using ontology
semantics. Heterogeneous e-learning databases can be integrated under a mediated
ontology. Taking into account the locality of resource reuse, we propose to represent
context-specific portions from the whole ontology as sub-ontologies. We present a sub-
ontology-based approach for resource reuse by using an evolutionary algorithm. We also
conduct simulation experiments to evaluate the approach with a traditional Chinese
medicine e-learning scenario and obtain promising results.
41 Enhancing Learning Objects with an Ontology-Based Memory

Zouaq, A. Nkambou, R.
Univ. of Quebec at Montreal, Montreal, QC;
On page(s): 881-893
ISSN: 1041-4347
Abstract
The reusability in learning objects has always been a hot issue. However, we believe that
current approaches to e-Learning failed to find a satisfying answer to this concern. This
paper presents an approach that enables capitalization of existing learning resources by
first creating "content metadatardquo through text mining and natural language
processing and second by creating dynamically learning knowledge objects, i.e., active,
adaptable, reusable, and independent learning objects. The proposed model also suggests
integrating explicitly instructional theories in an on-the-fly composition process of
learning objects. Semantic Web technologies are used to satisfy such an objective by
creating an ontology-based organizational memory able to act as a knowledge base for
multiple training environments.
42 Providing Flexible Process Support to Project-Centered Learning
Ceri, S. Daniel, F. Matera, M. Raffio, A.
Dipt. di Elettron. e Inf., Politec. di Milano, Milan;
On page(s): 894-909
ISSN: 1041-4347
Abstract
While business process definition is becoming more and more popular as an instrument
for describing human activities, there is a growing need for software tools supporting
business process abstractions to help users organize and monitor their desktop work.
Tools are most effective when they embed some knowledge about the process, e.g., in
terms of the typical activities required by the process, so that users can execute the
activities without having to define them. Tools must be lightweight and flexible, so as to
enable users to create or change the process as soon as there is a new need. In this article,
we first describe an application-independent approach to flexible process support by
discussing the abstractions required for modeling, creating, enacting, and modifying
flexible processes. Then, we show our approach at work in the context of project-
centered learning. In this application, learners are challenged to perform concrete tasks in
order to master specific subjects; in doing so, they have to conduct significant projects
and cope with realistic (or even real-life) working conditions and scenarios. Often,
students are geographically dispersed or under severe timing constraints, because these
activities intertwine with their normal university activity. As a result, they need
communication technology in order to interact and workflow technology in order to
organize their work. The developed platform provides a comprehensible, e-learning-
specific set of activities and process templates, which can be combined through a simple
Web interface into project-centered collaboration processes. We discuss how the general
paradigm of flexible processes was adapted to the learning concept, implemented, and
experienced by students.
43 An Implementation of the CORDRA Architecture Enhanced for Systematic

Reuse of Learning Objects
Lin, F.H. Shih, T.K. Won Kim
Dept. of Inf. Manage., Chihlee Inst. of Technol., Taipei;
On page(s): 925-938
ISSN: 1041-4347
Abstract
The Sharable Content Object Reference Model (SCORM) specification defines metadata
of learning objects, which are used as the elementary reusable components in distance
learning. The Content Object Repository Discovery And Registration/Resolution
Architecture (CORDRA) specification provides a common architecture for the resolution,
discovery, and sharing of these learning objects. These two specifications together define
standardized ways in which learning objects can be discovered and reused by content
designers. However, the current CORDRA and the definition of objects in SCORM only
allow an object to be copied, updated, and reorganized in a new content aggregation,
which is used as a delivery package to end users. This paper proposes a revised
CORDRA architecture and a reusability mechanism to make instruction design easier. In
particular, it proposes a structure called a reusability tree for tracking the history of reuse
of learning objects in CORDRA. This paper also defines the notions of similarity,
diversity, and relevancy of learning objects to make it easier for users to precisely search
for and reuse learning objects.
44 On the Expressive Power of the Relational Algebra on Finite Sets of Relation

Pairs
Fletcher, G.H.L. Gyssens, M. Paredaens, J. Van Gucht, D.
Sch. of Eng. & Comput. Sci., Washington State Univ., Vancouver, WA;
On page(s): 939-942
ISSN: 1041-4347
Abstract
We give a language-independent characterization of the expressive power of the
relational algebra on finite sets of source-target relation instance pairs. The associated
decision problem is shown to be co-graph-isomorphism hard and in co NP. The main
result is also applied in providing a new characterization of the generic relational queries.
45.Adapted One-versus-All Decision Trees for Data Stream Classification
Hashemi, S. Ying Yang Mirzamomen, Z. Kangavari, M.
Monash Univ., Clayton, VIC;
Publication Date: May 2009
On page(s): 624-637
ISSN: 1041-4347
Abstract
One versus all (OVA) decision trees learn k individual binary classifiers, each one to
distinguish the instances of a single class from the instances of all other classes. Thus
OVA is different from existing data stream classification schemes whose majority use
multiclass classifiers, each one to discriminate among all the classes. This paper
advocates some outstanding advantages of OVA for data stream classification. First,
there is low error correlation and hence high diversity among OVA's component
classifiers, which leads to high classification accuracy. Second, OVA is adept at
accommodating new class labels that often appear in data streams. However, there also
remain many challenges to deploy traditional OVA for classifying data streams. First, as
every instance is fed to all component classifiers, OVA is known as an inefficient model.
Second, OVA's classification accuracy is adversely affected by the imbalanced class
distribution in data streams. This paper addresses those key challenges and consequently
proposes a new OVA scheme that is adapted for data stream classification. Theoretical
analysis and empirical evidence reveal that the adapted OVA can offer faster training,
faster updating and higher classification accuracy than many existing popular data stream
classification algorithms.
46 Bayes Vector Quantizer for Class-Imbalance Problem

Diamantini, C. Potena, D.
Dipt. di Ing. Inf., Gestionale e dell'Autom., Universitd Politec. delle Marche, Ancona;
On page(s): 638-651
ISSN: 1041-4347
Abstract
The class-imbalance problem is the problem of learning a classification rule from data
that are skewed in favor of one class. On these datasets traditional learning techniques
tend to overlook the less numerous class, at the advantage of the majority class. However,
the minority class is often the most interesting one for the task at hand. For this reason,
the class-imbalance problem has received increasing attention in the last few years. In the
present paper we point the attention of the reader to a learning algorithm for the
minimization of the average misclassification risk. In contrast to some popular class-
imbalance learning methods, this method has its roots in statistical decision theory. A
particular interesting characteristic is that when class distributions are unknown, the
method can work by resorting to stochastic gradient algorithm. We study the behavior of
this algorithm on imbalanced datasets, demonstrating that this principled approach allows
to obtain better classification performances compared to the principal methods proposed
in the literature.
47 Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data

Hung-Leng Chen Ming-Syan Chen Su-Chen Lin
Dept. of Electr. Eng., Nat. Taiwan Univ., Taipei;
On page(s): 652-665
ISSN: 1041-4347
Abstract
Sampling has been recognized as an important technique to improve the efficiency of
clustering. However, with sampling applied, those points that are not sampled will not
have their labels after the normal process. Although there is a straightforward approach in
the numerical domain, the problem of how to allocate those unlabeled data points into
proper clusters remains as a challenging issue in the categorical domain. In this paper, a
mechanism named MAximal Resemblance Data Labeling (abbreviated as MARDL) is
proposed to allocate each unlabeled data point into the corresponding appropriate cluster
based on the novel categorical clustering representative, namely, N-Nodeset Importance
Representative (abbreviated as NNIR), which represents clusters by the importance of the
combinations of attribute values. MARDL has two advantages: (1) MARDL exhibits high
execution efficiency, and (2) MARDL can achieve high intracluster similarity and low
intercluster similarity, which are regarded as the most important properties of clusters,
thus benefiting the analysis of cluster behaviors. MARDL is empirically validated on real
and synthetic data sets and is shown to be significantly more efficient than prior works
while attaining results of high quality.
48 GridVideo: A Practical Example of Nonscientific Application on the Grid

Bruneo, D. Iellamo, G. Minutoli, G. Puliafito, A.
Dept. of Math., Univ. of Messina, Messina;
On page(s): 666-680
ISSN: 1041-4347
Abstract
Starting from 1990s and until now, Grid computing has been mainly used in scientific
laboratories. Only in the last few years, it is evolving into a business-innovating
technology that is driving commercial adoption. In this paper, we describe GridVideo, a
Grid-based multimedia application for the distributed tailoring and streaming of media
files. The objective is to show, starting from a real experience, how Grid technologies can
be used for the development of nonscientific applications. Relevant performance aspects
are analyzed, regarding both user-oriented (in terms of responsiveness) and provider-
oriented (in terms of system efficiency) requirements. Different multimedia data
dissemination strategies have been analyzed and an innovative technique, based on the
Fibonacci series, is proposed. To respond to the stringent quality-of-service (QoS)
requirements, typical of soft real-time applications, a reservation-based architecture is
presented. Such architecture is able to manage the Grid resource allocation, thus enabling
the provisioning of advanced services with different QoS levels. Technical and practical
problems encountered during the development are discussed, and a thorough performance
evaluation of the developed prototype is presented.
49 Hierarchically Distributed Peer-to-Peer Document Clustering and Cluster

Summarization
Hammouda, K.M. Kamel, M.S.
Desire2Learn Inc., Kitchener, ON;
On page(s): 681-698
ISSN: 1041-4347
Abstract
In distributed data mining, adopting a flat node distribution model can affect scalability.
To address the problem of modularity, flexibility and scalability, we propose a
Hierarchically-distributed Peer-to-Peer (HP2PC) architecture and clustering algorithm.
The architecture is based on a multi-layer overlay network of peer neighborhoods.
Supernodes, which act as representatives of neighborhoods, are recursively grouped to
form higher level neighborhoods. Within a certain level of the hierarchy, peers cooperate
within their respective neighborhoods to perform P2P clustering. Using this model, we
can partition the clustering problem in a modular way across neighborhoods, solve each
part individually using a distributed K-means variant, then successively combine
clusterings up the hierarchy where increasingly more global solutions are computed. In
addition, for document clustering applications, we summarize the distributed document
clusters using a distributed keyphrase extraction algorithm, thus providing interpretation
of the clusters. Results show decent speedup, reaching 165 times faster than centralized
clustering for a 250-node simulated network, with comparable clustering quality to the
centralized approach. We also provide comparison to the P2P K-means algorithm and
show that HP2PC accuracy is better for typical hierarchy heights. Results for distributed
cluster summarization match those of their centralized counterparts with up to 88%
accuracy.
50 Exact Knowledge Hiding through Database Extension

Gkoulalas-Divanis, A. Verykios, V.S.
Dept. of Comput. & Commun. Eng., Univ. of Thessaly, Volos;
On page(s): 699-713
ISSN: 1041-4347
Abstract
In this paper, we propose a novel, exact border-based approach that provides an optimal
solution for the hiding of sensitive frequent itemsets by (i) minimally extending the
original database by a synthetically generated database part - the database extension, (ii)
formulating the creation of the database extension as a constraint satisfaction problem,
(iii) mapping the constraint satisfaction problem to an equivalent binary integer
programming problem, (iv) exploiting underutilized synthetic transactions to
proportionally increase the support of non-sensitive itemsets, (v) minimally relaxing the
constraint satisfaction problem to provide an approximate solution close to the optimal
one when an ideal solution does not exist, and (vi) by using a partitioning in the universe
of the items to increase the efficiency of the proposed hiding algorithm. Extending the
original database for sensitive itemset hiding is proved to provide optimal solutions to an
extended set of hiding problems compared to previous approaches and to provide
solutions of higher quality. Moreover, the application of binary integer programming
enables the simultaneous hiding of the sensitive itemsets and thus allows for the
identification of globally optimal solutions.
51
GLIP: A Concurrency Control Protocol for Clipping Indexing

Chang-Tien Lu Jing Dai Ying Jin Mathuria, J.
Dept. of Comput. Sci., Virginia Tech., Falls Church, VA;
On page(s): 714-728
ISSN: 1041-4347
Abstract
Multidimensional databases are beginning to be used in a wide range of applications. To
meet this fast-growing demand, the R-tree family is being applied to support fast access
to multidimensional data, for which the R+-tree exhibits outstanding search performance.
In order to support efficient concurrent access in multiuser environments, concurrency
control mechanisms for multidimensional indexing have been proposed. However, these
mechanisms cannot be directly applied to the R+-tree because an object in the R+-tree
may be indexed in multiple leaves. This paper proposes a concurrency control protocol
for R-tree variants with object clipping, namely, Granular Locking for clipping indexing
(GLIP). GLIP is the first concurrency control approach specifically designed for the R+-
tree and its variants, and it supports efficient concurrent operations with serializable
isolation, consistency, and deadlock-free. Experimental tests on both real and synthetic
data sets validated the effectiveness and efficiency of the proposed concurrent access
framework.
52 Fast Query Point Movement Techniques for Large CBIR Systems

Danzhou Liu Hua, K.A. Khanh Vu Ning Yu
Sch. of Electr. Eng. & Comput. Sci., Univ. of Central Florida, Orlando, FL;
On page(s): 729-743
ISSN: 1041-4347
Abstract
Target search in content-based image retrieval (CBIR) systems refers to finding a specific
(target) image such as a particular registered logo or a specific historical photograph.
Existing techniques, designed around query refinement based on relevance feedback,
suffer from slow convergence, and do not guarantee to find intended targets. To address
these limitations, we propose several efficient query point movement methods. We prove
that our approach is able to reach any given target image with fewer iterations in the
worst and average cases. We propose a new index structure and query processing
technique to improve retrieval effectiveness and efficiency. We also consider strategies to
minimize the effects of users' inaccurate relevance feedback. Extensive experiments in
simulated and realistic environments show that our approach significantly reduces the
number of required iterations and improves overall retrieval performance. The
experimental results also confirm that our approach can always retrieve intended targets
even with poor selection of initial query points.
53 Schema Vacuuming in Temporal Databases

Roddick, J.F.
Sch. of Comput. Sci., Eng. & Math., Flinders Univ., Adelaide, SA;
On page(s): 744-747
ISSN: 1041-4347
Abstract
Temporal databases facilitate the support of historical information by providing functions
for indicating the intervals during which a tuple was applicable (along one or more
temporal dimensions). Because data are never deleted, only superceded, temporal
databases are inherently append-only resulting, over time, in a large historical sequence
of database states. Data vacuuming in temporal databases allows for this sequence to be
shortened by strategically, and irrevocably, deleting obsolete data. Schema versioning
allows users to maintain a history of database schemata without compromising the
semantics of the data or the ability to view data through historical schemata. While the
techniques required for data vacuuming in temporal databases have been relatively well
covered, the associated area of vacuuming schemata has received less attention. This
paper discusses this issue and proposes a mechanism that fits well with existing methods
for data vacuuming and schema versioning.
54 The Subgraph Similarity Problem

De Nardo, L. Ranzato, F. Tapparo, F.
Dipt. di Mat. Pura ed Applicata, Univ. of Padova, Padova;
On page(s): 748-749
ISSN: 1041-4347
Abstract
Similarity is a well known weakening of bisimilarity where one system is required to
simulate the other and vice versa. It has been shown that the subgraph bisimilarity
problem, a variation of the subgraph isomorphism problem where isomorphism is
weakened to bisimilarity, is NP-complete. We show that the subgraph similarity problem
and some related variations thereof still remain NP-complete.
55 Histogram-Based Global Load Balancing in Structured Peer-to-Peer Systems

Quang Hieu Vu Beng Chin Ooi Rinard, M. Kian-Lee Tan
Sch. of Comput., Nat. Univ. of Singapore, Singapore;
Publication Date: April 2009
On page(s): 595-608
ISSN: 1041-4347
Abstract
Over the past few years, peer-to-peer (P2P) systems have rapidly grown in popularity and
have become a dominant means for sharing resources. In these systems, load balancing is
a key challenge because nodes are often heterogeneous. While several load-balancing
schemes have been proposed in the literature, these solutions are typically ad hoc,
heuristic based, and localized. In this paper, we present a general framework, HiGLOB,
for global load balancing in structured P2P systems. Each node in HiGLOB has two key
components: 1) a histogram manager maintains a histogram that reflects a global view of
the distribution of the load in the system, and 2) a load-balancing manager that
redistributes the load whenever the node becomes overloaded or underloaded. We exploit
the routing metadata to partition the P2P network into nonoverlapping regions
corresponding to the histogram buckets. We propose mechanisms to keep the cost of
constructing and maintaining the histograms low. We further show that our scheme can
control and bound the amount of load imbalance across the system. Finally, we
demonstrate the effectiveness of HiGLOB by instantiating it over three existing
structured P2P systems: Skip Graph, BATON, and Chord. Our experimental results
indicate that our approach works well in practice.
56
Progressive Parametric Query Optimization

Bizarro, P. Bruno, N. DeWitt, D.J.
CISUC/DEI, Univ. of Coimbra, Coimbra;
On page(s): 582-594
ISSN: 1041-4347
Abstract
Commercial applications usually rely on pre-compiled parameterized procedures to
interact with a database. Unfortunately, executing a procedure with a set of parameters
different from those used at compilation time may be arbitrarily sub-optimal. Parametric
query optimization (PQO) attempts to solve this problem by exhaustively determining the
optimal plans at each point of the parameter space at compile time. However, PQO is
likely not cost-effective if the query is executed infrequently or if it is executed with
values only within a subset of the parameter space. In this paper we propose instead to
progressively explore the parameter space and build a parametric plan during several
executions of the same query. We introduce algorithms that, as parametric plans are
populated, are able to frequently bypass the optimizer but still execute optimal or near-
optimal plans.
57
Multiscale Representations for Fast Pattern Matching in Stream Time Series

Xiang Lian Lei Chen Yu, J.X. Jinsong Han Jian Ma
Dept. of Comput. Sci. & Eng., Hong Kong Univ. of Sci. & Technol., Kowloon;
On page(s): 568-581
ISSN: 1041-4347
Abstract
Similarity-based time-series retrieval has been a subject of long-term study due to its
wide usage in many applications, such as financial data analysis, weather data
forecasting, and multimedia data retrieval. Its original task was to find those time series
similar to a pattern (query) time-series data, where both the pattern and data time series
are static. Recently, with an increasing demand on stream data management, similarity-
based stream time-series retrieval has raised new research issues due to its unique
requirements during the stream processing, such as one-pass search and fast response. In
this paper, we address the problem of matching both static and dynamic patterns over
stream time-series data. We will develop a novel multiscale representation, called
multiscale segment mean, for stream time-series data, which can be incrementally
computed and thus perfectly adapted to the stream characteristics. Most importantly, we
propose a novel multistep filtering mechanism, step by step, over the multiscale
representation. Analysis indicates that the mechanism can greatly prune the search space
and thus offer fast response. Furthermore, batch processing optimization and the dynamic
case where patterns are also from stream time series are discussed. Extensive experiments
show the multiscale representation together with the multistep filtering scheme can
efficiently filter out false candidates and detect patterns, compared to the multiscale
wavelet.
58 Optimal Lot Sizing Policies For Sequential Online Auctions

Tripathi, A.K. Nair, S.K. Karuga, G.G.
Dept. of Inf. Syst. & Oper. Manage., Univ. of Washington, Seattle, WA;
On page(s): 554-567
ISSN: 1041-4347
Abstract
This study proposes methods for determining the optimal lot sizes for sequential auctions
that are conducted to sell sizable quantities of an item. These auctions are fairly common
in business to consumer (B2C) auctions. In these auctions, the tradeoff for the auctioneer
is between the alacrity with which funds are received, and the amount of funds collected
by the faster clearing of inventory using larger lot sizes. Observed bids in these auctions
impact the auctioneer's decision on lot sizes in future auctions. We first present a goal
programming approach for estimating the bid distribution for the bidder population from
the observed bids, readily available in these auctions. We then develop models to
compute optimal lot sizes for both stationary and non-stationary bid distributions. For
stationary bid distribution, we present closed form solutions and structural results. Our
findings show that the optimal lot size increases with inventory holding costs and number
of bidders. Our model for non-stationary bid distribution captures the inter-auction
dynamics such as the number of bidders, their bids, past winning bids, and lot size. We
use simulated data to test the robustness of our model.
59 A Pure Nash Equilibrium-Based Game Theoretical Method for Data Replication

across Multiple Servers
Khan, S.U. Ahmad, I.
Dept. of Electr. & Comput. Eng., North Dakota State Univ., Fargo, ND;
On page(s): 537-553
ISSN: 1041-4347
Abstract
This paper proposes a non-cooperative game based technique to replicate data objects
across a distributed system of multiple servers in order to reduce user perceived Web
access delays. In the proposed technique computational agents represent servers and
compete with each other to optimize the performance of their servers. The optimality of a
non-cooperative game is typically described by Nash equilibrium, which is based on
spontaneous and non-deterministic strategies. However, Nash equilibrium may or may
not guarantee system-wide performance. Furthermore, there can be multiple Nash
equilibria, making it difficult to decide which one is the best. In contrast, the proposed
technique uses the notion of pure Nash equilibrium, which if achieved, guarantees stable
optimal performance. In the proposed technique, agents use deterministic strategies that
work in conjunction with their self-interested nature but ensure system-wide performance
enhancement. In general, the existence of a pure Nash equilibrium is hard to achieve, but
we prove the existence of such equilibrium in the proposed technique. The proposed
technique is also experimentally compared against some well-known conventional replica
allocation methods, such as branch and bound, greedy, and genetic algorithms.
60 On the Design and Applicability of Distance Functions in High-Dimensional Data

Space
Chih-Ming Hsu Ming-Syan Chen
Dept. of Electr. Eng., Nat. Taiwan Univ., Taipei;
On page(s): 523-536
ISSN: 1041-4347
Abstract
Effective distance functions in high dimensional data space are very important in
solutions for many data mining problems. Recent research has shown that if the Pearson
variation of the distance distribution converges to zero with increasing dimensionality,
the distance function will become unstable (or meaningless) in high dimensional space,
even with the commonly used Lp metric in the Euclidean space. This result has spawned
many studies the along the same lines. However, the necessary condition for unstability
of a distance function, which is required for function design, remains unknown. In this
paper, we shall prove that several important conditions are in fact equivalent to
unstability. Based on these theoretical results, we employ some effective and valid
indices for testing the stability of a distance function. In addition, this theoretical analysis
inspires us that unstable phenomena are rooted in variation of the distance distribution.
To demonstrate the theoretical results, we design a meaningful distance function, called
the shrinkage-divergence proximity (SDP), based on a given distance function. It is
shown empirically that the SDP significantly outperforms other measures in terms of
stability in high dimensional data space, and is thus more suitable for distance-based
clustering applications.
61
Mining Projected Clusters in High-Dimensional Spaces

Bouguessa, M. Shengrui Wang
Dept. of Comput. Sci., Univ. of Sherbrooke, Sherbrooke, QC;
On page(s): 507-522
ISSN: 1041-4347
Abstract
Clustering high-dimensional data has been a major challenge due to the inherent sparsity
of the points. Most existing clustering algorithms become substantially inefficient if the
required similarity measure is computed between data points in the full-dimensional
space. To address this problem, a number of projected clustering algorithms have been
proposed. However, most of them encounter difficulties when clusters hide in subspaces
with very low dimensionality. These challenges motivate our effort to propose a robust
partitional distance-based projected clustering algorithm. The algorithm consists of three
phases. The first phase performs attribute relevance analysis by detecting dense and
sparse regions and their location in each attribute. Starting from the results of the first
phase, the goal of the second phase is to eliminate outliers, while the third phase aims to
discover clusters in different subspaces. The clustering process is based on the k-means
algorithm, with the computation of distance restricted to subsets of attributes where
object values are dense. Our algorithm is capable of detecting projected clusters of low
dimensionality embedded in a high-dimensional space and avoids the computation of the
distance in the full-dimensional space. The suitability of our proposal has been
demonstrated through an empirical study using synthetic and real datasets.
62 IMine: Index Support for Item Set Mining

Baralis, E. Cerquitelli, T. Chiusano, S.
Dipt. di Autom. e Inf., Politec. di Torino, Torino;
On page(s): 493-506
ISSN: 1041-4347
Abstract
This paper presents the IMine index, a general and compact structure which provides
tight integration of item set extraction in a relational DBMS. Since no constraint is
enforced during the index creation phase, IMine provides a complete representation of
the original database. To reduce the I/O cost, data accessed together during the same
extraction phase are clustered on the same disk block. The IMine index structure can be
efficiently exploited by different item set extraction algorithms. In particular, IMine data
access methods currently support the FP-growth and LCM v.2 algorithms, but they can
straightforwardly support the enforcement of various constraint categories. The IMine
index has been integrated into the PostgreSQL DBMS and exploits its physical level
access methods. Experiments, run for both sparse and dense data distributions, show the
efficiency of the proposed index and its linear scalability also for large datasets. Item set
mining supported by the IMine index shows performance always comparable with, and
sometimes better than, state of the art algorithms accessing data on flat file.
63 Compression and Aggregation for Logistic Regression Analysis in Data Cubes

Ruibin Xi Nan Lin Yixin Chen
Dept. of Math., Washington Univ., St. Louis, MO;
On page(s): 479-492
ISSN: 1041-4347
Abstract
Logistic regression is an important technique for analyzing and predicting data with
categorical attributes. In this paper, We consider supporting online analytical processing
(OLAP) of logistic regression analysis for multi-dimensional data in a data cube where it
is expensive in time and space to build logistic regression models for each cell from the
raw data. We propose a novel scheme to compress the data in such a way that we can
reconstruct logistic regression models to answer any OLAP query without accessing the
raw data. Based on a first-order approximation to the maximum likelihood estimating
equations, we develop a compression scheme that compresses each base cell into a small
compressed data block with essential information to support the aggregation of logistic
regression models. Aggregation formulae for deriving high-level logistic regression
models from lower level component cells are given. We prove that the compression is
nearly lossless in the sense that the aggregated estimator deviates from the true model by
an error that is bounded and approaches to zero when the data size increases. The results
show that the proposed compression and aggregation scheme can make feasible OLAP of
logistic regression in a data cube. Further, it supports real-time logistic regression
analysis of stream data, which can only be scanned once and cannot be permanently
retained. Experimental results validate our theoretical analysis and demonstrate that our
method can dramatically save time and space costs with almost no degradation of the
modeling accuracy.
64 A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems
Wolff, R. Bhaduri, K. Kargupta, H.
Dept. of Manage. Inf. Syst., Haifa Univ., Haifa;
On page(s): 465-478
ISSN: 1041-4347
Abstract
In a large network of computers or wireless sensors, each of the components (henceforth,
peers) has some data about the global state of the system. Much of the system's
functionality such as message routing, information retrieval and load sharing relies on
modeling the global state. We refer to the outcome of the function (e.g., the load
experienced by each peer) as the emph{model} of the system. Since the state of the
system is constantly changing, it is necessary to keep the models up-to-date. Computing
global data mining models e.g. decision trees, k-means clustering in large distributed
systems may be very costly due to the scale of the system and due to communication
cost, which may be high. The cost further increases in a dynamic scenario when the data
changes rapidly. In this paper we describe a two step approach for dealing with these
costs. First, we describe a highly efficient emph{local} algorithm which can be used to
monitor a wide class of data mining models. Then, we use this algorithm as a feedback
loop for the monitoring of complex functions of the data such as its k-means clustering.
The theoretical claims are corroborated with a thorough experimental analysis.
65 mproving Personalization Solutions through Optimal Segmentation of Customer

Bases
Tianyi Jiang Tuzhilin, A.
AvePoint Inc., Jersey, NJ;
Publication Date: March 2009
On page(s): 305-320
Location: Los Angeles, CA, USA,
ISSN: 1041-4347
Abstract
On the Web, where the search costs are low and the competition is just a mouse click
away, it is crucial to segment the customers intelligently in order to offer more targeted
and personalized products and services to them. Traditionally, customer segmentation is
achieved using statistics-based methods that compute a set of statistics from the customer
data and group customers into segments by applying distance-based clustering algorithms
in the space of these statistics. In this paper, we present a direct grouping-based approach
to computing customer segments that groups customers not based on computed statistics,
but in terms of optimally combining transactional data of several customers to build a
data mining model of customer behavior for each group. Then, building customer
segments becomes a combinatorial optimization problem of finding the best partitioning
of the customer base into disjoint groups. This paper shows that finding an optimal
customer partition is NP-hard, proposes several suboptimal direct grouping segmentation
methods, and empirically compares them among themselves, traditional statistics-based
hierarchical and affinity propagation-based segmentation, and one-to-one methods across
multiple experimental conditions. It is shown that the best direct grouping method
significantly dominates the statistics-based and one-to-one approaches across most of the
experimental conditions, while still being computationally tractable. It is also shown that
the distribution of the sizes of customer segments generated by the best direct grouping
method follows a power law distribution and that microsegmentation provides the best
approach to personalization.
66
Effective and Efficient Query Processing for Video Subsequence Identification

Heng Tao Shen Jie Shao Zi Huang Xiaofang Zhou
Sch. of Inf. Technol. & Electr. Eng., Univ. of Queensland, Brisbane, QLD;
On page(s): 321-334
ISSN: 1041-4347
Abstract
With the growing demand for visual information of rich content, effective and efficient
manipulations of large video databases are increasingly desired. Many investigations
have been made on content-based video retrieval. However, despite the importance, video
subsequence identification, which is to find the similar content to a short query clip from
a long video sequence, has not been well addressed. This paper presents a graph
transformation and matching approach to this problem, with extension to identify the
occurrence of potentially different ordering or length due to content editing. With a novel
batch query algorithm to retrieve similar frames, the mapping relationship between the
query and database video is first represented by a bipartite graph. The densely matched
parts along the long sequence are then extracted, followed by a filter-and-refine search
strategy to prune some irrelevant subsequences. During the filtering stage, maximum size
matching is deployed for each subgraph constructed by the query and candidate
subsequence to obtain a smaller set of candidates. During the refinement stage, sub-
maximum similarity matching is devised to identify the subsequence with the highest
aggregate score from all candidates, according to a robust video similarity model that
incorporates visual content, temporal order, and frame alignment information. The
performance studies conducted on a long video recording of 50 hours validate that our
approach is promising in terms of both search accuracy and speed.
67 Automatically Determining the Number of Clusters in Unlabeled Data Sets
Liang Wang Leckie, C. Ramamohanarao, K. Bezdek, J.
Dept. of Comput. Sci. & Software Eng., Univ. of Melbourne, Melbourne, VIC;
On page(s): 335-350
ISSN: 1041-4347
Abstract
Clustering is a popular tool for exploratory data analysis. One of the major problems in
cluster analysis is the determination of the number of clusters in unlabeled data, which is
a basic input for most clustering algorithms. In this paper we investigate a new method
called DBE (dark block extraction) for automatically estimating the number of clusters in
unlabeled data sets, which is based on an existing algorithm for visual assessment of
cluster tendency (VAT) of a data set, using several common image and signal processing
techniques. Basic steps include: 1) generating a VAT image of an input dissimilarity
matrix; 2) performing image segmentation on the VAT image to obtain a binary image,
followed by directional morphological filtering; 3) applying a distance transform to the
filtered binary image and projecting the pixel values onto the main diagonal axis of the
image to form a projection signal; 4) smoothing the projection signal, computing its first-
order derivative, and then detecting major peaks and valleys in the resulting signal to
decide the number of clusters. Our new DBE method is nearly "automatic", depending on
just one easy-to-set parameter. Several numerical and real-world examples are presented
to illustrate the effectiveness of DBE.
68 Efficient Processing of Metric Skyline Queries

Lei Chen Xiang Lian
Dept. of Comput. Sci. & Eng., Hong Kong Univ. of Sci. & Technol., Hong Kong;
On page(s): 351-365
ISSN: 1041-4347
Abstract
Skyline query is of great importance in many applications, such as multi-criteria decision
making and business planning. In particular, a skyline point is a data object in the
database whose attribute vector is not dominated by that of any other objects. Previous
methods to retrieve skyline points usually assume static data objects in the database (i.e.
their attribute vectors are fixed), whereas several recent work focus on skyline queries
with dynamic attributes. In this paper, we propose a novel variant of skyline queries,
namely metric skyline, whose dynamic attributes are defined in the metric space (i.e. not
limited to the Euclidean space). We illustrate an efficient and effective pruning
mechanism to answer metric skyline queries through a metric index. Most importantly,
we formalize the query performance of the metric skyline query in terms of the pruning
power, by a cost model, in light of which we construct an optimized metric index aiming
to maximize the pruning power of metric skyline queries. Extensive experiments have
demonstrated the efficiency and effectiveness of our proposed pruning techniques as well
as the constructed index in answering metric skyline queries.
69 On the Effect of Location Uncertainty in Spatial Querying

Frentzos, E. Gratsias, K. Theodoridis, Y.
Univ. of Piraeus, Piraeus;
On page(s): 366-383
ISSN: 1041-4347
Abstract
An emerging topic in the field of spatial data management is the handling of location
uncertainty of spatial objects, mainly due to inaccurate measurements. The literature on
location uncertainty so far has focused on modifying traditional spatial search algorithms
in order to handle the impact of objects' location uncertainty in query results. In this
paper, we present the first, to the best of our knowledge, theoretical analysis that
estimates the average number of false hits introduced in the results of rectangular range
queries in the case of data points uniformly distributed in 2D space. Then, we relax the
original distribution assumptions showing how to deal with arbitrarily distributed data
points and more realistic location uncertainty distributions. The accuracy of the results of
our analytical approach is demonstrated through an extensive experimental study using
various synthetic and real datasets. Our proposal can be directly employed in spatial
database systems in order to provide users with the accuracy of spatial query results
based only on known dataset and query parameters.
70
Distributed Skyline Retrieval with Low Bandwidth Consumption

Lin Zhu Yufei Tao Shuigeng Zhou
Fudan Univ., Shanghai;
On page(s): 384-400
ISSN: 1041-4347
Abstract
We consider skyline computation when the underlying data set is horizontally partitioned
onto geographically distant servers that are connected to the Internet. The existing
solutions are not suitable for our problem, because they have at least one of the following
drawbacks: (1) applicable only to distributed systems adopting vertical partitioning or
restricted horizontal partitioning, (2) effective only when each server has limited
computing and communication abilities, and (3) optimized only for skyline search in
subspaces but inefficient in the full space. This paper proposes an algorithm, called
feedback-based distributed skyline (FDS), to support arbitrary horizontal partitioning.
FDS aims at minimizing the network bandwidth, measured in the number of tuples
transmitted over the network. The core of FDS is a novel feedback-driven mechanism,
where the coordinator iteratively transmits certain feedback to each participant.
Participants can leverage such information to prune a large amount of local data, which
otherwise would need to be sent to the coordinator. Extensive experimentation confirms
that FDS significantly outperforms alternative approaches in both effectiveness and
progressiveness.
71
semQA: SPARQL with Idempotent Disjunction

Shironoshita, E.P. Jean-Mary, Y.R. Bradley, R.M. Kabuka, M.R.
INFO-TECH Soft Inc., Miami, FL;
On page(s): 401-414
ISSN: 1041-4347
Abstract
The SPARQL LeftJoin abstract operator is not distributive over Union; this limits the
algebraic manipulation of graph patterns, which in turn restricts the ability to create query
plans for distributed processing or query optimization. In this paper, we present semQA,
an algebraic extension for the SPARQL query language for RDF, which overcomes this
issue by transforming graph patterns through the use of an idempotent disjunction
operator Or as a substitute for Union. This permits the application of a set of equivalences
that transform a query into distinct forms. We further present an algorithm to derive the
solution set of the original query from the solution set of a query where Union has been
substituted by Or. We also analyze the combined complexity of SPARQL, proving it to
be NP-complete. It is also shown that the SPARQL query language is not, in the general
case, fixed-parameter tractable. Experimental results are presented to validate the query
evaluation methodology presented in this paper against the SPARQL standard, to
corroborate the complexity analysis, and to illustrate the gains in processing cost
reduction that can be obtained through the application of semQA.
72 Detecting, Assessing and Monitoring Relevant Topics in Virtual Information

Environments
Ontrup, J. Ritter, H. Scholz, S.W. Wagner, R.
Fac. of Technol., Bielefeld Univ., Bielefeld;
On page(s): 415-427
ISSN: 1041-4347
Abstract
The ability to assess the relevance of topics and related sources in information-rich
environments is a key to success when scanning business environments. This paper
introduces a hybrid system to support managerial information gathering. The system is
made up of three components: 1) a hierarchical hyperbolic SOM for structuring the
information environment and visualizing the intensity of news activity with respect to
identified topics, 2) a spreading activation network for the selection of the most relevant
information sources with respect to an already existing knowledge infrastructure, and 3)
measures of interestingness for association rules as well as statistical testing facilitates the
monitoring of already identified topics. Embedding the system by a framework
describing three modes of human information seeking behavior endorses an active
organization, exploration and selection of information that matches the needs of decision
makers in all stages of the information gathering process. By applying our system in the
domain of the hotel industry we demonstrate how typical information gathering tasks are
supported. Moreover, we present an empirical study investigating the effectiveness and
efficiency of the visualization framework of our system.
73 Distributional Features for Text Categorization

Xiao-Bing Xue Zhi-Hua Zhou
Nanjing Univ., Nanjing;
On page(s): 428-442
ISSN: 1041-4347
Abstract
Text categorization is the task of assigning predefined categories to natural language text.
With the widely used 'bag of words' representation, previous researches usually assign a
word with values such that whether this word appears in the document concerned or how
frequently this word appears. Although these values are useful for text categorization,
they have not fully expressed the abundant information contained in the document. This
paper explores the effect of other types of values, which express the distribution of a
word in the document. These novel values assigned to a word are called distributional
features, which include the compactness of the appearances of the word and the position
of the first appearance of the word. The proposed distributional features are exploited by
a tf idf style equation and different features are combined using ensemble learning
techniques. Experiments show that the distributional features are useful for text
categorization. In contrast to using the traditional term frequency values solely, including
the distributional features requires only a little additional cost, while the categorization
performance can be significantly improved. Further analysis shows that the distributional
features are especially useful when documents are long and the writing style is casual.
74
The Development of Fuzzy Rough Sets with the Use of Structures and Algebras of
Axiomatic Fuzzy Sets
Xiaodong Liu Pedrycz, W. Tianyou Chai Mingli Song
Res. Center of Inf. & Control, Dalian Univ. of Technol., Dalian;
On page(s): 443-462
ISSN: 1041-4347
Abstract
The notion of a rough set was originally proposed by Pawlak underwent a number of
extensions and generalizations. Dubois and Prade (1990) introduced fuzzy rough sets
which involve the use of rough sets and fuzzy sets within a single framework.
Radzikowska and Kerre (2002) proposed a broad family of fuzzy rough sets, referred to
as (phi, t)-fuzzy rough sets which are determined by some implication operator
(implicator) phi and a certain t-norm. In order to describe the linguistically represented
concepts coming from data available in some information system, the concept of fuzzy
rough sets are redefined and further studied in the setting of the axiomatic fuzzy set
(AFS) theory. Compared with the (phi, t)-fuzzy rough sets, the advantages of AFS fuzzy
rough sets are twofold. They can be directly applied to data analysis present in any
information system without resorting to the details concerning the choice of the
implication phi, t-norm and a similarity relation S. Furthermore such rough
approximations of fuzzy concepts come with a well-defined semantics and therefore offer
a sound interpretation. Some examples are included to illustrate the effectiveness of the
proposed construct. It is shown that the AFS fuzzy rough sets provide a far higher
flexibility and effectiveness in comparison with rough sets and some of their
generalizations.
75 Learning Image-Text Associations
Tao Jiang Ah-Hwee Tan
ecPresence Technol. Pte Ltd., Singapore;
Publication Date: Feb. 2009
On page(s): 161-177
ISSN: 1041-4347
Abstract
Web information fusion can be defined as the problem of collating and tracking
information related to specific topics on the World Wide Web. Whereas most existing
work on Web information fusion has focused on text-based multidocument
summarization, this paper concerns the topic of image and text association, a cornerstone
of cross-media Web information fusion. Specifically, we present two learning methods
for discovering the underlying associations between images and texts based on small
training data sets. The first method based on vague transformation measures the
information similarity between the visual features and the textual features through a set of
predefined domain-specific information categories. Another method uses a neural
network to learn direct mapping between the visual and textual features by automatically
and incrementally summarizing the associated features into a set of information
templates. Despite their distinct approaches, our experimental results on a terrorist
domain document set show that both methods are capable of learning associations
between images and texts from a small training data set.
76 Decompositional Rule Extraction from Support Vector Machines by Active Learning

Martens, D. Baesens, B.B. Van Gestel, T.
Dept. of Bus. Adm. & Public Manage., Hogeschool Gent, Ghent;
On page(s): 178-191
ISSN: 1041-4347
Abstract
Support vector machines (SVMs) are currently state-of-the-art for the classification task
and, generally speaking, exhibit good predictive performance due to their ability to model
nonlinearities. However, their strength is also their main weakness, as the generated
nonlinear models are typically regarded as incomprehensible black-box models. In this
paper, we propose a new active learning-based approach (ALBA) to extract
comprehensible rules from opaque SVM models. Through rule extraction, some insight is
provided into the logics of the SVM model. ALBA extracts rules from the trained SVM
model by explicitly making use of key concepts of the SVM: the support vectors, and the
observation that these are typically close to the decision boundary. Active learning
implies the focus on apparent problem areas, which for rule induction techniques are the
regions close to the SVM decision boundary where most of the noise is found. By
generating extra data close to these support vectors that are provided with a class label by
the trained SVM model, rule induction techniques are better able to discover suitable
discrimination rules. This performance increase, both in terms of predictive accuracy as
comprehensibility, is confirmed in our experiments where we apply ALBA on several
publicly available data sets.
77
Multiclass MTS for Simultaneous Feature Selection and Classification

Chao-Ton Su Yu-Hsiang Hsiao
Nat. Tsing Hua Univ., Hsinchu;
On page(s): 192-205
ISSN: 1041-4347
Abstract
Multiclass Mahalanobis-Taguchi system (MMTS), the extension of MTS, is developed
for simultaneous multiclass classification and feature selection. In MMTS, the multiclass
measurement scale is constructed by establishing an individual Mahalanobis space for
each class. To increase the validity of the measurement scale, the Gram-Schmidt process
is performed to mutually orthogonalize the features and eliminate the multicollinearity.
The important features are identified using the orthogonal arrays and the signal-to-noise
ratio, and are then used to construct a reduced model measurement scale. The
contribution of each important feature to classification is also derived according to the
effect gain to develop a weighted Mahalanobis distance which is finally used as the
distance metric for the classification of MMTS. Using the reduced model measurement
scale, an unknown example will be classified into the class with minimum weighted
Mahalanobis distance considering only the important features. For evaluating the
effectiveness of MMTS, a numerical experiment is implemented, and the results show
that MMTS outperforms other well-known algorithms not only on classification accuracy
but also on feature selection efficiency. Finally, a real case about gestational diabetes
mellitus is studied, and the results indicate the practicality of MMTS in real-world
applications.
78 k-Anonymization with Minimal Loss of Information

Gionis, A. Tassa, T.
Yahoo! Res., Barcelona;
On page(s): 206-219
ISSN: 1041-4347
Abstract
The technique of k-anonymization allows the releasing of databases that contain personal
information while ensuring some degree of individual privacy. Anonymization is usually
performed by generalizing database entries. We formally study the concept of
generalization, and propose three information-theoretic measures for capturing the
amount of information that is lost during the anonymization process. The proposed
measures are more general and more accurate than those that were proposed by Meyerson
and Williams and Aggarwal et al. We study the problem of achieving k-anonymity with
minimal loss of information. We prove that it is NP-hard and study polynomial
approximations for the optimal solution. Our first algorithm gives an approximation
guarantee of O(ln k) for two of our measures as well as for the previously studied
measures. This improves the best-known O(k)-approximation in. While the previous
approximation algorithms relied on the graph representation framework, our algorithm
relies on a novel hypergraph representation that enables the improvement in the
approximation ratio from O(k) to O(ln k). As the running time of the algorithm is O(n2k}),
we also show how to adapt the algorithm in in order to obtain an O(k)-approximation
algorithm that is polynomial in both n and k.
79 Cost-Based Predictive Spatiotemporal Join

Wook-Shin Han Jaehwa Kim Byung Suk Lee Yufei Tao Rantzau, R. Markl, V.
Dept. of Comput. Eng., Kyungpook Nat. Univ., Daegu;
On page(s): 220-233
ISSN: 1041-4347
Abstract
A predictive spatiotemporal join finds all pairs of moving objects satisfying a join
condition on future time and space. In this paper, we present CoPST, the first and
foremost algorithm for such a join using two spatiotemporal indexes. In a predictive
spatiotemporal join, the bounding boxes of the outer index are used to perform window
searches on the inner index, and these bounding boxes enclose objects with increasing
laxity over time. CoPST constructs globally tightened bounding boxes "on the fly" to
perform window searches during join processing, thus significantly minimizing overlap
and improving the join performance. CoPST adapts gracefully to large-scale databases,
by dynamically switching between main-memory buffering and disk-based buffering,
through a novel probabilistic cost model. Our extensive experiments validate the cost
model and show its accuracy for realistic data sets. We also showcase the superiority of
CoPST over algorithms adapted from state-of-the-art spatial join algorithms, by a
speedup of up to an order of magnitude.
80 BMQ-Processor: A High-Performance Border-Crossing Event Detection Framework

for Large-Scale Monitoring Applications
Jinwon Lee Seungwoo Kang Youngki Lee Sang Jeong Lee Junehwa Song
Div. of Comput. Sci., Korea Adv. Inst. of Sci. & Technol. (KAIST), Daejeon;
On page(s): 234-252
ISSN: 1041-4347
Abstract
In this paper, we present BMQ-Processor, a high-performance border-crossing event
(BCE) detection framework for large-scale monitoring applications. We first characterize
a new query semantics, namely, border monitoring query (BMQ), which is useful for
BCE detection in many monitoring applications. It monitors the values of data streams
and reports them only when data streams cross the borders of its range. We then propose
BMQ-Processor to efficiently handle a large number of BMQs over a high volume of
data streams. BMQ-Processor efficiently processes BMQs in a shared and incremental
manner. It develops and operates over a novel stateful query index, achieving a high level
of scalability over continuous data updates. Also, it utilizes the locality embedded in data
streams and greatly accelerates successive BMQ evaluations. We present data structures
and algorithms to support 1D as well as multidimensional BMQs. We show that the
semantics of border monitoring can be extended toward more advanced ones and build
region transition monitoring as a sample case. Lastly, we demonstrate excellent
processing performance and low storage cost of BMQ-Processor through extensive
analysis and experiments.
81 Privacy-Preserving Kth Element Score over Vertically Partitioned Data

Vaidya, J. Clifton, C.W.
Manage. Sci. & Inf. Syst. Dept., Rutgers Univ., Newark, NJ;
On page(s): 253-258
ISSN: 1041-4347
Abstract
Given a large integer data set shared vertically by two parties, we consider the problem of
securely computing a score separating the kth and the (k + 1) to compute such a score
while revealing little additional information. The proposed protocol is implemented using
the Fairplay system and experimental results are reported. We show a real application of
this protocol as a component used in the secure processing of top-k queries over
vertically partitioned data.
82 Semantic Access to Multichannel M-Services

Xu Yang Bouguettaya, A.
Spirent Commun., Rockville, MD;
On page(s): 259-272
ISSN: 1041-4347
Abstract
M-services provide mobile users wireless access to Web services. In this paper, we
present a novel infrastructure for supporting M-services in wireless broadcast systems.
The proposed infrastructure provides a generic framework for mobile users to look up,
access, and execute Web services over wireless broadcast channels. Access efficiency is
an important issue in wireless broadcast systems. We discuss different semantics that
have impact on the access efficiency for composite M-services. A multiprocess workflow
is proposed for effectively accessing composite M-services from multiple broadcast
channels based on these semantics. We also present and compare different broadcast
channel organizations for M-services and wireless data. Analytical models are provided
for these channel organizations. Practical studies are presented to demonstrate the impact
of different semantics and channel organizations on the access efficiency.
83 Online Scheduling Sequential Objects with Periodicity for Dynamic Information
Dissemination
Chih-Lin Hu Ming-Syan Chen
Dept. of Commun. Eng., Nat. Central Univ., Jhongli;
On page(s): 273-286
ISSN: 1041-4347
Abstract
The scalability of data broadcasting has been manifested by prior studies on the base of
the traditional data management systems where data objects, mapped to a pair of state
and value in the database, are independent, persistent, and static against simple queries.
However, many modern information applications spread dynamic data objects and
process complex queries for retrieving multiple data objects. Particularly, the information
servers dynamically generate data objects that are dependent and can be associated into a
complete response against complex queries. Accordingly, the study in this paper
considers the problem of scheduling dynamic broadcast data objects in a clients-
providers-servers system from the standpoint of data association, dependency, and
dynamics. Since the data broadcast problem is NP-hard, we derive the lower and the
upper bounds of the mean service access time. In light of the theoretical analyses, we
further devise a deterministic algorithm with several gain measure functions for the
approximation of schedule optimization. The experimental results show that the proposed
algorithm is able to generate a dynamic broadcast schedule and also minimize the mean
service access time to the extent of being very close to the theoretical optimum.
84 Storing and Indexing Spatial Data in P2P Systems

Kantere, V. Skiadopoulos, S. Sellis, T.
Ecole Polytech. Fed. de Lausanne, Lausanne;
On page(s): 287-300
ISSN: 1041-4347
Abstract
The peer-to-peer (P2P) paradigm has become very popular for storing and sharing
information in a totally decentralized manner. At first, research focused on P2P systems
that host 1D data. Nowadays, the need for P2P applications with multidimensional data
has emerged, motivating research on P2P systems that manage such data. The majority of
the proposed techniques are based either on the distribution of centralized indexes or on
the reduction of multidimensional data to one dimension. Our goal is to create from
scratch a technique that is inherently distributed and also maintains the
multidimensionality of data. Our focus is on structured P2P systems that share spatial
information. We present SpatialP2P, a totally decentralized indexing and searching
framework that is suitable for spatial data. SpatialP2P supports P2P applications in which
spatial information of various sizes can be dynamically inserted or deleted, and peers can
join or leave. The proposed technique preserves well locality and directionality of space.
85 Unsupervised Multiway Data Analysis: A Literature Survey

Acar, E. Yener, B.
Dept. of Comput. Sci., Rensselaer Polytech. Inst., Troy, NY;
Publication Date: Jan. 2009
On page(s): 6-20
ISSN: 1041-4347
Abstract
Two-way arrays or matrices are often not enough to represent all the information in the
data and standard two-way analysis techniques commonly applied on matrices may fail to
find the underlying structures in multi-modal datasets. Multiway data analysis has
recently become popular as an exploratory analysis tool in discovering the structures in
higher-order datasets, where data have more than two modes. We provide a review of
significant contributions in the literature on multiway models, algorithms as well as their
applications in diverse disciplines including chemometrics, neuroscience, social network
analysis, text mining and computer vision.
86 Comparing Scores Intended for Ranking

Bhamidipati, N.L. Pal, S.K.
Data Min. & Res. Group, Yahoo! Software Dev. India Pvt. Ltd., Bangalore;
On page(s): 21-34
ISSN: 1041-4347
Abstract
Often, ranking is performed on the the basis of some scores available for each item. The
existing practice for comparing scoring functions is to compare the induced rankings by
one of the multitude of rank comparison methods available in the literature. We suggest
that it may be better to compare the underlying scores themselves. To this end, a
generalized Kendall distance is defined, which takes into consideration not only the final
ordering that the two schemes produce, but also at the spacing between pairs of scores.
This is shown to be equivalent to comparing the scores after fusing with another set of
scores, making it theoretically interesting. A top k version of the score comparison
methodology is also provided. Experimental results clearly show the advantages score
comparison has over rank comparison.
87 Online Skyline Analysis with Dynamic Preferences on Nominal Attributes
Wong, R.C.-W. Jian Pei Fu, A.W.-C. Ke Wang
Dept. of Comput. Sci. & Eng., Hong Kong Univ. of Sci. & Technol., Kowloon;
On page(s): 35-49
ISSN: 1041-4347
Abstract
The importance of skyline analysis has been well recognized in multi-criteria decision
making applications. All of the previous studies assume a fixed order on the attributes in
question. However, in some applications, users may be interested in skylines with respect
to various total or partial orders on nominal attributes. In this paper, we identify and
tackle the problem of online skyline analysis with dynamic preferences on nominal
attributes. We investigate how changes of orders in attributes lead to changes of skylines.
We address two novel types of interesting queries: a viewpoint query returns with respect
to which orders a point is (or is not) in the skylines and an order-based skyline query
retrieves the skyline with respect to a specific order. We develop two methods
systematically and report an extensive performance study using both synthetic and real
data sets to verify their effectiveness and efficiency.
88 Self-Learning Disk Scheduling

Yu Zhang Bhargava, B.
Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN;
On page(s): 50-65
ISSN: 1041-4347
Abstract
Performance of disk I/O schedulers is affected by many factors, such as workloads, file
systems, and disk systems. Disk scheduling performance can be improved by tuning
scheduler parameters, such as the length of read timers. Scheduler performance tuning is
mostly done manually. To automate this process, we propose four self-learning disk
scheduling schemes: change-sensing Round-Robin, feedback learning, per-request
learning, and two-layer learning. experiments show that the novel two-layer learning
scheme performs best. It integrates the workload-level and request-level learning
algorithms. It employs feedback learning techniques to analyze workloads, change
scheduling policy, and tune scheduling parameters automatically. We discuss schemes to
choose features for workload learning, divide and recognize workloads, generate training
data, and integrate machine learning algorithms into the two-layer learning scheme. We
conducted experiments to compare the accuracy, performance, and overhead of five
machine learning algorithms: decision tree, logistic regression, naive Bayes, neural
network, and support vector machine algorithms. Experiments with real-world and
synthetic workloads show that self-learning disk scheduling can adapt to a wide variety
of workloads, file systems, disk systems, and user preferences. It outperforms existing
disk schedulers by as much as 15.8% while consuming less than 3%-5% of CPU time.
89
Discriminative Training of the Hidden Vector State Model for Semantic Parsing
Deyu Zhou Yulan He
Inf. Res. Centre, Univ. of Reading, Reading;
On page(s): 66-77
ISSN: 1041-4347
Abstract
In this paper, we discuss how discriminative training can be applied to the hidden vector
state (HVS) model in different task domains. The HVS model is a discrete hidden
Markov model (HMM) in which each HMM state represents the state of a push-down
automaton with a finite stack size. In previous applications, maximum-likelihood
estimation (MLE) is used to derive the parameters of the HVS model. However, MLE
makes a number of assumptions and unfortunately some of these assumptions do not
hold. Discriminative training, without making such assumptions, can improve the
performance of the HVS model by discriminating the correct hypothesis from the
competing hypotheses. Experiments have been conducted in two domains: the travel
domain for the semantic parsing task using the DARPA Communicator data and the Air
Travel Information Services (ATIS) data and the bioinformatics domain for the
information extraction task using the GENIA corpus. The results demonstrate modest
improvements of the performance of the HVS model using discriminative training. In the
travel domain, discriminative training of the HVS model gives a relative error reduction
rate of 31 percent in F-measure when compared with MLE on the DARPA
Communicator data and 9 percent on the ATIS data. In the bioinformatics domain, a
relative error reduction rate of 4 percent in F-measure is achieved on the GENIA corpus.
90
SPOT Databases: Efficient Consistency Checking and Optimistic Selection in

Probabilistic Spatial Databases
Parker, A. Infantes, G. Grant, J. Subrahmanian, V.S.
Dept. of Comput. Sci., Univ. of Maryland, College Park, MD;
On page(s): 92-107
ISSN: 1041-4347
Abstract
Spatial probabilistic temporal (SPOT) databases are a paradigm for reasoning with
probabilistic statements about where a vehicle may be now or in the future. They express
statements of the form "Object O is in spatial region R at some time t with some
probability in the interval [L,U]." Past work on SPOT databases has developed selection
operators based on selecting SPOT atoms that are entailed by the SPOT database-we call
this "cautious" selection. In this paper, we study several problems. First, we note that the
runtime of consistency checking and cautious selection algorithms in past work is
influenced greatly by the granularity of the underlying Cartesian space. In this paper, we
first introduce the notion of "optimistic" selection, where we are interested in returning
all SPOT atoms in a database that are consistent with respect to a query, rather than
having an entailment relationship. We then develop an approach to scaling SPOT
databases that has three main contributions: 1) We develop methods to eliminate
variables from the linear programs used in past work, thus greatly reducing the size of the
linear programs used-the resulting advances apply to consistency checking, optimistic
selection, and cautious selection. 2) We develop a host of theorems to show how we can
prune the search space when we are interested in optimistic selection. 3) We use the
above contributions to build an efficient index to execute optimistic selection queries
over SPOT databases. Our approach is superior to past work in two major respects: First,
it makes fewer assumptions than all past works on this topic except that in. Second, our
experiments, which are based on real-world data about ship movements, show that our
algorithms are much more efficient than those in.
91. Efficient Evaluation of Probabilistic Advanced Spatial Queries on Existentially

Uncertain Data
Man Lung Yiu Mamoulis, N. Xiangyuan Dai Yufei Tao Vaitis, M.
Dept. of Comput. Sci., Aalborg Univ., Aalborg;
On page(s): 108-122
ISSN: 1041-4347
Abstract
We study the problem of answering spatial queries in databases where objects exist with
some uncertainty and they are associated with an existential probability. The goal of a
thresholding probabilistic spatial query is to retrieve the objects that qualify the spatial
predicates with probability that exceeds a threshold. Accordingly, a ranking probabilistic
spatial query selects the objects with the highest probabilities to qualify the spatial
predicates. We propose adaptations of spatial access methods and search algorithms for
probabilistic versions of range queries, nearest neighbors, spatial skylines, and reverse
nearest neighbors and conduct an extensive experimental study, which evaluates the
effectiveness of proposed solutions.
92
A Relation-Based Page Rank Algorithm for Semantic Web Search Engines

Lamberti, F. Sanna, A. Demartini, C.
Dipt. di Autom. ed Inf., Politec. di Torino, Torino;
On page(s): 123-136
ISSN: 1041-4347
Abstract
With the tremendous growth of information available to end users through the Web,
search engines come to play ever a more critical role. Nevertheless, because of their
general-purpose approach, it is always less uncommon that obtained result sets provide a
burden of useless pages. The next-generation Web architecture, represented by the
Semantic Web, provides the layered architecture possibly allowing overcoming this
limitation. Several search engines have been proposed, which allow increasing
information retrieval accuracy by exploiting a key content of semantic Web resources,
that is, relations. However, in order to rank results, most of the existing solutions need to
work on the whole annotated knowledge base. In this paper, we propose a relation-based
page rank algorithm to be used in conjunction with semantic Web search engines that
simply relies on information that could be extracted from user queries and on annotated
resources. Relevance is measured as the probability that a retrieved resource actually
contains those relations whose existence was assumed by the user at the time of query
definition.
93 CDNs Content Outsourcing via Generalized Communities

Katsaros, D. Pallis, G. Stamos, K. Vakali, A. Sidiropoulos, A. Manolopoulos, Y.
Dept. of Comput. & Commun. Eng., Thessaly Univ., Volos;
On page(s): 137-151
ISSN: 1041-4347
Abstract
Content distribution networks (CDNs) balance costs and quality in services related to
content delivery. Devising an efficient content outsourcing policy is crucial since, based
on such policies, CDN providers can provide client-tailored content, improve
performance, and result in significant economical gains. Earlier content outsourcing
approaches may often prove ineffective since they drive prefetching decisions by
assuming knowledge of content popularity statistics, which are not always available and
are extremely volatile. This work addresses this issue, by proposing a novel self-adaptive
technique under a CDN framework on which outsourced content is identified with no a-
priori knowledge of (earlier) request statistics. This is employed by using a structure-
based approach identifying coherent clusters of "correlated" Web server content objects,
the so-called Web page communities. These communities are the core outsourcing unit
and in this paper a detailed simulation experimentation has shown that the proposed
technique is robust and effective in reducing user-perceived latency as compared with
competing approaches, i.e., two communities-based approaches, Web caching, and non-
CDN.
SEABIRDS GROUP & CO

HEAD OFFICE ADDRESS
4th FLOOR SURYA COMPLEX,
SINGARATHOPE BUS STOP,,OLD MADURAI ROAD,
TRICHY- 620002
09003012150
BRANCH OFFICE
NEAR TO AMBAL MATRICULATION SCHOOL
No:23,2nd Main Road, Jagannatha Nagar
Arumbakkam, CHENNAI-106
09944361169

Ieee 2011 Java & .Net Projects at SBGC (Chennai, Trichy, India, Tamilnadu

Transféré par

Informations du document

Titre original

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

Ieee 2011 Java & .Net Projects at SBGC (Chennai, Trichy, India, Tamilnadu

Transféré par

Droits d'auteur :

Formats disponibles

SEABIRDS GROUP & CO

HEAD OFFICE ADDRESS

4th FLOOR SURYA COMPLEX,

SINGARATHOPE BUS STOP,,OLD MADURAI ROAD,

NEAR TO AMBAL MATRICULATION SCHOOL

No:23,2nd Main Road, Jagannatha Nagar

1. Multirelational k-Anonymity --- dotnet/java

2.Olex: Effective Rule Learning for Text Categorization – java /dotnet

3..Ranking and Suggesting Popular Items -java /dotnet

4.Similarity-Profiled Temporal Association Mining –java/dotnet

5.RiMOM: A Dynamic Multistrategy Ontology Alignment Framework –java /dotnet

Li, Juanzi Tang, Jie Li, Yi Luo, Qiong

6.IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL.

Abstract—Today, role-based access control (RBAC) has become a well-accepted

7. On the Effect of Location Uncertainty in Spatial Querying –java

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21,

9. SPOT Databases: Efficient Consistency Checking and Optimistic Selection in

Robust Blind Watermarking of Point-Sampled Geometry

11. A Hypothesis Testing Approach to Semifragile Watermark-Based

This paper studies the problem of achieving watermark semi-fragility in multimedia

12 Spread-Spectrum Watermarking Security –java./dotnet

13.A Game Theoretical Framework on Intrusion Detection in Heterogeneous

14 Unseen Visible Watermarking: A Novel Methodology for Auxiliary Information

15 A Fast Search Algorithm for a Large Fuzzy Database –java/dotnet

17 Watermarking Robustness Evaluation Based on Perceptual Quality via Genetic

18 Low-Complexity Iris Coding and Recognition Based on Directionlets –java

19 Lanczos Vectors versus Singular Vectors for Effective Dimension Reduction –

20 Optimal-Location-Selection Query Processing in Spatial Databases –java

21 Evaluating the Effectiveness of Personalized Web Search –java/dotnet

22 A Divide-and-Conquer Approach for Minimum Spanning Tree-Based

23 Determining Attributes to Maximize Visibility of Objects –java

24 Efficient Skyline Computation in Structured Peer-to-Peer Systems –java

25 A Communication Perspective on Automatic Text Categorization –java/dotnet

26 Predicting Missing Items in Shopping Carts –j2ee/dotnet

27 Discovery of Structural and Functional Features in RNA Pseudoknots-

28 Predictive Ensemble Pruning by Expectation Propagation –java

29 Rough Cluster Quality Index Based on Decision Theory-java/dotnet

30 ANGEL: Enhancing the Utility of Generalization for Privacy Preserving

31 Clustering and Sequential Pattern Mining of Online Collaborative Learning

32 Monitoring Online Tests through Data Visualization –j2ee/dotnet

Communities and Emerging Semantics in Semantic Link Network: Discovery and

34 Toward a Fuzzy Domain Ontology Extraction Method for Adaptive e-Learning

36 NNexus: An Automatic Linker for Collaborative Web-Based Corpora

37 Effective Collaboration with Information Sharing in Virtual Universities

38 Learning in an Ambient Intelligent World: Enabling Technologies and Practices

39Interactive Correction and Recommendation for Computer Language Learning and

40 Subontology-Based Resource Management for Web-Based e-Learning

41 Enhancing Learning Objects with an Ontology-Based Memory

43 An Implementation of the CORDRA Architecture Enhanced for Systematic

44 On the Expressive Power of the Relational Algebra on Finite Sets of Relation

46 Bayes Vector Quantizer for Class-Imbalance Problem

47 Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data

48 GridVideo: A Practical Example of Nonscientific Application on the Grid

49 Hierarchically Distributed Peer-to-Peer Document Clustering and Cluster

50 Exact Knowledge Hiding through Database Extension

GLIP: A Concurrency Control Protocol for Clipping Indexing

52 Fast Query Point Movement Techniques for Large CBIR Systems

53 Schema Vacuuming in Temporal Databases

54 The Subgraph Similarity Problem

55 Histogram-Based Global Load Balancing in Structured Peer-to-Peer Systems

Progressive Parametric Query Optimization

Multiscale Representations for Fast Pattern Matching in Stream Time Series

58 Optimal Lot Sizing Policies For Sequential Online Auctions