
Clustering with Multiviewpoint-Based Similarity Measure

Abstract:
All clustering methods have to assume some cluster relationship among the data objects
that they are applied on. Similarity between a pair of objects can be defined either explicitly or
implicitly. In this paper, we introduce a novel multiviewpoint-based similarity measure and two
related clustering methods. The major difference between a traditional dissimilarity/similarity
measure and ours is that the former uses only a single viewpoint, which is the origin, while the
latter utilizes many different viewpoints, which are objects assumed to not be in the same cluster
with the two objects being measured. Using multiple viewpoints, more informative assessment of
similarity could be achieved. Theoretical analysis and empirical study are conducted to support
this claim. Two criterion functions for document clustering are proposed based on this new
measure. We compare them with several well-known clustering algorithms that use other popular
similarity measures on various document collections to verify the advantages of our proposal.

INTRODUCTION
CLUSTERING is one of the most interesting and important topics in data mining. The aim of
clustering is to find intrinsic structures in data, and organize them into meaningful subgroups for
further study and analysis. There have been many clustering algorithms published every year.
They can be proposed for very distinct research fields, and developed using totally different
techniques and approaches. Nevertheless, according to a recent study, more than half a century
after it was introduced, the simple k-means algorithm still remains one of the top 10 data
mining algorithms. It is the most frequently used partitional clustering algorithm in
practice. Another recent scientific discussion [2] states that k-means is the favorite algorithm that
practitioners in the related fields choose to use. Needless to say, k-means has more than a
few basic drawbacks, such as sensitivity to initialization and to cluster size, and its
performance can be worse than other state-of-the-art algorithms in many domains. In spite of
that, its simplicity, understandability, and scalability are the reasons for its tremendous popularity.
An algorithm with adequate performance and usability in most application scenarios could be
preferable to one with better performance in some cases but limited usage due to high
complexity. While offering reasonable results, k-means is fast and easy to combine with other
methods in larger systems.
A common approach to the clustering problem is to treat it as an optimization process. An
optimal partition is found by optimizing a particular function of similarity (or distance) among
the data. Basically, there is an implicit assumption that the true intrinsic structure of the data
can be correctly described by the similarity formula defined and embedded in the clustering
criterion function. Hence, the effectiveness of clustering algorithms under this approach depends
on the appropriateness of the similarity measure to the data at hand. For instance, the original
k-means has a sum-of-squared-error objective function that uses Euclidean distance. In a very
sparse and high-dimensional domain like text documents, spherical k-means, which uses cosine
similarity (CS) instead of Euclidean distance as the measure, is deemed more suitable. Banerjee
et al. showed that Euclidean distance is in fact one particular member of a class of distance
measures called Bregman divergences. They proposed the Bregman hard-clustering algorithm, in
which any Bregman divergence can be applied. Kullback-Leibler divergence, a special case of
Bregman divergence, was reported to give good clustering results on document data sets; it is a
good example of a nonsymmetric measure. Also on the topic of capturing dissimilarity in data,
Pekalska et al. found

that the discriminative power of some distance measures could increase when their
non-Euclidean and nonmetric attributes were increased. They concluded that non-Euclidean and
nonmetric measures could be informative for statistical learning of data. Pelillo even argued
that the symmetry and nonnegativity assumptions of similarity measures are actually a limitation
of current state-of-the-art clustering approaches. At the same time, clustering still requires more
robust dissimilarity or similarity measures; recent works such as [8] illustrate this need.
The work in this paper is motivated by the above and similar research findings. It appears to us
that the nature of the similarity measure plays a very important role in the success or failure of
a clustering method. Our first objective is to derive a novel method for measuring similarity
between data objects in sparse and high-dimensional domains, particularly text documents. From
the proposed similarity measure, we then formulate new clustering criterion functions and
introduce their respective clustering algorithms, which are fast and scalable like k-means, but are
also capable of providing high-quality and consistent performance.
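To make the contrast between the two measures concrete, the following sketch (our own illustration, not code from the paper) computes Euclidean distance and cosine similarity for two sparse term-frequency vectors; the vectors and class name are hypothetical:

```java
// Illustration: Euclidean distance vs. cosine similarity on term-frequency vectors.
public class SimilarityDemo {
    // Euclidean (L2) distance between two equal-length vectors.
    public static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: dot product divided by the product of the norms.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Two documents with identical term proportions but different lengths.
        double[] d1 = {2, 0, 4, 0};
        double[] d2 = {4, 0, 8, 0};
        System.out.println("euclidean = " + euclidean(d1, d2)); // sensitive to length
        System.out.println("cosine    = " + cosine(d1, d2));    // 1.0: direction only
    }
}
```

The two documents use the same terms in the same proportions; cosine similarity judges them identical (1.0), while Euclidean distance separates them purely because of document length, which is why cosine similarity is preferred for sparse text data.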

Existing System
A common approach to the clustering problem is to treat it as an optimization process. An
optimal partition is found by optimizing a particular function of similarity (or distance) among
the data. Basically, there is an implicit assumption that the true intrinsic structure of the data
can be correctly described by the similarity formula defined and embedded in the clustering
criterion function. Hence, the effectiveness of clustering algorithms under this approach depends
on the appropriateness of the similarity measure to the data at hand. For instance, the original
k-means has a sum-of-squared-error objective function that uses Euclidean distance. In a very
sparse and high-dimensional domain like text documents, spherical k-means, which uses cosine
similarity (CS) instead of Euclidean distance as the measure, is deemed more suitable.

Proposed System:
The work in this paper is motivated by the above and similar research findings. It appears to us
that the nature of the similarity measure plays a very important role in the success or failure of a
clustering method. Our first objective is to derive a novel method for measuring similarity
between data objects in sparse and high-dimensional domains, particularly text documents. From
the proposed similarity measure, we then formulate new clustering criterion functions and
introduce their respective clustering algorithms, which are fast and scalable like k-means, but are
also capable of providing high-quality and consistent performance.
Software Requirement Specification
Software Specification
Operating System : Windows XP
Technology       : JAVA 1.6, JFreeChart

Minimum Hardware Specification
Processor : Pentium IV
RAM       : 512 MB
Hard Disk : 80 GB

Modules
Select File
Process
Histogram
Clusters
Similarity
Result

TECHNOLOGIES USED
4.1 Introduction To Java:
Java has been around since 1991, developed by a small team of Sun Microsystems
developers in a project originally called the Green project. The intent of the project was to
develop a platform-independent software technology that would be used in the consumer
electronics industry. The language that the team created was originally called Oak.
The first implementation of Oak was in a PDA-type device called Star Seven (*7) that
consisted of the Oak language, an operating system called GreenOS, a user interface, and
hardware. The name *7 was derived from the telephone sequence that was used in the team's
office and that was dialed in order to answer any ringing telephone from any other phone in the
office.
Around the time the FirstPerson project was floundering in consumer electronics, a new
craze was gaining momentum in America; the craze was called "Web surfing." The World Wide
Web, a name applied to the Internet's millions of linked HTML documents, was suddenly
becoming popular for use by the masses. The reason for this was the introduction of a graphical
Web browser called Mosaic, developed by NCSA. The browser simplified Web browsing by

combining text and graphics into a single interface to eliminate the need for users to learn many
confusing UNIX and DOS commands. Navigating around the Web was much easier using
Mosaic.
It has only been since 1994 that Oak technology has been applied to the Web. In 1994,
two Sun developers created the first version of HotJava, then called WebRunner, a
graphical browser for the Web that exists today. The browser was coded entirely in the Oak
language, by this time renamed Java. Soon after, the Java compiler was rewritten in the Java
language from its original C code, thus proving that Java could be used effectively as an
application language. Sun introduced Java in May 1995 at the SunWorld 95 convention.

Web surfing has become an enormously popular practice among millions of computer
users. Until Java, however, the content of information on the Internet has been a bland series of
HTML documents. Web users are hungry for applications that are interactive, that users can
execute no matter what hardware or software platform they are using, and that travel across
heterogeneous networks and do not spread viruses to their computers. Java can create such
applications.
The Java programming language is a high-level language that can be characterized by all
of the following buzzwords:

Simple

Architecture neutral

Object oriented

Portable

Distributed

High performance

Interpreted

Multithreaded

Robust

Dynamic

Secure
With most programming languages, you either compile or interpret a program so that you

can run it on your computer. The Java programming language is unusual in that a program is
both compiled and interpreted. With the compiler, you first translate a program into an
intermediate language called Java bytecodes: the platform-independent codes interpreted by
the interpreter on the Java platform. The interpreter parses and runs each Java bytecode
instruction on the computer. Compilation happens just once; interpretation occurs each time the
program is executed. The following figure illustrates how this works.

Figure 4.1: Working Of Java


You can think of Java bytecodes as the machine code instructions for the Java virtual
machine (Java VM). Every Java interpreter, whether it's a development tool or a Web browser
that can run applets, is an implementation of the Java VM. Java bytecodes help make "write
once, run anywhere" possible. You can compile your program into bytecodes on any platform
that has a Java compiler. The bytecodes can then be run on any implementation of the Java VM.
That means that as long as a computer has a Java VM, the same program written in the Java
programming language can run on Windows 2000, a Solaris workstation, or an iMac.
The Java Platform:
A platform is the hardware or software environment in which a program runs. We've
already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and
MacOS. Most platforms can be described as a combination of the operating system and
hardware. The Java platform differs from most other platforms in that it's a software-only
platform that runs on top of other hardware-based platforms.

The Java platform has two components:


The Java virtual machine (Java VM)
The Java application programming interface (Java API)
You've already been introduced to the Java VM. It's the base for the Java platform and is
ported onto various hardware-based platforms.
The Java API is a large collection of ready-made software components that provide many
useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into
libraries of related classes and interfaces; these libraries are known as packages. The next
section, "What Can Java Technology Do?", highlights what functionality some of the packages in
the Java API provide.
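As a small example of using ready-made components from an API package (here `java.util`, chosen purely for illustration; the class and method names below are our own):

```java
// Using ready-made classes from the java.util package of the Java API.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PackageDemo {
    // Builds a list of words and sorts it alphabetically using library code.
    public static List<String> sortedWords() {
        List<String> words = new ArrayList<>();
        Collections.addAll(words, "clustering", "similarity", "measure");
        Collections.sort(words); // natural (alphabetical) ordering
        return words;
    }

    public static void main(String[] args) {
        System.out.println(sortedWords()); // [clustering, measure, similarity]
    }
}
```

Everything here except the `main` scaffolding is supplied by the `java.util` package; no sorting or list code had to be written by hand.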
The following figure depicts a program that's running on the Java platform. As the figure
shows, the Java API and the virtual machine insulate the program from the hardware.

Figure 4.2: The Java Platform


Native code is code that, once compiled, runs directly on a specific
hardware platform. As a platform-independent environment, the Java platform can be a bit
slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time
bytecode compilers can bring performance close to that of native code without threatening
portability.
Working Of Java:
For those who are new to object-oriented programming, the concept of a class will be
new to you. Simplistically, a class is the definition for a segment of code that can contain both
data and functions. When the interpreter executes a class, it looks for a particular method by the
name of main, which will sound familiar to C programmers. The main method is passed an
array of strings as a parameter (similar to the argv[] of C), and is declared as a static method.
To output text from the program, we execute the println method of System.out, which is
Java's output stream. UNIX users will appreciate the theory behind such a stream, as it is actually
standard output. For those who are instead used to the Wintel platform, it will write the string
passed to it to the user's screen.
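A minimal class of this shape (our own example; the class name is hypothetical) looks like:

```java
// Minimal Java program: the interpreter looks for a static main method
// that is passed an array of strings (similar to C's argv[]).
public class HelloJava {
    public static String greeting() {
        return "Hello from Java";
    }

    public static void main(String[] args) {
        // System.out is the standard output stream; println writes one line to it.
        System.out.println(greeting());
        System.out.println("arguments received: " + args.length);
    }
}
```

Compiled with `javac HelloJava.java` and run with `java HelloJava`, this prints the greeting followed by the number of command-line arguments.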

4.2 Swing:
Introduction To Swing:
Swing contains all the components. It's a big library, but it's designed to have
appropriate complexity for the task at hand: if something is simple, you don't have to write
much code, but as you try to do more, your code becomes increasingly complex. This means an
easy entry point, but you've got the power if you need it.
Swing has great depth. This section does not attempt to be comprehensive, but instead
introduces the power and simplicity of Swing to get you started using the library. Please be
aware that what you see here is intended to be simple. If you need to do more, then Swing can
probably give you what you want if you're willing to do the research by hunting through the
online documentation from Sun.
Benefits Of Swing:
Swing components are Beans, so they can be used in any development environment that
supports Beans. Swing provides a full set of UI components. For speed, all the components are
lightweight, and Swing is written entirely in Java for portability.
Another benefit of Swing is orthogonality of use: once you pick up the general ideas
about the library, you can apply them everywhere, primarily because of the Beans naming
conventions.
Keyboard navigation is automatic: you can use a Swing application without the mouse,
and you don't have to do any extra programming. Scrolling support is effortless: you simply
wrap your component in a JScrollPane as you add it to your form. Other features such as tool tips
typically require a single line of code to implement.
Swing also supports something called pluggable look and feel, which means that the
appearance of the UI can be dynamically changed to suit the expectations of users working under
different platforms and operating systems. It's even possible to invent your own look and feel.
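For instance, wrapping a component in a JScrollPane is essentially a one-liner (a minimal sketch; the class and method names here are our own, not from the project code):

```java
// Minimal sketch: effortless scrolling by wrapping a component in a JScrollPane.
import javax.swing.JScrollPane;
import javax.swing.JTextArea;

public class ScrollDemo {
    public static JScrollPane makeScrollableTextArea() {
        JTextArea area = new JTextArea(30, 40);      // 30 rows, 40 columns
        area.setToolTipText("Scrollable text area"); // tool tips: one line of code
        return new JScrollPane(area);                // wrap it; scrolling is automatic
    }

    public static void main(String[] args) {
        JScrollPane pane = makeScrollableTextArea();
        System.out.println(pane.getViewport().getView().getClass().getSimpleName());
    }
}
```

The JScrollPane manages the scroll bars itself; the wrapped JTextArea needs no extra scrolling code.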

Domain Description:
Data mining involves the use of sophisticated data analysis tools to discover previously
unknown, valid patterns and relationships in large data sets. These tools can include statistical
models, mathematical algorithms, and machine learning methods (algorithms that improve their
performance automatically through experience, such as neural networks or decision trees).
Consequently, data mining consists of more than collecting and managing data; it also includes
analysis and prediction.
Data mining can be performed on data represented in quantitative, textual, or multimedia
forms. Data mining applications can use a variety of parameters to examine the data. They
include association (patterns where one event is connected to another event, such as purchasing a
pen and purchasing paper), sequence or path analysis (patterns where one event leads to another
event, such as the birth of a child and purchasing diapers), classification (identification of new
patterns, such as coincidences between duct tape purchases and plastic sheeting purchases),
clustering (finding and visually documenting groups of previously unknown facts, such as
geographic location and brand preferences), and forecasting (discovering patterns from which
one can make reasonable predictions regarding future activities, such as the prediction that
people who join an athletic club may take exercise classes).

Figure 4.3: Knowledge discovery process


Data Mining Uses:
Data mining is used for a variety of purposes in both the private and public sectors.
Industries such as banking, insurance, medicine, and retailing commonly use data mining
to reduce costs, enhance research, and increase sales. For example, the insurance and
banking industries use data mining applications to detect fraud and assist in risk
assessment (e.g., credit scoring).
Using customer data collected over several years, companies can develop models that
predict whether a customer is a good credit risk, or whether an accident claim may be
fraudulent and should be investigated more closely.

The medical community sometimes uses data mining to help predict the effectiveness of a
procedure or medicine.
Pharmaceutical firms use data mining of chemical compounds and genetic material to
help guide research on new treatments for diseases.
Retailers can use information collected through affinity programs (e.g., shoppers' club cards,
frequent-flyer points, contests) to assess the effectiveness of product selection and placement
decisions, coupon offers, and which products are often purchased together.

DESIGN ANALYSIS
UML Diagrams:
UML is a method for describing a system's architecture in detail using blueprints.
UML represents a collection of best engineering practices that have proven successful in the
modeling of large and complex systems.
UML is a very important part of developing object-oriented software and the software
development process.
UML uses mostly graphical notations to express the design of software projects.
Using the UML helps project teams communicate, explore potential designs, and validate
the architectural design of the software.
Definition:
UML is a general-purpose visual modeling language that is used to specify, visualize, construct,
and document the artifacts of the software system.
UML is a language:
It provides a vocabulary and rules for communicating about the conceptual and physical
representations of a system; hence it is a modeling language.
UML Specifying:
Specifying means building models that are precise, unambiguous, and complete. In particular, the
UML addresses the specification of all the important analysis, design, and implementation
decisions that must be made in developing and deploying a software-intensive system.
UML Visualization:
The UML includes both graphical and textual representations. This makes it easy to visualize the
system and aids better understanding.

UML Constructing:
UML models can be directly connected to a variety of programming languages. The UML is
sufficiently expressive and free from ambiguity to permit the direct execution of models.
UML Documenting:
UML provides a variety of documents in addition to raw executable code.

Figure 3.4 Modeling a System Architecture using views of UML

The use case view of a system encompasses the use cases that describe the behavior of the
system as seen by its end users, analysts, and testers.
The design view of a system encompasses the classes, interfaces, and collaborations that form the
vocabulary of the problem and its solution.
The process view of a system encompasses the threads and processes that form the system's
concurrency and synchronization mechanisms.
The implementation view of a system encompasses the components and files that are used to
assemble and release the physical system.
The deployment view of a system encompasses the nodes that form the system's hardware
topology on which the system executes.

Uses of UML :
The UML is intended primarily for software-intensive systems. It has been used
effectively for such domains as
Enterprise Information System
Banking and Financial Services
Telecommunications
Transportation
Defense/Aerospace
Retail
Medical Electronics

Scientific Fields
Distributed Web
Building blocks of UML:
The vocabulary of the UML encompasses 3 kinds of building blocks
Things
Relationships
Diagrams
Things:
Things are the abstractions that are first-class citizens in a model. Things are of four types:
Structural things, Behavioral things, Grouping things, Annotational things

Relationships:
Relationships tie the things together. Relationships in the UML are
Dependency, Association, Generalization, Realization
UML Diagrams:
A diagram is the graphical presentation of a set of elements, most often rendered as a connected
graph of vertices (things) and arcs (relationships).
There are two types of diagrams, they are:
Structural and Behavioral Diagrams
Structural Diagrams:
The UML's four structural diagrams exist to visualize, specify, construct, and document
the static aspects of a system. We can view the static parts of a system using one of the following
diagrams. Structural diagrams consist of the Class Diagram, Object Diagram, Component
Diagram, and Deployment Diagram.
Behavioral Diagrams:
The UML's five behavioral diagrams are used to visualize, specify, construct, and
document the dynamic aspects of a system. The UML's behavioral diagrams are roughly
organized around the major ways in which the dynamics of a system can be modeled.
Behavioral diagrams consist of the Use Case Diagram, Sequence Diagram, Collaboration
Diagram, State Chart Diagram, and Activity Diagram.

3.2.1 Use-Case diagram:


A use case is a set of scenarios describing an interaction between a user and a
system. A use case diagram displays the relationships among actors and use cases. The two main
components of a use case diagram are use cases and actors.

An actor represents a user or another system that will interact with the system you are
modeling. A use case is an external view of the system that represents some action the user
might perform in order to complete a task.
Contents:
Use cases; Actors; Dependency, Generalization, and Association relationships; System boundary

[Use case diagram: an actor interacts with the use cases select path, process, histogram,
clusters, similarity, and result.]

3.2.2 Class Diagram:


Class diagrams are widely used to describe the types of objects in a system and their
relationships. Class diagrams model class structure and contents using design elements such as
classes, packages, and objects. Class diagrams describe three different perspectives when
designing a system: conceptual, specification, and implementation. These perspectives become
evident as the diagram is created and help solidify the design. Class diagrams are arguably the
most used UML diagram type. The class diagram is the main building block of any
object-oriented solution. It shows the classes in a system, the attributes and operations of each
class, and the relationships between classes. In most modeling tools a class has three parts: name
at the top, attributes in the middle, and operations or methods at the bottom. In large systems
with many classes, related classes are grouped together to create class diagrams. Different
relationships between classes are shown by different types of arrows.

UML Class Diagram with Relationships

[Class diagram: classes select file (+file()), Process (+process()), Histogram (+histogram()),
Clusters (+cluster()), Similarity (+similarity()), and result.]

Sequence Diagram
Sequence diagrams in UML show how objects interact with each other and the order in which
those interactions occur. It is important to note that they show the interactions for a particular
scenario. The processes are represented vertically and interactions are shown as arrows.

[Sequence diagram over the objects select file, process, histogram, clusters, similarity, and
result, with messages 1: select file(), 2: process the file(), 3: divide histograms(),
4: divide clusters(), 5: no of similarities(), 6: result().]

Collaboration diagram
The communication diagram was called a collaboration diagram in UML 1. It is similar to a
sequence diagram, but the focus is on the messages passed between objects. The same
information can be represented using a sequence diagram and different objects.

[Collaboration diagram over the objects select file, process, histogram, clusters, similarity,
and result.]

State machine diagrams

State machine diagrams are similar to activity diagrams, although the notation and usage change
a bit. They are sometimes known as state diagrams or state chart diagrams as well. They are very
useful for describing the behavior of objects that act differently according to the state they are in
at the moment. The state machine diagram below shows the basic states and actions.

[State machine diagram: select file → process → histograms → clusters → similarity → result.]

State machine diagram in UML, sometimes referred to as a State or State Chart diagram

3.2.3 Activity Diagram:

Activity diagrams describe the workflow behavior of a system. Activity diagrams are
similar to state diagrams because activities are the state of doing something. The diagrams
describe the state of activities by showing the sequence of activities performed. Activity
diagrams can show activities that are conditional or parallel.
How to Draw: Activity Diagrams
Activity diagrams show the flow of activities through the system. Diagrams are read
from top to bottom and have branches and forks to describe conditions and parallel activities. A
fork is used when multiple activities are occurring at the same time. The diagram below shows a
fork after activity1. This indicates that both activity2 and activity3 are occurring at the same
time. After activity2 there is a branch. The branch describes what activities will take place
based on a set of conditions. All branches at some point are followed by a merge to indicate the
end of the conditional behavior started by that branch. After the merge all of the parallel
activities must be combined by a join before transitioning into the final activity state.
When to Use: Activity Diagrams
Activity diagrams should be used in conjunction with other modeling techniques such
as interaction diagrams and state diagrams. The main reason to use activity diagrams is to model
the workflow behind the system being designed. Activity Diagrams are also useful for:
analyzing a use case by describing what actions need to take place and when they should
occur; describing a complicated sequential algorithm; and modeling applications with parallel
processes.

[Activity diagram: select file → process → histogram → clusters → similarity → result.]

Component diagram
A component diagram displays the structural relationships among the components of a software
system. These are mostly used when working with complex systems that have many components.
Components communicate with each other using interfaces, and the interfaces are linked using
connectors.

Deployment Diagram
A deployment diagram shows the hardware of your system and the software deployed on that
hardware. Deployment diagrams are useful when your software solution is deployed across
multiple machines, each with a unique configuration.


SAMPLE CODE

//Bit.java
import java.io.*;
import java.lang.*;
import java.util.*;

//////////////////// Bit class: elementary bit operations
class Bit
{
    // Integer exponentiation: returns tBase raised to tExponent.
    public static int Power(int tBase, int tExponent)
    {
        int tAns = 1, t;
        for (t = 1; t <= tExponent; t++)
        {
            tAns = tAns * tBase;
        }
        return (tAns);
    }

    // Returns the bit (0 or 1) of tValue at position tPos.
    public static int GetBit(int tValue, int tPos)
    {
        int tBit = 0;
        tBit = tValue & Power(2, tPos);
        if (tBit > 0) tBit = 1;
        return (tBit);
    }

    // Converts tValue to a binary string of tLength bits,
    // e.g. DecToBin(5, 4) returns "0101".
    public static String DecToBin(int tValue, int tLength)
    {
        String tBitStr = "";
        int t;
        for (t = 0; t <= tLength - 1; t++)
        {
            tBitStr = GetBit(tValue, t) + tBitStr;
        }
        return (tBitStr);
    }

    // Counts the 1-bits in the tLength-bit representation of tValue.
    public static int GetBitOnCount(int tValue, int tLength)
    {
        String tBitStr;
        int t, tCount = 0;
        tBitStr = DecToBin(tValue, tLength);
        for (t = 1; t <= tLength; t++)
        {
            if (tBitStr.substring(t - 1, t).equals("1"))
            {
                tCount = tCount + 1;
            }
        }
        return (tCount);
    }
}
//Dict.java
import java.io.*;
import java.lang.*;
import java.util.*;

// Dictionary loaded from the GNU Collaborative International
// Dictionary of English (GCIDE) word lists.
class Dict
{
    String words[];   // all dictionary words
    int nwords;       // number of words loaded
    int iwords[];     // per-letter word counts (a..z)

    //constructor
    Dict()
    {
        int maxWords = 150000;
        nwords = 0;
        words = new String[maxWords];
        iwords = new int[26];
        for (int t = 0; t < 26; t++) iwords[t] = 0;
    }

    //methods
    // Reads the word lists words_a.txt .. words_z.txt from the dict\gcide folder.
    public void read_dictionary()
    {
        try
        {
            //System.out.println("Reading...");
            //System.out.println("GNU Collaborative International Dictionary of English (GCIDE)\n");
            for (int i = 0; i < 26; i++)
            {
                String tfpath = "dict\\gcide\\words_" + (char)(97 + i) + ".txt";
                FileInputStream fin;
                fin = new FileInputStream(tfpath);
                int ch = 0;
                String tmp = "";
                while ((ch = fin.read()) != -1)
                {
                    if (ch == 13) // carriage return ends a word (CRLF line endings)
                    {
                        addWord(tmp, i);
                        tmp = "";
                        fin.read(); // skip the line feed
                        continue;
                    }
                    tmp += (char)ch;
                }
                //System.out.println("gcide_" + (char)(97 + i) + ": " + iwords[i] + " words");
            }
        }
        catch (Exception e)
        {
            //System.out.println("Error: " + e.getMessage());
        }
    }

    public void addWord(String tword, int alphabetIndex)
    {
        words[nwords] = tword;
        nwords++;
        iwords[alphabetIndex]++;
    }

    // Linear search for tword (lowercased before comparison).
    public boolean isWord(String tword)
    {
        boolean flag = false;
        tword = tword.toLowerCase();
        for (int t = 0; t < nwords; t++)
        {
            if (tword.compareTo(words[t]) == 0)
            {
                flag = true;
                break;
            }
        }
        return (flag);
    }

    public String toString()
    {
        String tstr = "";
        tstr = "\nTotal: " + nwords + " words";
        return (tstr);
    }
}

//DocumentIndexGraph.java
import java.lang.*;
import java.io.*;

// Document index graph: vertices V are words, edges E are ordered word pairs.
public class DocumentIndexGraph
{
    Itemset V;            // vertex set
    ItemsetCollection E;  // edge set

    //constructor
    public DocumentIndexGraph()
    {
        V = new Itemset();
        E = new ItemsetCollection();
    }

    //get functions
    public Itemset getV()
    {
        return (V);
    }
    public ItemsetCollection getE()
    {
        return (E);
    }

    //set functions
    public void setV(Itemset tItemset)
    {
        V.clear();
        V.appendItemset(tItemset);
    }
    public void setE(ItemsetCollection tItemsetCollection)
    {
        E.clear();
        E.appendItemsetCollection(tItemsetCollection);
    }

    //methods
    public void addNode(String tWord)
    {
        V.addItem(tWord);
    }
    public void addEdge(Itemset tEdge)
    {
        E.addItemset(tEdge);
    }

    // True if (str1, str2) is an edge in E (case-insensitive match).
    public boolean isEdge(String str1, String str2)
    {
        boolean flag = false;
        for (int t = 0; t <= E.get_nItemsets() - 1; t++)
        {
            String tstr1 = E.getItemset(t).getItem(0);
            String tstr2 = E.getItemset(t).getItem(1);
            if (str1.compareToIgnoreCase(tstr1) == 0 && str2.compareToIgnoreCase(tstr2) == 0)
            {
                flag = true;
                break;
            }
        }
        return (flag);
    }

    // True if every consecutive word pair of the phrase is an edge.
    public boolean isPath(String str)
    {
        String tarr[] = StringUtils.split(str, " ");
        boolean flag = true;
        for (int t = 0; t <= tarr.length - 2; t++)
        {
            if (isEdge(tarr[t], tarr[t + 1]) == false)
            {
                flag = false;
                break;
            }
        }
        return (flag);
    }

    // Number of the phrase's consecutive word pairs that are edges,
    // normalized by the number of vertices.
    public double findPhrasePathWeight(String str)
    {
        String tarr[] = StringUtils.split(str, " ");
        int tCount = 0;
        for (int t = 0; t <= tarr.length - 2; t++)
        {
            if (isEdge(tarr[t], tarr[t + 1]) == true)
            {
                tCount += 1;
            }
        }
        double weight = (double)tCount / (double)(V.get_nItems());
        return (weight);
    }
}
//Hierarchical Clustering.java
import java.io.*;
import java.net.*;
import java.awt.*;
import java.awt.event.*;
import java.util.*;
import javax.swing.*;
import javax.swing.filechooser.*;
import org.jfree.ui.RefineryUtilities;

class Hier extends JFrame implements ActionListener
{
    // Application frames
    JFrame frmRootPath = new JFrame("Root Path : Clustering with Multi-Viewpoint based Similarity Measure");
    JFrame frmButton = new JFrame("Functions : Clustering with Multi-Viewpoint based Similarity Measure");
    JFrame frmResult = new JFrame("Result : Clustering with Multi-Viewpoint based Similarity Measure");

    // Root path frame controls
    JLabel lblRootPath = new JLabel("RootPath:");
    JList lstRootPath = new JList();
    JScrollPane spLinks = new JScrollPane(lstRootPath);

    // Function buttons and working collections
    JButton btProcess = new JButton("Process");
    ItemsetCollection Similarities = new ItemsetCollection();
    JButton btHistogram = new JButton("Histogram");
    JButton btCluster = new JButton("Clusters");
    JButton btSimilarity = new JButton("Similarity");
    ItemsetCollection Hist = new ItemsetCollection();

    // Result frame controls
    JLabel lblResult = new JLabel("Result:");
    JTextArea txtResult = new JTextArea("");
    JScrollPane spResult = new JScrollPane(txtResult);

    //system parameters
    double simalpha = 0.6;
    WebPageRetrieval tweb = new WebPageRetrieval();
    Dict dict = new Dict();
    double[][][] sim, sim_perc;
    HTML_Parser parser1 = new HTML_Parser();
    Queue frontier = new Queue();
    int maxPages = 1000;
    String[] visitedPages = new String[maxPages];
    int nVisited = 0;
    String logPath = "visitlog.txt";
    String logText = "";

    //init documents
    int nDocuments;
    WebDocument documents[];
    WebDocument CumulativeDocument;
    ItemsetCollection Clusters = new ItemsetCollection();

    Hier()
    {
        //Root path frame
        frmRootPath.setDefaultLookAndFeelDecorated(true);
        frmRootPath.setResizable(false);
        frmRootPath.setBounds(50, 50, 400, 400);
        frmRootPath.getContentPane().setLayout(null);
        //Functions frame
        frmButton.setDefaultLookAndFeelDecorated(true);
        frmButton.setResizable(false);
        frmButton.setBounds(50, 50, 201, 380);
        frmButton.getContentPane().setLayout(null);
        //Result frame
        frmResult.setDefaultLookAndFeelDecorated(true);
        frmResult.setResizable(false);
        frmResult.setBounds(50, 50, 600, 580);
        frmResult.getContentPane().setLayout(null);
        //Root path design
        lblRootPath.setBounds(50, 15, 100, 20);
        frmRootPath.getContentPane().add(lblRootPath);
        spLinks.setBounds(48, 35, 270, 200);
        frmRootPath.getContentPane().add(spLinks);
        //Process button design
        btProcess.setBounds(50, 65, 100, 20);
        btProcess.addActionListener(this);
        frmButton.getContentPane().add(btProcess);
        //Histogram button design
        btHistogram.setBounds(50, 125, 100, 20);
        btHistogram.addActionListener(this);
        frmButton.getContentPane().add(btHistogram);
        //Cluster button design
        btCluster.setBounds(50, 185, 100, 20);
        btCluster.addActionListener(this);
        frmButton.getContentPane().add(btCluster);
        //Similarity button design
        btSimilarity.setBounds(50, 245, 100, 20);
        btSimilarity.addActionListener(this);
        frmButton.getContentPane().add(btSimilarity);
        //Result design
        lblResult.setBounds(17, 35, 100, 20);
        frmResult.getContentPane().add(lblResult);
spResult.setBounds(15,55,540,450);
frmResult.getContentPane().add(spResult);
txtResult.setEditable(false);
//initialize lstRootPath
FileSystemView fv=FileSystemView.getFileSystemView();
File files[]=fv.getFiles(new File("data"),true);
Vector tvector=new Vector();
for(int t=0;t<files.length;t++)

{
String tFileName=fv.getSystemDisplayName(files[t]);
tvector.add(tFileName);
}
lstRootPath.setSelectionMode(ListSelectionModel.SINGLE_SELECTION);
lstRootPath.setListData(tvector);
lstRootPath.setSelectedIndex(0);
frmRootPath.setVisible(true);
frmButton.setVisible(true);
frmResult.setVisible(true);
}

public void actionPerformed(ActionEvent evt)


{
if(evt.getSource()==btProcess)
{
process();
}
if(evt.getSource()==btHistogram)
{
Histogram();
}
if(evt.getSource()==btCluster)
{
Cluster();
}

if(evt.getSource()==btSimilarity)
{
Similarity();
}
}
public void process()
{
try
{
dict.read_dictionary();
//starting-urls
String tRootPath=(String)lstRootPath.getSelectedValue();
frontier.enqueue(tRootPath);
//breadth-first-search
nVisited=0;
txtResult.setText("");
logText="";
while(nVisited<maxPages&&frontier.isEmpty()==false)
{
String tstrFrontier=frontier.toString();
String tPath=frontier.dequeue();
if(isVisitedPage(tPath)==false)
{
logText+="Frontier: "+tstrFrontier+"\n";
addVisitedPage(tPath);
logText+="Downloading ["+tPath+"]..."+"\n";

parser1.setFilePath("data\\"+tPath);
Queue q=parser1.findLinks();
logText+="ExtractedLinks: "+q.toString()+"\n";
printVisitedPages();
logText+="\n";
while(q.isEmpty()==false)
{
String tPath1=q.dequeue();
if(isVisitedPage(tPath1)==false)
frontier.enqueue(tPath1);
}
}
}
//write visitlog
FileOutputStream foutlog=new FileOutputStream(logPath);
foutlog.write(logText.getBytes());
foutlog.close();
//construct webdocuments and its DIG
nDocuments=nVisited;
documents=new WebDocument[nDocuments];
CumulativeDocument=new WebDocument();
//find metas and construct cumulative document index graph
addResultText(" Clustering with Multi-Viewpoint based Similarity Measure:\n\n");
ItemsetCollection icWords=new ItemsetCollection();
ItemsetCollection icEdges=new ItemsetCollection();
for(int t=0;t<nDocuments;t++) //for each document
{

documents[t]=new WebDocument();
documents[t].setDocName(visitedPages[t]);
parser1.setFilePath("data\\"+visitedPages[t]);
Queue q=parser1.findMetas(); //get meta-data
String tstr=q.toString();
tstr=StringUtils.replaceString(tstr,",","");
tstr=StringUtils.replaceString(tstr,"{","");
tstr=StringUtils.replaceString(tstr,"}","");
//get unique words in this document
String tarr[]=StringUtils.split(tstr," ");
Itemset tItemset=new Itemset(tarr);
ItemsetCollection ic1=new ItemsetCollection(tItemset);
tItemset=ic1.getUniqueItemset();
simalpha=0.3;
//suppress non-dictionary words
for(int t1=0;t1<tItemset.get_nItems();t1++)
{
if(dict.isWord(tItemset.getItem(t1))==false)
{
//tItemset.removeItem(t1);
}
}
icWords.addItemset(tItemset);
documents[t].DIG.setV(tItemset);
//get unique edges in this document
tstr=q.toString();
tstr=StringUtils.replaceString(tstr,"{","");
tstr=StringUtils.replaceString(tstr,"}","");
tarr=StringUtils.split(tstr,", ");
for(int j=0;j<tarr.length;j++)
{

documents[t].addPhrase(tarr[j]);
CumulativeDocument.addPhrase(tarr[j]);
String[] tarr1=StringUtils.split(tarr[j]," ");
if(tarr1.length>1)
{
for(int k=0;k<=tarr1.length-2;k++)
{
//for(int k1=k+1;k1<=tarr1.length-1;k1++) //if word-(k+1) appears before word-k
//{
Itemset i1=new Itemset(); //if word-(k+1) appears next to word-k
i1.addItem(tarr1[k]);
i1.addItem(tarr1[k+1]);
icEdges.addItemset(i1);
documents[t].DIG.addEdge(i1);
//}
}
}
}
}
//set graph nodes and edges
for(int t=0;t<nDocuments;t++)
{
ItemsetCollection ic1=new ItemsetCollection();
ic1=documents[t].DIG.getE();
documents[t].DIG.setE(ic1.getUniqueItemsetCollection());
}
CumulativeDocument.DIG.setV(icWords.getUniqueItemset());
CumulativeDocument.DIG.setE(icEdges.getUniqueItemsetCollection());

//show each document phrases and dig


for(int t=0;t<nDocuments;t++)
{
addResultText("Document"+t+": "+documents[t].getDocName()+"\n");
addResultText("Phrases:\n"+documents[t].getPhrases().toString()+"\n");
addResultText("Nodes:\n"+documents[t].getDIG().getV().toString()+"\n");
addResultText("Edges:\n"+documents[t].getDIG().getE().toString1()+"\n");
}
//set cumulative dig
addResultText("\nCumulative DIG:\n");
addResultText("Phrases:\n"+CumulativeDocument.getPhrases().toString()+"\n");
addResultText("Nodes:\n"+CumulativeDocument.DIG.getV().toString());
addResultText("\nEdges:\n"+CumulativeDocument.DIG.getE().toString1());
//initialize clusters
//ItemsetCollection Clusters=new ItemsetCollection();
for(int t=0;t<nDocuments;t++)
{
Clusters.addItemset(new Itemset(""+t));
}
//construct histogram
//ItemsetCollection Hist=new ItemsetCollection();

double HRmin=1.0;
double HRmax=0.0;
/*for(int t=0;t<=nDocuments-2;t++)
{
for(int j=t+1;j<=nDocuments-1;j++)
{
double tsim=findSimilarity(documents[t],documents[j]);
if(HRmin>tsim) HRmin=tsim;
if(HRmax<tsim) HRmax=tsim;
}
}*/
//clustering
addResultText("\nSimilarities and its Corresponding OLP:\n");
Similarities=new ItemsetCollection();
double similarityThreshold=0.3;
sim=new double[nDocuments][nDocuments][1];
sim_perc=new double[nDocuments][nDocuments][1];
for(int t=0;t<=nDocuments-1;t++)
{
for(int j=0;j<=nDocuments-1;j++)
{
double hratio=findSimilarity(documents[t],documents[j]);
Itemset i1=new Itemset();
i1.addItem(""+t);
i1.addItem(""+j);
i1.addItem(""+hratio);
Similarities.addItemset(i1);
addResultText(" sim("+t+","+j+") : "+hratio+"\n");
sim[t][j][0]=hratio;
sim_perc[t][j][0]=hratio*100;
if(hratio>=similarityThreshold)
{
String tstr1=""+t;
String tstr2=""+j;
int tNewClusterIndex=-1;
int tOldClusterIndex=-1;
for(int i=0;i<=Clusters.get_nItemsets()-1;i++)
{
if(Clusters.getItemset(i).isContains(tstr1)==true)
{
tNewClusterIndex=i;
}
if(Clusters.getItemset(i).isContains(tstr2)==true)
{
tOldClusterIndex=i;
}
}
if(tNewClusterIndex!=-1&&tOldClusterIndex!=-1)
{
Clusters.getItemset(tOldClusterIndex).removeItem(tstr2);
Clusters.getItemset(tNewClusterIndex).addItem(tstr2);
}
}
}

}
}
catch(IOException e)
{
System.out.println(e);
}
}
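The clustering loop inside process() starts with one singleton cluster per document and, whenever sim(i, j) reaches the threshold, moves document j into the cluster that contains document i. The sketch below restates that policy in isolation; the similarity matrix is made up for illustration, and a from != to guard is added so a document is not moved within its own cluster:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThresholdClusterSketch {
    // Move j into i's cluster whenever sim[i][j] >= threshold.
    static List<List<Integer>> cluster(double[][] sim, double threshold) {
        int n = sim.length;
        // Start with one singleton cluster per document.
        List<List<Integer>> clusters = new ArrayList<>();
        for (int t = 0; t < n; t++) clusters.add(new ArrayList<>(Arrays.asList(t)));
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i != j && sim[i][j] >= threshold) {
                    // Locate the clusters currently holding i and j.
                    List<Integer> from = null, to = null;
                    for (List<Integer> c : clusters) {
                        if (c.contains(i)) to = c;
                        if (c.contains(j)) from = c;
                    }
                    if (from != null && to != null && from != to) {
                        from.remove(Integer.valueOf(j));
                        to.add(j);
                    }
                }
            }
        }
        clusters.removeIf(List::isEmpty); // drop clusters emptied by the moves
        return clusters;
    }

    public static void main(String[] args) {
        double[][] sim = {
            {1.0, 0.5, 0.1},
            {0.5, 1.0, 0.1},
            {0.1, 0.1, 1.0}
        };
        System.out.println(cluster(sim, 0.3)); // [[0, 1], [2]]
    }
}
```

Documents 0 and 1 merge because their similarity (0.5) exceeds the 0.3 threshold, while document 2 stays a singleton.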
//display histogram
public void Histogram()
{
try
{
for(int i=0;i<nDocuments;i++)
{
Histogram hist = new Histogram("Document "+i+" Similarity",sim_perc[i]);
hist.pack();
RefineryUtilities.centerFrameOnScreen(hist);
hist.setVisible(true);
}
/*txtResult.setText("");
//ItemsetCollection Hist=new ItemsetCollection();
//ItemsetCollection Similarities=new ItemsetCollection();
addResultText("\nHistogram:\n");
double tstart=0.0f;
double tinterval=0.1f;
for(int t=0;t<10;t++)
{
int tCount=0;
for(int j=0;j<=Similarities.get_nItemsets()-1;j++)
{

double tsim=Double.parseDouble(Similarities.getItemset(j).getItem(2));
if(tsim>=tstart&&tsim<=tstart+tinterval)
{
tCount++;
}
}
Hist.addItemset(new Itemset(""+tCount));
addResultText("("+tstart+","+(tstart+tinterval)+"): "+Hist.getItemset(Hist.get_nItemsets()-1)+"\n");
tstart+=tinterval;
}*/
}
catch(Exception e)
{
System.out.println(e);
}
}
//display clusters
public void Cluster()
{
try{
txtResult.setText("");
//ItemsetCollection Clusters=new ItemsetCollection();
addResultText("\nClusters With the Obtained OLP :\n");
int nClusters=0;
for(int t=0;t<=Clusters.get_nItemsets()-1;t++)
{
if(Clusters.getItemset(t).get_nItems()!=0)
{

addResultText("Cluster"+(nClusters+1)+": "+Clusters.getItemset(t).toString()+"\n");
//Clusters.getItemset(tNewClusterIndex).addItem(tstr2);
nClusters+=1;
}
}

}
catch(Exception e)
{
}
}
public void Similarity()
{
try
{
txtResult.setText("");
ItemsetCollection Clusters=new ItemsetCollection();
addResultText("\nSimilarities and its Corresponding OLP:\n");
ItemsetCollection Similarities=new ItemsetCollection();
double similarityThreshold=0.3;
for(int t=0;t<=nDocuments-1;t++)
{
for(int j=0;j<=nDocuments-1;j++)
{
double hratio=findSimilarity(documents[t],documents[j]);
Itemset i1=new Itemset();
i1.addItem(""+t);

i1.addItem(""+j);
i1.addItem(""+hratio);
Similarities.addItemset(i1);
addResultText(" sim("+t+","+j+") : "+hratio+"\n");
if(hratio>=similarityThreshold)
{
String tstr1=""+t;
String tstr2=""+j;
int tNewClusterIndex=-1;
int tOldClusterIndex=-1;
for(int i=0;i<=Clusters.get_nItemsets()-1;i++)
{
if(Clusters.getItemset(i).isContains(tstr1)==true)
{
tNewClusterIndex=i;
}
if(Clusters.getItemset(i).isContains(tstr2)==true)
{
tOldClusterIndex=i;
}
}
if(tNewClusterIndex!=-1&&tOldClusterIndex!=-1)
{
Clusters.getItemset(tOldClusterIndex).removeItem(tstr2);

Clusters.getItemset(tNewClusterIndex).addItem(tstr2);
}
}
}
}
}
catch(Exception e)
{
}
}

double findSimilarity(WebDocument d1,WebDocument d2)


{
double simp=findPhraseSimilarity(d1,d2);
double simt=findTermSimilarity(d1,d2);
double sim=(simalpha*simp)+((1.0-simalpha)*simt);
return(sim);
}
double findPhraseSimilarity(WebDocument d1,WebDocument d2)
{
WebDocument doc1=CombineDocument(d1,d2);
//find sigmaj
double sigmaj=0.0;
for(int t=0;t<d1.getPhrases().get_nItems();t++)

{
double s1j=StringUtils.split(d1.getPhrase(t)," ").length;
double tweight=doc1.DIG.findPhrasePathWeight(d1.getPhrase(t));
sigmaj+=s1j*tweight;
}
//find sigmak
double sigmak=0.0;
for(int t=0;t<d2.getPhrases().get_nItems();t++)
{
double s2k=StringUtils.split(d2.getPhrase(t)," ").length;
double tweight=doc1.DIG.findPhrasePathWeight(d2.getPhrase(t));
sigmak+=s2k*tweight;
}
double fragmentationFactor=1.2; //proposed constant
//find sigmap
double sigmap=0.0;
for(int t=0;t<doc1.getPhrases().get_nItems();t++)
{
double li=StringUtils.split(doc1.getPhrase(t)," ").length;
double si=doc1.getPhrases().get_nItems();
double gi=java.lang.Math.pow(li/si,fragmentationFactor);
double f1i=d1.findPhraseFrequency(doc1.getPhrase(t));
double w1i=doc1.DIG.findPhrasePathWeight(doc1.getPhrase(t));
double f2i=d2.findPhraseFrequency(doc1.getPhrase(t));
double w2i=doc1.DIG.findPhrasePathWeight(doc1.getPhrase(t));
double tsum=(f1i*w1i)+(f2i*w2i); //frequency-weight product for each document
sigmap+=java.lang.Math.pow(gi*tsum,2.0);
}

//find sim_p
double simp=java.lang.Math.sqrt(sigmap);
simp/=(sigmaj+sigmak);
return(simp);
}
double findTermSimilarity(WebDocument d1,WebDocument d2)
{
WebDocument doc1=CombineDocument(d1,d2);
double sigma1=0.0;
double sigma21=0.0,sigma22=0.0;
for(int t=0;t<doc1.DIG.V.get_nItems();t++)
{
double tfidf1=findTFIDF(doc1.DIG.V.getItem(t),d1);
double tfidf2=findTFIDF(doc1.DIG.V.getItem(t),d2);
sigma1+=tfidf1*tfidf2;
sigma21+=tfidf1*tfidf1;
sigma22+=tfidf2*tfidf2;
}
//cosine similarity
double simt=sigma1/java.lang.Math.sqrt(sigma21*sigma22);
return(simt);
}
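findTermSimilarity above is the standard cosine similarity over TF-IDF weights: the dot product of the two weight vectors divided by the product of their norms. A self-contained sketch over plain vectors (class and method names are illustrative):

```java
public class CosineSketch {
    // cosine(a, b) = (a . b) / (|a| * |b|)
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];   // accumulate dot product
            na += a[i] * a[i];    // squared norm of a
            nb += b[i] * b[i];    // squared norm of b
        }
        return dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        // Same direction -> 1.0; orthogonal -> 0.0
        System.out.println(cosine(new double[]{1, 2}, new double[]{2, 4})); // 1.0
        System.out.println(cosine(new double[]{1, 0}, new double[]{0, 1})); // 0.0
    }
}
```

Because cosine depends only on direction, proportional vectors score 1.0 regardless of their lengths.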
double findTFIDF(String term,WebDocument d1)
{
//find tf

double n1=d1.findTermFrequency(term);
double tsum=0.0;
for(int t=0;t<d1.DIG.V.get_nItems();t++)
{
tsum+=d1.findTermFrequency(d1.DIG.V.getItem(t));
}
double tf=n1/tsum;
//find idf
int tDocCount=0;
for(int t=0;t<nDocuments;t++)
{
if(documents[t].DIG.V.isContains(term)==true)
{
tDocCount+=1;
}
}
double tval=(double)nDocuments/(double)tDocCount;
double idf=java.lang.Math.log(tval);
double tfidf=tf*idf;
return(tfidf);
}
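findTFIDF above follows the usual definitions: tf is the term's count divided by the document's total term count, and idf is ln(N/df) over the corpus. A hedged standalone sketch with a two-document toy corpus (all names and data are illustrative; like the method above, it assumes every queried term occurs in at least one document, since df = 0 would divide by zero):

```java
import java.util.Arrays;
import java.util.List;

public class TfIdfSketch {
    // tf = count(term in doc) / total terms in doc; idf = ln(nDocs / docsContainingTerm)
    static double tfidf(String term, List<String> doc, List<List<String>> corpus) {
        long n = doc.stream().filter(term::equals).count();
        double tf = (double) n / doc.size();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        double idf = Math.log((double) corpus.size() / df);
        return tf * idf;
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("data", "mining", "data");
        List<String> d2 = Arrays.asList("text", "mining");
        List<List<String>> corpus = Arrays.asList(d1, d2);
        // "data" appears in 1 of 2 documents: tf = 2/3, idf = ln(2)
        System.out.println(tfidf("data", d1, corpus));
    }
}
```

Note that "mining" occurs in both documents, so its idf is ln(2/2) = 0 and its TF-IDF weight vanishes — frequent corpus-wide terms carry no discriminative weight.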
WebDocument CombineDocument(WebDocument d1,WebDocument d2)
{
//construct combined doc to find matching phrases
DocumentIndexGraph dig1=new DocumentIndexGraph();
dig1.V.appendItemset(d1.DIG.V);
dig1.V.appendItemset(d2.DIG.V);

ItemsetCollection ic1=new ItemsetCollection(dig1.V);


dig1.V=ic1.getUniqueItemset();
dig1.E.appendItemsetCollection(d1.DIG.E);
dig1.E.appendItemsetCollection(d2.DIG.E);
ic1=dig1.E;
dig1.E=ic1.getUniqueItemsetCollection();
WebDocument doc1=new WebDocument();
doc1.setDIG(dig1);
doc1.Phrases.appendItemset(d1.getPhrases());
doc1.Phrases.appendItemset(d2.getPhrases());
ic1=new ItemsetCollection(doc1.getPhrases());
doc1.setPhrases(ic1.getUniqueItemset());
return(doc1);
}
void addResultText(String tStr)
{
txtResult.append(tStr);
txtResult.updateUI();
}
private void addVisitedPage(String tStr)
{
if(isVisitedPage(tStr)==false)
{
visitedPages[nVisited]=tStr;
nVisited++;
}
}
private boolean isVisitedPage(String tStr)

{
boolean visited=false;
for(int t=0;t<nVisited;t++)
{
if(tStr.compareToIgnoreCase(visitedPages[t])==0)
{
visited=true;
}
}
return(visited);
}
private void printVisitedPages()
{
logText+="visited:"+"\n";
for(int t=0;t<nVisited;t++)
{
logText+="["+visitedPages[t]+"]"+"\n";
}
}
static public void main(String[] args)
{
try {
UIManager.setLookAndFeel("com.sun.java.swing.plaf.windows.WindowsLookAndFeel");
} catch (Exception e) {
e.printStackTrace();
}
new Hier();

}
}
//Histogram.java
import java.awt.*;
import org.jfree.chart.*;
import org.jfree.chart.axis.*;
import org.jfree.chart.plot.*;
import org.jfree.chart.renderer.category.*;
import org.jfree.data.category.*;
import org.jfree.data.general.*;
import org.jfree.ui.*;
/**
* A simple demonstration application showing how to create a bar chart.
*
*/
public class Histogram extends ApplicationFrame {
/**
* Creates a new demo instance.
*
* @param title the frame title.
*/
public Histogram(final String title,double[][] sim) {
super(title);
final CategoryDataset dataset = createDataset(sim);
final JFreeChart chart = createChart(title,dataset);
final ChartPanel chartPanel = new ChartPanel(chart);

chartPanel.setPreferredSize(new Dimension(500, 270));


setContentPane(chartPanel);
}
/**
* Returns a sample dataset.
*
* @return The dataset.
*/
private CategoryDataset createDataset(double[][] sim) {
// create the dataset...
return DatasetUtilities.createCategoryDataset("","",sim);

}
/**
* Creates a sample chart.
*
* @param dataset the dataset.
*
* @return The chart.
*/
private JFreeChart createChart(String title,final CategoryDataset dataset) {
// create the chart...
final JFreeChart chart = ChartFactory.createBarChart(
title,                    // chart title
"",                       // domain axis label
"Similarity",             // range axis label
dataset,                  // data
PlotOrientation.VERTICAL, // orientation
true,                     // include legend
true,                     // tooltips?
false                     // URLs?
);
// NOW DO SOME OPTIONAL CUSTOMISATION OF THE CHART...
// set the background color for the chart...
chart.setBackgroundPaint(Color.white);
// get a reference to the plot for further customisation...
final CategoryPlot plot = chart.getCategoryPlot();
plot.setBackgroundPaint(Color.lightGray);
plot.setDomainGridlinePaint(Color.white);
plot.setRangeGridlinePaint(Color.white);
// set the range axis to display integers only...
final NumberAxis rangeAxis = (NumberAxis) plot.getRangeAxis();
rangeAxis.setStandardTickUnits(NumberAxis.createIntegerTickUnits());
// disable bar outlines...
final BarRenderer renderer = (BarRenderer) plot.getRenderer();
renderer.setDrawBarOutline(false);
// set up gradient paints for series...
final GradientPaint gp0 = new GradientPaint(
0.0f, 0.0f, Color.blue,
0.0f, 0.0f, Color.lightGray

);
final GradientPaint gp1 = new GradientPaint(
0.0f, 0.0f, Color.green,
0.0f, 0.0f, Color.lightGray
);
final GradientPaint gp2 = new GradientPaint(
0.0f, 0.0f, Color.red,
0.0f, 0.0f, Color.lightGray
);
renderer.setSeriesPaint(0, gp0);
renderer.setSeriesPaint(1, gp1);
renderer.setSeriesPaint(2, gp2);
final CategoryAxis domainAxis = plot.getDomainAxis();
domainAxis.setCategoryLabelPositions(
CategoryLabelPositions.createUpRotationLabelPositions(Math.PI / 6.0)
);
// OPTIONAL CUSTOMISATION COMPLETED.
return chart;
}
// ****************************************************************************
// * JFREECHART DEVELOPER GUIDE                                               *
// * The JFreeChart Developer Guide, written by David Gilbert, is available   *
// * to purchase from Object Refinery Limited:                                *
// * http://www.object-refinery.com/jfreechart/guide.html                     *
// * Sales are used to provide funding for the JFreeChart project - please    *
// * support us so that we can continue developing free software.             *
// ****************************************************************************
/**
* Starting point for the demonstration application.
*
* @param args ignored.
*/
/*public static void main(final String[] args) {
final Histogram demo = new Histogram("Bar Chart Demo");
demo.pack();
RefineryUtilities.centerFrameOnScreen(demo);
demo.setVisible(true);
}*/
}
//ItemsetCollection.java
import java.lang.*;
import java.io.*;
import java.util.*;
////////////////////ItemsetCollection class
class ItemsetCollection
{
ArrayList Itemsets;

static boolean printStatus=false;


//constructors
public ItemsetCollection()
{
Itemsets=new ArrayList();
}
public ItemsetCollection(Itemset tItemset)
{
Itemsets=new ArrayList();
Itemsets.add(tItemset);
}
public ItemsetCollection(String[] tarr)
{
Itemsets=new ArrayList();
for(int t=0;t<tarr.length;t++)
{
Itemsets.add(new Itemset(tarr[t]));
}
}
//get functions
public int get_nItemsets()
{
return(Itemsets.size());
}
public Itemset getItemset(int tIndex)
{

Itemset tItemset=new Itemset();


if(tIndex>=0&&tIndex<=Itemsets.size()-1)
{
tItemset=(Itemset)Itemsets.get(tIndex);
}
return(tItemset);
}
//set functions
public void setItemsets(ItemsetCollection tItemsetCollection)
{
clear();
for(int t=0;t<=tItemsetCollection.get_nItemsets()-1;t++)
{
addItemset(tItemsetCollection.getItemset(t));
}
}
//methods
public void addItemset(Itemset tItemset)
{
Itemset i1=new Itemset();
for(int t=0;t<tItemset.get_nItems();t++) i1.addItem(tItemset.getItem(t));
Itemsets.add(i1);
}
public void appendItemsetCollection(ItemsetCollection tItemsetCollection)
{
int t;

for(t=0;t<=tItemsetCollection.get_nItemsets()-1;t++)
{
addItemset(tItemsetCollection.getItemset(t));
}
}
public void removeItemset(Itemset tItemset)
{
for(int i=0;i<=Itemsets.size()-1;i++)
{
if(getItemset(i).isEquals(tItemset)==true)
{
Itemsets.remove(i);
break;
}
}
}
public void removeItemset(int tIndex)
{
if(tIndex>=0&&tIndex<=Itemsets.size()-1)
{
removeItemset(getItemset(tIndex));
}
}
public void removeItemsetCollection(ItemsetCollection tItemsetCollection)
{
for(int t=0;t<=tItemsetCollection.get_nItemsets()-1;t++)
{
removeItemset(tItemsetCollection.getItemset(t));

}
}
public void removeEmptyItemsets()
{
for(int t=0;t<=Itemsets.size()-1;t++)
{
if(getItemset(t).get_nItems()==0)
{
removeItemset(t);
}
}
}
public void clear()
{
Itemsets.clear();
}
public Itemset getUniqueItemset()
{
Itemset tItemset=new Itemset();
for(int i=0;i<=Itemsets.size()-1;i++)
{
for(int j=0;j<=getItemset(i).get_nItems()-1;j++)
{
if(tItemset.isContains(getItemset(i).getItem(j))==false)
{
tItemset.addItem(getItemset(i).getItem(j));
}

}
}
return(tItemset);
}
public ItemsetCollection getUniqueItemsetCollection()
{
ItemsetCollection ic1=new ItemsetCollection();
for(int i=0;i<=Itemsets.size()-1;i++)
{
if(ic1.isContains(getItemset(i))==false)
{
ic1.addItemset(getItemset(i));
}
}
return(ic1);
}
public double getSupport(String tItem)
{
int t,tCount=0;
double tSupport;
for(t=0;t<=Itemsets.size()-1;t++)
{
if(getItemset(t).isContains(tItem)==true)
{
tCount=tCount+1;

}
}
tSupport=((double)tCount/(double)Itemsets.size())*100.0;
tSupport=Math.round(tSupport);
return(tSupport);
}
public double getSupport(Itemset tItemset)
{
int t,tCount=0;
double tSupport;
for(t=0;t<=Itemsets.size()-1;t++)
{
if(getItemset(t).isContains(tItemset)==true)
{
tCount=tCount+1;
}
}
tSupport=((double)tCount/(double)Itemsets.size())*100.0;
tSupport=Math.round(tSupport);
return(tSupport);
}
public int getSupportCount(Itemset tItemset)
{
int t,tCount=0;
for(t=0;t<=Itemsets.size()-1;t++)

{
if(getItemset(t).isContains(tItemset)==true)
{
tCount=tCount+1;
}
}
return(tCount);
}
public boolean isContains(Itemset tItemset)
{
boolean found=false;
for(int t=0;t<=Itemsets.size()-1;t++)
{
if(getItemset(t).isContains(tItemset)==true)
{
found=true;
break;
}
}
return(found);
}
public String toString()
{
String tStr="";
for(int t=0;t<=Itemsets.size()-1;t++)

{
tStr=tStr+getItemset(t).toString()+"\n\r\n\r";
if(printStatus==true)
{
System.out.print(t+" transactions, "+(tStr.length()/1024)+"k...\r");
}
}
return(tStr);
}
public String toString1()
{
String tStr="";
for(int t=0;t<=Itemsets.size()-1;t++)
{
tStr=tStr+getItemset(t).toString()+"\n";
if(printStatus==true)
{
System.out.print(t+" transactions, "+(tStr.length()/1024)+"k...\r");
}
}
return(tStr);
}
}

//WebPageRetrieval.java
import java.io.*;
import java.net.*;
class WebPageRetrieval
{
public static void openWebpage(String tstrURL) throws Exception
{
URL target=new URL(tstrURL);
URLConnection con=target.openConnection();
byte b[]=new byte[1028];
int n=0;
System.out.println("Reading: ["+tstrURL+"]:");
BufferedInputStream in=new BufferedInputStream(con.getInputStream(),8080);
while((n=in.read(b,0,1024))!=-1)
{
System.out.println(new String(b,0,n));
}
System.out.println("\nContentType: "+con.getContentType());
System.out.println("ContentLength: "+con.getContentLength());
}
public static void main(String args[]) throws Exception
{
openWebpage("http://www.yahoo.com/");
}
}

Screen shots

TESTING
Testing is the process of executing a program with the intent of finding errors. A good test
case is one that has a high probability of finding an as-yet undiscovered error; a successful test
is one that uncovers such an error. System testing is the stage of implementation aimed at
ensuring that the system works accurately and efficiently as expected before live operation
commences, and it verifies that the whole set of programs works together. System testing
consists of several key activities and steps covering program, string, and system testing, and is
important in adopting a successful new system. It is the last chance to detect and correct errors
before the system is installed for user acceptance testing.
The software testing process commences once the program is created and the
documentation and related data structures are designed. Software testing is essential for
correcting errors; without it, the program or the project cannot be said to be complete. It is the
critical element of software quality assurance and represents the ultimate review of
specification, design, and coding. Any engineering product can be tested in one of two ways:
6.1 Unit Testing:
6.1.1 White Box Testing:
This testing is also called glass box testing. By knowing the internal workings of a
product, tests can be conducted to ensure that its internal operations perform according to
specification and that all internal components have been adequately exercised. It is a test-case
design method that uses the control structure of the procedural design to derive test cases.
Basis path testing is a white box testing technique.
Basis path testing involves:
Flow graph notation
Cyclomatic complexity
Deriving test cases
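For a connected control-flow graph, cyclomatic complexity is V(G) = E − N + 2, which gives the number of independent basis paths that test cases should cover. A minimal sketch of the formula (the example edge and node counts are illustrative, not taken from this project's code):

```java
public class CyclomaticSketch {
    // V(G) = E - N + 2 for a single connected control-flow graph.
    static int cyclomaticComplexity(int edges, int nodes) {
        return edges - nodes + 2;
    }

    public static void main(String[] args) {
        // A flow graph with 9 edges and 7 nodes has V(G) = 4:
        // at most 4 independent paths are needed as basis test cases.
        System.out.println(cyclomaticComplexity(9, 7)); // 4
    }
}
```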


6.1.2 Black Box Testing:
In this testing, test cases are designed without knowledge of the internal operation of a
product; tests are conducted to ensure that each function is fully operational according to its
specification. It fundamentally focuses on the functional requirements of the software.
The steps involved in black box test case design are:
Graph-based testing methods
Equivalence partitioning
Boundary value analysis
Comparison testing
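Boundary value analysis, listed above, derives test inputs at and just beyond the edges of an input range. A hedged sketch: the validator below is hypothetical (not part of the project code), with the range limit loosely modeled on the crawler's maxPages parameter:

```java
public class BoundarySketch {
    // Hypothetical validator: accepts page counts in [1, maxPages].
    static boolean isValidPageCount(int n, int maxPages) {
        return n >= 1 && n <= maxPages;
    }

    public static void main(String[] args) {
        int max = 1000;
        // Boundary value analysis: probe just below, at, and just above each boundary.
        int[] probes = {0, 1, 2, max - 1, max, max + 1};
        for (int p : probes) {
            System.out.println(p + " -> " + isValidPageCount(p, max));
        }
    }
}
```

Only the probes inside [1, 1000] validate; the off-by-one probes 0 and 1001 are exactly where boundary defects typically hide.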

6.1.3 Test Case Specifications:

Testcase number | Testcase    | Input     | Expected output | Obtained output
                | Select file | File name | Started process | Gives result and divides histograms

CONCLUSIONS AND FUTURE WORK


In this paper, we propose a multiviewpoint-based similarity measuring method, named MVS.
Theoretical analysis and empirical examples show that MVS is potentially more suitable for text
documents than the popular cosine similarity. Based on MVS, two criterion functions, IR and
IV, and their respective clustering algorithms, MVSC-IR and MVSC-IV, have been introduced.
Compared with other state-of-the-art clustering methods that use different types of similarity
measure, on a large number of document data sets and under different evaluation metrics, the
proposed algorithms show significantly improved clustering performance. The key contribution
of this paper is the fundamental concept of similarity measured from multiple viewpoints.
Future methods could make use of the same principle, but define alternative forms for the
relative similarity in (10), or combine the relative similarities from the different viewpoints by
methods other than averaging. Besides, this paper focuses on partitional clustering of
documents. In the future, it would also be possible to apply the proposed criterion functions to
hierarchical clustering algorithms. Finally, we have shown the application of MVS and its
clustering algorithms to text data. It would be interesting to explore how they work on other
types of sparse and high-dimensional data.

REFERENCES
1. X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z.-H. Zhou, M. Steinbach, D.J. Hand, and D. Steinberg, "Top 10 Algorithms in Data Mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37, 2007.
2. I. Guyon, U.V. Luxburg, and R.C. Williamson, "Clustering: Science or Art?," Proc. NIPS Workshop Clustering Theory, 2009.
3. I.S. Dhillon and D. Modha, "Concept Decompositions for Large Sparse Text Data Using Clustering," Machine Learning, vol. 42, nos. 1/2, pp. 143-175, Jan. 2001.
4. S. Zhong, "Efficient Online Spherical K-means Clustering," Proc. IEEE Int'l Joint Conf. Neural Networks (IJCNN), pp. 3180-3185, 2005.
5. A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, "Clustering with Bregman Divergences," J. Machine Learning Research, vol. 6, pp. 1705-1749, Oct. 2005.
6. E. Pekalska, A. Harol, R.P.W. Duin, B. Spillmann, and H. Bunke, "Non-Euclidean or Non-Metric Measures Can Be Informative," Structural, Syntactic, and Statistical Pattern Recognition, vol. 4109, pp. 871-880, 2006.
7. M. Pelillo, "What Is a Cluster? Perspectives from Game Theory," Proc. NIPS Workshop Clustering Theory, 2009.
8. D. Lee and J. Lee, "Dynamic Dissimilarity Measure for Support-Based Clustering," IEEE Trans. Knowledge and Data Eng., vol. 22, no. 6, pp. 900-905, June 2010.
9. A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Clustering on the Unit Hypersphere Using Von Mises-Fisher Distributions," J. Machine Learning Research, vol. 6, pp. 1345-1382, Sept. 2005.
10. W. Xu, X. Liu, and Y. Gong, "Document Clustering Based on Non-Negative Matrix Factorization," Proc. 26th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 267-273, 2003.
11. I.S. Dhillon, S. Mallela, and D.S. Modha, "Information-Theoretic Co-Clustering," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 89-98, 2003.
12. C.D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. Cambridge Univ. Press, 2009.
13. C. Ding, X. He, H. Zha, M. Gu, and H. Simon, "A Min-Max Cut Algorithm for Graph Partitioning and Data Clustering," Proc. IEEE Int'l Conf. Data Mining (ICDM), pp. 107-114, 2001.
14. H. Zha, X. He, C. Ding, H. Simon, and M. Gu, "Spectral Relaxation for K-Means Clustering," Proc. Neural Information Processing Systems (NIPS), pp. 1057-1064, 2001.
15. J. Shi and J. Malik, "Normalized Cuts and Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.
16. I.S. Dhillon, "Co-Clustering Documents and Words Using Bipartite Spectral Graph Partitioning," Proc. Seventh ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 269-274, 2001.
17. Y. Gong and W. Xu, Machine Learning for Multimedia Content Analysis. Springer-Verlag, 2007.
18. Y. Zhao and G. Karypis, "Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering," Machine Learning, vol. 55, no. 3, pp. 311-331, June 2004.
19. G. Karypis, "CLUTO: A Clustering Toolkit," technical report, Dept. of Computer Science, Univ. of Minnesota, http://glaros.dtc.umn.edu/~gkhome/views/cluto, 2003.
20. A. Strehl, J. Ghosh, and R. Mooney, "Impact of Similarity Measures on Web-Page Clustering," Proc. 17th Nat'l Conf. Artificial Intelligence: Workshop of Artificial Intelligence for Web Search (AAAI), pp. 58-64, July 2000.
