
INTRODUCTION TO WEKA:

Weka contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to
these functions.[1] The original non-Java version of Weka was a Tcl/Tk front-end to (mostly
third-party) modeling algorithms implemented in other programming languages, plus data
preprocessing utilities in C, and a Makefile-based system for running machine learning
experiments. This original version was primarily designed as a tool for analyzing data from
agricultural domains,[2][3] but the more recent fully Java-based version (Weka 3), for which
development started in 1997, is now used in many different application areas, in particular for
educational purposes and research. Advantages of Weka include:
● Free availability under the GNU General Public License.

● Portability, since it is fully implemented in the Java programming language and thus runs
on almost any modern computing platform.

● A comprehensive collection of data preprocessing and modeling techniques.

● Ease of use due to its graphical user interfaces.


Weka supports several standard data mining tasks, more specifically, data
preprocessing, clustering, classification, regression, visualization, and feature selection. All
of Weka's techniques are predicated on the assumption that the data is available as one flat
file or relation, where each data point is described by a fixed number of attributes (normally,
numeric or nominal attributes, but some other attribute types are also supported). Weka
provides access to SQL databases using Java Database Connectivity and can process the
result returned by a database query. Weka provides access to deep
learning with Deeplearning4j.[4] It is not capable of multi-relational data mining, but there is
separate software for converting a collection of linked database tables into a single table that
is suitable for processing using Weka.[5] Another important area that is currently not covered
by the algorithms included in the Weka distribution is sequence modeling.

Native Regression Tools:


Weka has a large number of regression and classification tools. Native packages are the ones included in the executable Weka software, while non-native ones can be downloaded and used within the RWeka environment. Among the native packages, the most famous tool is the M5P model tree package. Some of the regression tools are:
● M5Rules (the M5' algorithm expressed as a set of rules rather than a tree)

● DecisionStump (a one-level decision tree that outputs a single value at each leaf)

● M5P (splits the domain into successive binary regions and then fits a linear model to each leaf)

● RandomForest (an ensemble of randomized decision trees)

● REPTree (a fast decision/regression tree learner that uses reduced-error pruning)

● ZeroR (predicts the average of the output values)

● DecisionRules (splits data into several regions based on a single independent variable and provides a single output value for each range)

● LinearRegression

● SMOreg (support vector regression)

● SimpleLinearRegression (uses an intercept and only one input variable, even for multivariate data)

● MultiLayerPerceptron (a neural network)

User Interfaces:

Weka's main user interface is the Explorer, but essentially the same functionality can be
accessed through the component-based Knowledge Flow interface and from the command
line. There is also the Experimenter, which allows the systematic comparison of the
predictive performance of Weka's machine learning algorithms on a collection of datasets.
The Explorer interface features several panels providing access to the main components of
the workbench:
● The Preprocess panel has facilities for importing data from a database, a comma-separated values (CSV) file, etc., and for preprocessing this data using a so-called filtering algorithm. These filters can be used to transform the data (e.g., turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria.

● The Classify panel enables applying classification and regression algorithms (indiscriminately called classifiers in Weka) to the resulting dataset, to estimate the accuracy of the resulting predictive model, and to visualize erroneous predictions, receiver operating characteristic (ROC) curves, etc., or the model itself (if the model is amenable to visualization, e.g., a decision tree).

● The Associate panel provides access to association rule learners that attempt to identify
all important interrelationships between attributes in the data.

● The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-
means algorithm. There is also an implementation of the expectation maximization
algorithm for learning a mixture of normal distributions.

● The Select attributes panel provides algorithms for identifying the most predictive
attributes in a dataset.

● The Visualize panel shows a scatter plot matrix, where individual scatter plots can be
selected and enlarged, and analyzed further using various selection operators.
1. Aim:
Create an ARFF (Attribute-Relation File Format) file and read it in WEKA. Explore the purpose of each button under the Preprocess panel after loading the ARFF file. Also, try to interpret a different ARFF file, weather.arff, provided with WEKA.

Data Types:
1. Numeric
2. Nominal (one of a predefined list of values, declared by listing the possible values as {n1, n2, ..., nk}).
3. String (an attribute containing arbitrary textual values).
4. Date (values must be given in the data section in the declared format; the default is ISO-8601, "yyyy-MM-dd" followed by "T" and "HH:mm:ss").

Data Section:
@data
Starts the data segment in the file. Each instance is represented on a single line, terminated by a carriage return. Attribute values are delimited by commas and must appear in the order in which the attributes are declared in the header section. Missing values are represented with a question mark (?).
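The layout described above is easy to generate programmatically. A minimal plain-Python sketch (illustrative only, no Weka code; the attribute names and rows are hypothetical) that builds ARFF text with comma-delimited values and '?' for missing entries:

```python
def make_arff(relation, attributes, rows):
    """Build the text of an ARFF file: @relation, @attribute lines, @data."""
    lines = ["@relation " + relation]
    for name, atype in attributes:
        lines.append("@attribute {} {}".format(name, atype))
    lines.append("@data")
    for row in rows:
        # Missing values are written as '?'; other values are comma-delimited.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines)

arff_text = make_arff(
    "student",
    [("stdid", "numeric"), ("stdname", "string"), ("gender", "{female,male}")],
    [(1, "lavanya", "female"), (2, "suma", None)],
)
```

The resulting text can be saved with a .arff extension and opened directly in the WEKA Explorer.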

Creation Of ARFF File:


Method 1:
1. Create the file manually in Notepad and save it with the .arff extension.
2. Open the WEKA Explorer.
3. Select Preprocess, choose Open file, set the file type to All Files (*.*), and open file.arff.
Method 2:
a) Open an Excel file filename.xls.
b) Save it as filename.csv with the file type CSV (comma delimited).
c) Open filename.csv with MS Word and add the ARFF header declarations.
d) Save the file as file.arff with the file type Plain Text.
e) An ARFF file will be created, and it can be opened with the WEKA Explorer.
Example:
@relation student
@attribute stdid real
@attribute stdname string
@attribute gender {female,male}
@attribute branch {CSSE,CSIT,CSNW}
@data
1,lavanya,female,CSSE
2,suma,female,CSIT
3,yoshitha,female,CSNW
Output:
Example 2:
Weather Dataset Using Nominal Attributes:
@relation weather.symbolic
@attribute outlook {sunny,overcast,rainy}
@attribute temperature {hot,mild,cool}
@attribute humidity {high,normal}
@attribute windy {TRUE,FALSE}
@attribute play {yes,no}
@data

sunny,hot,high,FALSE,no

sunny,hot,high,TRUE,no

overcast,hot,high,FALSE,yes

rainy,mild,high,FALSE,yes

rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no

overcast,cool,normal,TRUE,yes

sunny,mild,high,FALSE,no

sunny,cool,normal,FALSE,yes

rainy,mild,normal,FALSE,yes

sunny,mild,normal,TRUE,yes

overcast,mild,high,TRUE,yes

overcast,hot,normal,FALSE,yes

rainy,mild,high,TRUE,no
Output:
Weather Dataset Using Numeric Attributes:
@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE,FALSE}
@attribute play {yes,no}
@data

sunny,85,85,FALSE,no

sunny,80,90,TRUE,no

overcast,83,86,FALSE,yes

rainy,70,96,FALSE,yes

rainy,68,80,FALSE,yes

rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes

sunny,72,95,FALSE,no

sunny,69,70,FALSE,yes

rainy,75,80,FALSE,yes

sunny,75,70,TRUE,yes

overcast,72,90,TRUE,yes

overcast,81,75,FALSE,yes

rainy,71,91,TRUE,no
Output:
2. Aim:
Performing data preprocessing in WEKA.
Study unsupervised attribute filters: ReplaceMissingValues to replace missing values in the given dataset, Add to add a new attribute Average, and Discretize to discretize the attributes into bins. Explore the Normalize and Standardize options on a dataset with numerical attributes.

Introduction:
The first four buttons at the top of the preprocess section enable you to load data into WEKA:
1. Open file :
Brings up a dialog box allowing you to browse for the data file on the local
file system.
2. Open URL:
Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB:
Reads data from a database. (Note that to make this work you might have to
edit the file in weka/experiment/DatabaseUtils.props.)
4. Generate:
Enables you to generate artificial data from a variety of DataGenerators.

Unsupervised attribute filters:


Unsupervised attribute filters such as ReplaceMissingValues can be used to handle missing values in a given dataset as follows.
Instances with missing values do not have to be removed; you can replace the missing values with some other value. This is called imputing missing values.
It is common to impute missing values with the mean of the numerical distribution. You can do this easily in WEKA using the ReplaceMissingValues filter:
1. Click the "Choose" button for the Filter and select ReplaceMissingValues; it is under unsupervised.attribute.ReplaceMissingValues.
2. Click the "Apply" button to apply the filter to your dataset.
3. Click "mass" in the "Attributes" section and review the details of the "Selected attribute". Notice that the attribute values that were marked Missing have been set to the mean value of the distribution.
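What the filter does for one numeric attribute can be sketched in plain Python (an illustration of mean imputation, not Weka's implementation; the sample values are made up):

```python
def replace_missing_with_mean(values):
    """Impute None entries with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# None plays the role of Weka's '?' missing-value marker.
mass = [33.6, None, 23.3, None, 43.1]
imputed = replace_missing_with_mean(mass)
```

Every missing entry receives the same value: the mean of the observed distribution.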
Discretization:
Discretization is the process of converting continuous-valued attributes into discrete variables with a small number of values. Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. This leads to a concise, easy-to-use, knowledge-level representation of mining results. Data discretization can be performed before or during data mining. Most real-world datasets contain continuous attributes, and some machine learning algorithms that can handle both continuous and discrete attributes perform better with discrete-valued attributes.
Discretization:
1. Divides the ranges of continuous attributes into intervals.
2. Serves classification algorithms that only accept categorical attributes.
3. Reduces data size.
4. Prepares data for further analysis.
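The default behaviour of Weka's unsupervised Discretize filter is equal-width binning, which point 1 above describes. A rough plain-Python sketch of that idea (the bin count and temperature values are illustrative):

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        # The maximum value falls in the last bin rather than one past it.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins.append(idx)
    return bins

temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
bins = equal_width_bins(temps, 3)
```

Each bin index can then be replaced by an interval label such as "(64-71]".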

Output:
Normalization:
Normalization is the process of rescaling data so that it falls within a small specified range, either -1.0 to 1.0 or 0.0 to 1.0. (In other contexts, normalization refers to dividing a vector by its length.) You can normalize all of the attributes in your dataset with Weka by choosing the Normalize filter and applying it to your dataset.

Formula to calculate normalization:

Xnew = (X - Xmin) / (Xmax - Xmin)
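Applied element-wise, the formula looks like this (a plain-Python sketch, not the Weka filter itself):

```python
def normalize(values):
    """Min-max scale values into [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# The minimum maps to 0.0 and the maximum to 1.0.
scaled = normalize([64, 72, 85])
```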

Output:
Standardization:
Standardization is the process of rescaling all numerical values in a given dataset to have zero mean and unit variance. You can standardize all of the attributes in your dataset with Weka by choosing the Standardize filter and applying it to your dataset.

Formula to calculate standardization:

Xnew = (X - mean) / std
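As a plain-Python sketch (this uses the population standard deviation; which variant a given tool uses is worth checking):

```python
import math

def standardize(values):
    """Rescale values to zero mean and unit variance: (x - mean) / std."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / std for v in values]

z = standardize([2.0, 4.0, 6.0])
```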

Which algorithms benefit from standardization? Most algorithms will probably benefit from standardization more so than from normalization.

Output:
Procedure:
1. Create an ARFF file.
2. Open WEKA and select the Knowledge Flow layout from the application.
3. Bring an ArffLoader from DataSources to the Knowledge Flow layout.
4. Bring Discretize, Normalize and Standardize filters to the Knowledge Flow layout from Filters.
5. Bring three ArffSavers from DataSinks to the Knowledge Flow layout.
6. Configure the ArffLoader and the three ArffSavers.
7. Link the ArffLoader to Discretize, Standardize and Normalize (dataSet connections).
8. Link Discretize, Standardize and Normalize to the ArffSavers.
9. Start loading from the ArffLoader; the filtered datasets are written by the ArffSavers.


3. Aim:
Classification using the WEKA toolkit

Classification:
Classification is the task of predicting the target class of new data by analysis of a training dataset. This is done by finding proper boundaries for each target class. Generally speaking, we use the training dataset to obtain boundary conditions that can be used to determine each target class. Once the boundary conditions are determined, the next task is to predict the target class. The whole process is known as classification.
i) Demonstration of the classification process using the ID3 algorithm on a categorical dataset (weather):

ID3 Algorithm:
In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross
Quinlan used to generate a decision tree from the dataset. To model the classification process,
a tree is constructed using the decision tree technique. Once a tree is built, it is applied to
each tuple in the database and this results in classification for that tuple.
The following issues are faced by most decision tree algorithms:
• To choose splitting attributes
• Order of splitting attributes
• Number of splits to be taken
• Balance of tree structure and pruning
• The stopping criteria
The decision tree algorithm is based on entropy; its main idea is to map all examples to different categories based upon different values of the condition attribute set. Its core is to determine the best classification attribute from the condition attribute set. The algorithm chooses information gain as the attribute selection criterion; usually the attribute that has the highest information gain is selected as the splitting attribute of the current node.
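The information-gain criterion can be made concrete with a short calculation. A plain-Python sketch computing the gain of the outlook attribute on the standard 14-instance weather data (hand-rolled, not Weka's implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """Entropy reduction obtained by splitting on one attribute."""
    n = len(labels)
    groups = {}
    for v, label in zip(attr_values, labels):
        groups.setdefault(v, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Outlook and play columns of the standard 14-instance weather data.
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
gain = information_gain(outlook, play)
```

Outlook has the highest gain of the four attributes, which is why it becomes the root of the tree in the output below is plausible; the overcast branch is pure (all yes), so its subtree terminates immediately.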
Advantages of ID3:
1. Understandable prediction rules are created from the training data.
2. Builds the fastest tree.
3. Builds a short tree.
4. Only need to test enough attributes until all data is classified.
5. Finding leaf nodes enables test data to be pruned, reducing number of tests.
6. Whole dataset is searched to create tree.
Procedure:
STEP 1: Choose the data file required to classify.
STEP 2: Then select classify option in the menu.
STEP 3: Choose the classifying algorithm
STEP 4: Select the ID3 algorithm, set the desired number of cross-validation folds (the run below uses 10), and then click Start.
STEP 5: Then the data file is processed and the resultant is displayed in classifier output.

Output:
=== Run information ===

Scheme: weka.classifiers.trees.Id3
Relation: weather.symbolic
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Id3

outlook = sunny
| humidity = high: no
| humidity = normal: yes
outlook = overcast: yes
outlook = rainy
| windy = TRUE: no
| windy = FALSE: yes

Time taken to build model: 0.02 seconds

=== Stratified cross-validation ===


=== Summary ===

Correctly Classified Instances 12 85.7143 %


Incorrectly Classified Instances 2 14.2857 %
Kappa statistic 0.6889
Mean absolute error 0.1429
Root mean squared error 0.378
Relative absolute error 30 %
Root relative squared error 76.6097 %
Total Number of Instances 14

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
Class
0.889 0.200 0.889 0.889 0.889 0.689 0.844 0.862 yes
0.800 0.111 0.800 0.800 0.800 0.689 0.844 0.711 no
Weighted Avg. 0.857 0.168 0.857 0.857 0.857 0.689 0.844 0.808

=== Confusion Matrix ===

a b <-- classified as
8 1 | a = yes
1 4 | b = no
ii)Demonstration of classification process using naive bayes algorithm on categorical
dataset(vote):

Naive Bayes Algorithm:


Naive Bayes is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple, and that is why it is known as 'Naive'.
The Naive Bayes model is easy to build and particularly useful for very large datasets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Advantages of Naive Bayes:
➢ It is easy to predict the class of a test dataset. It also performs well in multi-class prediction.
➢ When the assumption of independence holds, a Naive Bayes classifier performs better than other models like logistic regression, and you need less training data.
➢ It performs well with categorical input variables compared to numeric variables. For numeric variables, a normal distribution is assumed.
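The independence assumption turns training into simple counting. A toy plain-Python sketch on the 14-instance weather data, using Laplace (add-one) smoothing so unseen values never get zero probability (an illustration, not Weka's NaiveBayes code):

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count attribute-value frequencies per class, for smoothed estimates."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attr_index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, label)][v] += 1
    return class_counts, value_counts

def predict(row, class_counts, value_counts, n_values):
    """Pick argmax of P(class) * prod P(value | class), Laplace-smoothed."""
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n
        for i, v in enumerate(row):
            score *= (value_counts[(i, c)][v] + 1) / (cc + n_values[i])
        if score > best_score:
            best, best_score = c, score
    return best

# outlook, temperature, humidity, windy -> play
rows = [
    ("sunny", "hot", "high", "FALSE"), ("sunny", "hot", "high", "TRUE"),
    ("overcast", "hot", "high", "FALSE"), ("rainy", "mild", "high", "FALSE"),
    ("rainy", "cool", "normal", "FALSE"), ("rainy", "cool", "normal", "TRUE"),
    ("overcast", "cool", "normal", "TRUE"), ("sunny", "mild", "high", "FALSE"),
    ("sunny", "cool", "normal", "FALSE"), ("rainy", "mild", "normal", "FALSE"),
    ("sunny", "mild", "normal", "TRUE"), ("overcast", "mild", "high", "TRUE"),
    ("overcast", "hot", "normal", "FALSE"), ("rainy", "mild", "high", "TRUE"),
]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]
cc, vc = train(rows, labels)
# n_values = number of distinct values each attribute can take (3, 3, 2, 2).
pred = predict(("sunny", "cool", "high", "TRUE"), cc, vc, n_values=(3, 3, 2, 2))
```

For this unseen day the "no" class wins: its smoothed likelihood product outweighs the larger prior of "yes".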
Procedure:
STEP 1: Choose the data file required to classify.
STEP 2: Then select classify option in the menu.
STEP 3: Choose the classifying algorithm
STEP 4: Select the Naive Bayes algorithm, set the desired number of cross-validation folds (the run below uses 10), and then click Start.
STEP 5: Then the data file is processed and the resultant is displayed in classifier output.

Output:
=== Run information ===

Scheme: weka.classifiers.bayes.NaiveBayes
Relation: vote
Instances: 435
Attributes: 17
handicapped-infants
water-project-cost-sharing
adoption-of-the-budget-resolution
physician-fee-freeze
el-salvador-aid
religious-groups-in-schools
anti-satellite-test-ban
aid-to-nicaraguan-contras
mx-missile
immigration
synfuels-corporation-cutback
education-spending
superfund-right-to-sue
crime
duty-free-exports
export-administration-act-south-africa
Class
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Naive Bayes Classifier

Class
Attribute democrat republican
(0.61) (0.39)
===============================================================
handicapped-infants
n 103.0 135.0
y 157.0 32.0
[total] 260.0 167.0

water-project-cost-sharing
n 120.0 74.0
y 121.0 76.0
[total] 241.0 150.0

adoption-of-the-budget-resolution
n 30.0 143.0
y 232.0 23.0
[total] 262.0 166.0

physician-fee-freeze
n 246.0 3.0
y 15.0 164.0
[total] 261.0 167.0

el-salvador-aid
n 201.0 9.0
y 56.0 158.0
[total] 257.0 167.0

religious-groups-in-schools
n 136.0 18.0
y 124.0 150.0
[total] 260.0 168.0
anti-satellite-test-ban
n 60.0 124.0
y 201.0 40.0
[total] 261.0 164.0

aid-to-nicaraguan-contras
n 46.0 134.0
y 219.0 25.0
[total] 265.0 159.0

mx-missile
n 61.0 147.0
y 189.0 20.0
[total] 250.0 167.0

immigration
n 140.0 74.0
y 125.0 93.0
[total] 265.0 167.0

synfuels-corporation-cutback
n 127.0 139.0
y 130.0 22.0
[total] 257.0 161.0

education-spending
n 214.0 21.0
y 37.0 136.0
[total] 251.0 157.0

superfund-right-to-sue
n 180.0 23.0
y 74.0 137.0
[total] 254.0 160.0

crime
n 168.0 4.0
y 91.0 159.0
[total] 259.0 163.0

duty-free-exports
n 92.0 143.0
y 161.0 15.0
[total] 253.0 158.0

export-administration-act-south-africa
n 13.0 51.0
y 174.0 97.0
[total] 187.0 148.0
Time taken to build model: 0.02 seconds

=== Stratified cross-validation ===


=== Summary ===

Correctly Classified Instances 392 90.1149 %


Incorrectly Classified Instances 43 9.8851 %
Kappa statistic 0.7949
Mean absolute error 0.0995
Root mean squared error 0.2977
Relative absolute error 20.9815 %
Root relative squared error 61.1406 %
Total Number of Instances 435

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
Class
0.891 0.083 0.944 0.891 0.917 0.797 0.973 0.984 democrat
0.917 0.109 0.842 0.917 0.877 0.797 0.973 0.957 republican
Weighted Avg. 0.901 0.093 0.905 0.901 0.902 0.797 0.973 0.973

=== Confusion Matrix ===

a b <-- classified as
238 29 | a = democrat
14 154 | b = republican
iii) Demonstration of the classification process using the Random Forest algorithm on datasets containing a large number of attributes (diabetes):

Random Forest Algorithm:


Random Forest is a flexible, easy-to-use machine learning algorithm that produces a great result most of the time, even without hyper-parameter tuning. It is also one of the most used algorithms, because of its simplicity and the fact that it can be used for both classification and regression tasks. Random Forest is a supervised learning algorithm.

Random Forest has nearly the same hyperparameters as a decision tree or a bagging classifier. Fortunately, you don't have to combine a decision tree with a bagging classifier; you can simply use the Random Forest classifier class. You can also deal with regression tasks by using the Random Forest regressor. Random Forest adds additional randomness to the model while growing the trees. Instead of searching for the most important feature while splitting a node, it searches for the best feature among a random subset of features. This results in a wide diversity that generally results in a better model.
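Two of the ingredients just described, bootstrap sampling and majority voting, plus the random feature subset considered at each split, can be sketched in isolation (plain Python, illustrative only; growing the individual trees is omitted):

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as bagging does for each tree."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    """Pick k candidate features for one split: the forest's extra randomness."""
    return rng.sample(range(n_features), k)

def majority_vote(tree_predictions):
    """The forest's final class is the mode of the individual trees' votes."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(1)
sample = bootstrap_sample([("a", 0), ("b", 1), ("c", 0)], rng)
features = random_feature_subset(9, 3, rng)   # e.g., 3 of diabetes's 9 attributes
verdict = majority_vote(["tested_negative", "tested_positive", "tested_negative"])
```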

Advantages of Random Forest:

➢ It is one of the most accurate learning algorithms available. For many data sets, it
produces a highly accurate classifier.

➢ It runs efficiently on large databases.

➢ It can handle thousands of input variables without variable deletion.

➢ It gives estimates of what variables are important in the classification.

➢ It generates an internal unbiased estimate of the generalization error as the forest


building progresses.

➢ It has an effective method for estimating missing data and maintains accuracy when a
large proportion of the data are missing.
Procedure:
STEP 1: Choose the data file required to classify.
STEP 2: Then select classify option in the menu.
STEP 3: Choose the classifying algorithm.
STEP 4: Select the Random Forest algorithm, set the desired number of cross-validation folds (the run below uses 10), and then click Start.
STEP 5: Then the data file is processed and the resultant is displayed in classifier
output.

Output:
=== Run information ===

Scheme: weka.classifiers.trees.RandomForest -P 100 -I 100 -num-slots 1 -K 0 -M 1.0 -V 0.001 -S 1
Relation: pima_diabetes
Instances: 768
Attributes: 9
preg
plas
pres
skin
insu
mass
pedi
age
class
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

RandomForest

Bagging with 100 iterations and base learner

weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1 -do-not-check-capabilities

Time taken to build model: 0.75 seconds

=== Stratified cross-validation ===


=== Summary ===

Correctly Classified Instances 582 75.7813 %


Incorrectly Classified Instances 186 24.2188 %
Kappa statistic 0.4566
Mean absolute error 0.3106
Root mean squared error 0.4031
Relative absolute error 68.3405 %
Root relative squared error 84.5604 %
Total Number of Instances 768

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
Class
0.836 0.388 0.801 0.836 0.818 0.458 0.820 0.886
tested_negative
0.612 0.164 0.667 0.612 0.638 0.458 0.820 0.679
tested_positive
Weighted Avg. 0.758 0.310 0.754 0.758 0.755 0.458 0.820 0.814

=== Confusion Matrix ===

a b <-- classified as
418 82 | a = tested_negative
104 164 | b = tested_positive
4. Aim:
To demonstrate the classification process using the J48 algorithm on a mixed-type dataset after discretizing numeric attributes. To perform a cross-validation strategy with various fold levels and compare the accuracy of the results.

J48 Algorithm:
A decision tree is a predictive machine-learning model that decides the target value of a new
sample based on various attribute values of the available data. The internal nodes of a
decision tree denote the different attributes, the branches between the nodes tell us the
possible values that these attributes can have in the observed samples, while the terminal
nodes tell us the final value of the dependent variable.

The J48 Decision tree classifier follows the following simple algorithm. In order to classify a
new item, it first needs to create a decision tree based on the attribute values of the available
training data. So, whenever it encounters a set of items it identifies the attribute that
discriminates the various instances most clearly. This feature that is able to tell us most about
the data instances so that we can classify them the best is said to have the highest information
gain. Now, among the possible values of this feature, if there is any value for which there is
no ambiguity, that is, for which the data instances falling within its category have the same
value for the target variable, then we terminate that branch and assign to it the target value
that we have obtained.

For the other cases, we then look for another attribute that gives us the highest information
gain. Hence we continue in this manner until we either get a clear decision of what
combination of attributes gives us a particular target value, or we run out of attributes. In the
event that we run out of attributes, or if we cannot get an unambiguous result from the
available information, we assign this branch a target value that the majority of the items
under this branch possess.

Now that we have the decision tree, we follow the order of attribute selection as we have
obtained for the tree. By checking all the respective attributes and their values with those seen
in the decision tree model, we can assign or predict the target value of this new instance.
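Applying the finished tree to a new instance is just a walk down the branches. For example, the pruned tree that J48 learns on the iris data in this experiment can be written as nested conditionals (a plain-Python transcription of that tree, not J48 itself):

```python
def classify_iris(petallength, petalwidth):
    """Follow the J48 pruned tree for iris, one branch test per level."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    if petalwidth <= 1.7:
        if petallength <= 4.9:
            return "Iris-versicolor"
        # petallength > 4.9: a narrow petal still indicates virginica
        return "Iris-virginica" if petalwidth <= 1.5 else "Iris-versicolor"
    return "Iris-virginica"

label = classify_iris(petallength=1.4, petalwidth=0.2)
```

Note that only petalwidth and petallength appear in the tree: the two sepal attributes carry too little information gain to be selected.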

Procedure:
STEP 1: Choose the data file required to classify.
STEP 2: Then select classify option in the menu.
STEP 3: Choose the classifying algorithm
STEP 4: Select the J48 algorithm, choose 2-fold cross-validation, and then click Start.
STEP 5: Then the data file is processed and the result is displayed in the classifier output.
STEP 6: Select the J48 algorithm, choose 4-fold cross-validation, and then click Start.
STEP 7: Then the data file is processed and the result is displayed in the classifier output.
STEP 8: Select the J48 algorithm, choose 6-fold cross-validation, and then click Start.
STEP 9: Then the data file is processed and the result is displayed in the classifier output.
STEP 10: Compare the three results to check the accuracy.

Output:
2 Fold j48:
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: iris
Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
Test mode: 2-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
| petalwidth <= 1.7
| | petallength <= 4.9: Iris-versicolor (48.0/1.0)
| | petallength > 4.9
| | | petalwidth <= 1.5: Iris-virginica (3.0)
| | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)
| petalwidth > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
Time taken to build model: 0.03 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 140 93.3333 %
Incorrectly Classified Instances 10 6.6667 %
Kappa statistic 0.9
Mean absolute error 0.0515
Root mean squared error 0.2079
Relative absolute error 11.5769 %
Root relative squared error 44.1085 %
Total Number of Instances 150
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
Class
0.980 0.000 1.000 0.980 0.990 0.985 0.990 0.987 Iris-setosa
0.900 0.050 0.900 0.900 0.900 0.850 0.925 0.843 Iris-versicolor
0.920 0.050 0.902 0.920 0.911 0.866 0.955 0.869 Iris-virginica
Weighted Avg. 0.933 0.033 0.934 0.933 0.934 0.900 0.957 0.900
=== Confusion Matrix ===
a b c <-- classified as
49 1 0 | a = Iris-setosa
0 45 5 | b = Iris-versicolor
1 4 46 | c = Iris-virginica
4 fold j48:
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: iris
Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
Test mode: 4-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
| petalwidth <= 1.7
| | petallength <= 4.9: Iris-versicolor (48.0/1.0)
| | petallength > 4.9
| | | petalwidth <= 1.5: Iris-virginica (3.0)
| | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)
| petalwidth > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 143 95.3333 %
Incorrectly Classified Instances 7 4.6667 %
Kappa statistic 0.93
Mean absolute error 0.041
Root mean squared error 0.1761
Relative absolute error 9.2165 %
Root relative squared error 37.3534 %
Total Number of Instances 150
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area
Class
0.980 0.000 1.000 0.980 0.990 0.985 0.990 0.987 Iris-setosa
0.920 0.030 0.939 0.920 0.929 0.895 0.945 0.902 Iris-versicolor
0.960 0.040 0.923 0.960 0.941 0.911 0.968 0.893 Iris-virginica
Weighted Avg. 0.953 0.023 0.954 0.953 0.953 0.930 0.968 0.927
=== Confusion Matrix ===
a b c <-- classified as
49 1 0 | a = Iris-setosa
0 46 4 | b = Iris-versicolor
0 2 48 | c = Iris-virginica

6 fold j48:
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: iris
Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
Test mode: 6-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
| petalwidth <= 1.7
| | petallength <= 4.9: Iris-versicolor (48.0/1.0)
| | petallength > 4.9
| | | petalwidth <= 1.5: Iris-virginica (3.0)
| | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)
| petalwidth > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 144 96 %
Incorrectly Classified Instances 6 4 %
Kappa statistic 0.94
Mean absolute error 0.036
Root mean squared error 0.1636
Relative absolute error 8.0943 %
Root relative squared error 34.6865 %
Total Number of Instances 150
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.980 0.000 1.000 0.980 0.990 0.985 0.990 0.987 Iris-setosa
0.940 0.030 0.940 0.940 0.940 0.910 0.943 0.866 Iris-versicolor
0.960 0.030 0.941 0.960 0.950 0.925 0.957 0.893 Iris-virginica
Weighted Avg. 0.960 0.020 0.960 0.960 0.960 0.940 0.963 0.915
=== Confusion Matrix ===
a b c <-- classified as
49 1 0 | a = Iris-setosa
0 47 3 | b = Iris-versicolor
0 2 48 | c = Iris-virginica

Across the three runs, Iris-setosa is classified almost perfectly, while Iris-versicolor and Iris-virginica show a few misclassifications. Overall accuracy improves with the number of folds: 93.33 % (2-fold), 95.33 % (4-fold) and 96 % (6-fold).
5. Aim:
To apply the HIERARCHICAL CLUSTERING algorithm on a numeric dataset and estimate cluster quality.

Hierarchical Clustering Algorithm:


Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering is this:

1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
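Steps 1-4 can be sketched directly. A minimal plain-Python single-linkage example on one-dimensional points (cluster distance = distance between the closest pair of members, the same linkage selected by -L SINGLE in the run below; illustrative data, stops at a target cluster count rather than merging all the way to one):

```python
def single_linkage(points, n_clusters):
    """Merge the two closest clusters (single linkage) until n_clusters remain."""
    clusters = [[p] for p in points]          # step 1: one item per cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)          # step 2: track the closest pair
        _, i, j = best
        clusters[i] += clusters.pop(j)        # merge; distances (step 3) are
                                              # recomputed on the next pass
    return clusters

groups = single_linkage([1.0, 2.0, 10.0, 11.0, 12.0], n_clusters=2)
```

With -N 2 the two well-separated groups are recovered.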

Procedure:
STEP 1: Choose the data file required to cluster.
STEP 2: Then select cluster option in the menu.
STEP 3: Choose the clustering algorithm
STEP 4: Select the Hierarchical Clustering algorithm and then click Start.
STEP 5: Then the data file is processed and the resultant is displayed in cluster output.

Output:
=== Run information ===

Scheme: weka.clusterers.HierarchicalClusterer -N 2 -L SINGLE -P -A "weka.core.EuclideanDistance -R first-last"
Relation: pima_diabetes
Instances: 768
Attributes: 9
preg
plas
pres
skin
insu
mass
pedi
age
class
Test mode: evaluate on training data

=== Clustering model (full training set) ===

Cluster 0
((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((1.0:0.16243,1.0:0.16243):0.
01512,1.0:0.17755):0.01402,1.0:0.19157):0.00239,(1.0:0.1806,1.0:0.1806):0.01336):0.01194
,1.0:0.20589):0.00377,((1.0:0.18258,1.0:0.18258):0.02095,1.0:0.20353):0.00613):0.00999,1.
0:0.21966):0.00404,1.0:0.2237):0.00579,1.0:0.22949):0.00227,(((1.0:0.21571,(((1.0:0.16563,
1.0:0.16563):0.00997,1.0:0.17559):0.01344,1.0:0.18904):0.02667):0.00697,((1.0:0.17094,1.0
:0.17094):0.04365,(1.0:0.20524,1.0:0.20524):0.00935):0.00809):0.00018,1.0:0.22285):0.008
91):0.00051,1.0:0.23228):0.00138,1.0:0.23366):0.00118,(1.0:0.23269,1.0:0.23269):0.00215):
0.00789,((((((((((((((1.0:0.22168,((((((((((((((((((((((((((1.0:0.12408,1.0:0.12408):0.00497,1.0:
0.12905):0.02476,(((1.0:0.13849,((1.0:0.1028,1.0:0.1028):0.02957,1.0:0.13238):0.00611):0.0
0213,(((1.0:0.11811,1.0:0.11811):0.01056,1.0:0.12867):0.001,(1.0:0.08378,1.0:0.08378):0.04
588):0.01095):0.01107,1.0:0.15169):0.00212):0.00518,1.0:0.159):0.00847,1.0:0.16747):0.00
331,1.0:0.17078):0.00048,1.0:0.17126):0.00086,(1.0:0.16602,1.0:0.16602):0.0061):0.0004,((
(1.0:0.11636,1.0:0.11636):0.02165,((1.0:0.12112,1.0:0.12112):0.00847,1.0:0.12958):0.00843
):0.01179,1.0:0.1498):0.02271):0.00279,((1.0:0.11501,1.0:0.11501):0.05989,1.0:0.1749):0.00
04):0.00392,1.0:0.17923):0.00154,1.0:0.18076):0.00214,1.0:0.18291):0.00604,1.0:0.18895):
0.01162,1.0:0.20057):0.00176,1.0:0.20233):0.00113,(1.0:0.16712,(1.0:0.13515,1.0:0.13515):
0.03196):0.03634):0.00379,1.0:0.20724):0.0002,(1.0:0.17422,1.0:0.17422):0.03322):0.00088
,1.0:0.20832):0.0024,(1.0:0.20527,1.0:0.20527):0.00545):0.00097,(1.0:0.19412,1.0:0.19412):
0.01757):0.00077,1.0:0.21246):0.00035,1.0:0.21281):0.00054,1.0:0.21335):0.00376,1.0:0.21
711):0.00457):0.0002,((1.0:0.17824,1.0:0.17824):0.0313,((1.0:0.09973,1.0:0.09973):0.02831
,1.0:0.12804):0.0815):0.01233):0.00033,((((((((1.0:0.14403,1.0:0.14403):0.00954,1.0:0.1535
8):0.06106,(((1.0:0.18043,1.0:0.18043):0.01027,1.0:0.19069):0.01712,1.0:0.20781):0.00683)
:0.00281,1.0:0.21745):0.00093,(((((((((((((((((((1.0:0.1229,1.0:0.1229):0.00673,1.0:0.12963):
0.03598,((1.0:0.11006,1.0:0.11006):0.03574,1.0:0.1458):0.01982):0.00203,1.0:0.16764):0.01
06,1.0:0.17824):0.00232,1.0:0.18056):0.00507,1.0:0.18564):0.00202,1.0:0.18766):0.00417,1.
0:0.19183):0.0028,1.0:0.19464):0.00041,((1.0:0.17495,((1.0:0.15996,1.0:0.15996):0.01258,1.
0:0.17254):0.00241):0.00264,1.0:0.17759):0.01746):0.00053,1.0:0.19558):0.00449,1.0:0.200
07):0.00081,((((((((1.0:0.17903,((((1.0:0.11301,(1.0:0.07607,1.0:0.07607):0.03693):0.01799,
1.0:0.131):0.01942,1.0:0.15042):0.02059,(1.0:0.16646,(1.0:0.1515,1.0:0.1515):0.01496):0.00
454):0.00803):0.0001,(1.0:0.17022,1.0:0.17022):0.00892):0.00272,1.0:0.18185):0.00563,1.0:
0.18748):0.00047,1.0:0.18795):0.00054,1.0:0.18848):0.00532,1.0:0.1938):0.00502,1.0:0.198
82):0.00206):0.00008,1.0:0.20096):0.00054,1.0:0.2015):0.00146,1.0:0.20295):0.00583,((1.0:
0.20043,1.0:0.20043):0.00293,1.0:0.20336):0.00542):0.00872,(((1.0:0.13437,1.0:0.13437):0.
0618,1.0:0.19618):0.0125,1.0:0.20868):0.00883):0.00088):0.00114,1.0:0.21952):0.00162,1.0
:0.22114):0.00026,1.0:0.2214):0.00081):0.00043,((1.0:0.15582,1.0:0.15582):0.02883,1.0:0.1
8465):0.038):0.00241,1.0:0.22506):0.00453,1.0:0.22959):0.00062,((1.0:0.21279,1.0:0.21279)
:0.00162,1.0:0.2144):0.01581):0.00001,(1.0:0.18514,1.0:0.18514):0.04508):0.0001,1.0:0.230
32):0.00121,1.0:0.23153):0.00181,1.0:0.23335):0.00203,1.0:0.23538):0.00013,((1.0:0.18727,
1.0:0.18727):0.0067,1.0:0.19397):0.04154):0.0006,1.0:0.23612):0.00661):0.0001,1.0:0.2428
2):0.00008,(1.0:0.22005,1.0:0.22005):0.02286):0.00027,(1.0:0.23415,1.0:0.23415):0.00903):
0.00018,(1.0:0.24142,(((1.0:0.23255,((1.0:0.21728,((1.0:0.20417,1.0:0.20417):0.01296,(1.0:0
.21505,(1.0:0.20165,1.0:0.20165):0.0134):0.00209):0.00015):0.00237,1.0:0.21965):0.0129):0
.0038,1.0:0.23635):0.00362,1.0:0.23997):0.00145):0.00194):0.00004,1.0:0.2434):0.00045,1.
0:0.24385):0.00004,1.0:0.24389):0.00182,1.0:0.24572):0.00524,1.0:0.25096):0.00051,(1.0:0.
22471,1.0:0.22471):0.02676):0.00431,(1.0:0.25456,1.0:0.25456):0.00122):0.00048,(1.0:0.20
094,1.0:0.20094):0.05532):0.00036,1.0:0.25662):0.00056,(1.0:0.25142,1.0:0.25142):0.00575
):0.00035,1.0:0.25752):0.00128,((1.0:0.2137,1.0:0.2137):0.02474,((1.0:0.1704,1.0:0.1704):0.
01336,1.0:0.18375):0.05468):0.02037):0.0002,1.0:0.259):0.00046,1.0:0.25946):0.00073,1.0:
0.26019):0.01098,1.0:0.27117):0.00069,((1.0:0.23177,1.0:0.23177):0.0174,1.0:0.24916):0.02
27):0.00023,1.0:0.27209):0.00147,1.0:0.27356):0.00135,1.0:0.27491):0.00057,(1.0:0.26536,
1.0:0.26536):0.01012):0.0015,1.0:0.27698):0.00095,(1.0:0.21835,1.0:0.21835):0.05959):0.00
243,1.0:0.28037):0.00132,((1.0:0.26062,1.0:0.26062):0.01101,1.0:0.27163):0.01006):0.0014
1,1.0:0.28311):0.00372,1.0:0.28683):0.00242,(1.0:0.25593,1.0:0.25593):0.03332):0.00051,1.
0:0.28976):0.00018,1.0:0.28994):0.0014,1.0:0.29134):0.00074,1.0:0.29208):0.00025,1.0:0.29
233):0.00213,1.0:0.29446):0.00679,(1.0:0.21268,1.0:0.21268):0.08856):0.00172,1.0:0.30297
):0.00293,1.0:0.30589):0.00021,1.0:0.3061):0.0009,((1.0:0.27068,1.0:0.27068):0.02674,1.0:0
.29742):0.00958):0.00083,1.0:0.30783):0.00507,1.0:0.31289):0.00314,1.0:0.31603):0.00608,
1.0:0.32211):0.0004,1.0:0.32251):0.02337,1.0:0.34588):0.00909,1.0:0.35497):0.00397,1.0:0.
35894):0.00676,(1.0:0.26269,1.0:0.26269):0.10302):0.00931,1.0:0.37501):0.02586,1.0:0.400
87):0.01304,1.0:0.41392):0.00996,1.0:0.42387):0.0081,1.0:0.43197):0.00906,1.0:0.44103):0.
00283,(((1.0:0.2002,1.0:0.2002):0.09545,((((((((1.0:0.07706,1.0:0.07706):0.10649,1.0:0.1835
4):0.00398,1.0:0.18753):0.01702,(1.0:0.18261,1.0:0.18261):0.02193):0.00478,(1.0:0.18142,1
.0:0.18142):0.0279):0.00676,1.0:0.21608):0.04182,1.0:0.2579):0.03042,1.0:0.28832):0.0073
2):0.12487,(1.0:0.38103,1.0:0.38103):0.03948):0.02335):0.00943,1.0:0.45329):0.00495,1.0:0
.45825):0.02471,1.0:0.48296):0.05516,(1.0:0.23065,1.0:0.23065):0.30747):0.18743,1.0:0.72
555)

Cluster 1
((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((0.0:0.12784,0.0:0.12784):0
.02126,((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((0.0:0.10816,0.0:0.10816):0.
00203,((((((0.0:0.09367,(0.0:0.09249,0.0:0.09249):0.00118):0.00512,0.0:0.09879):0.00942,((
0.0:0.09933,0.0:0.09933):0.00702,(0.0:0.08102,0.0:0.08102):0.02533):0.00185):0.00053,((0.
0:0.08313,0.0:0.08313):0.01927,(0.0:0.0883,0.0:0.0883):0.0141):0.00634):0.00064,0.0:0.109
38):0.00059,(((0.0:0.08446,0.0:0.08446):0.01277,(0.0:0.08911,0.0:0.08911):0.00813):0.0118
5,(((0.0:0.08048,0.0:0.08048):0.01638,(0.0:0.06765,0.0:0.06765):0.02921):0.00533,0.0:0.102
19):0.0069):0.00088):0.00023):0.00118,((0.0:0.1063,0.0:0.1063):0.00303,((((0.0:0.09459,(0.
0:0.08947,0.0:0.08947):0.00513):0.0103,(0.0:0.09104,0.0:0.09104):0.01385):0.00313,0.0:0.1
0802):0.00084,((0.0:0.09927,0.0:0.09927):0.00591,0.0:0.10518):0.00368):0.00046):0.00205)
:0.00066,0.0:0.11203):0.00026,((0.0:0.10296,0.0:0.10296):0.00875,0.0:0.11171):0.00057):0.
00043,0.0:0.11271):0.00097,((((0.0:0.0907,0.0:0.0907):0.00896,0.0:0.09966):0.00503,0.0:0.1
0469):0.00639,0.0:0.11108):0.00261):0.0005,0.0:0.11419):0.00175,0.0:0.11594):0.00074,0.0
:0.11668):0.00052,0.0:0.1172):0.00081,((0.0:0.09121,0.0:0.09121):0.00833,0.0:0.09954):0.0
1848):0.00183,((0.0:0.07787,(0.0:0.07332,0.0:0.07332):0.00455):0.02259,(0.0:0.09739,(0.0:
0.08185,0.0:0.08185):0.01554):0.00307):0.01939):0.00152,0.0:0.12137):0.0008,0.0:0.12217)
:0.00112,0.0:0.12328):0.00028,0.0:0.12356):0.00104,((0.0:0.11603,0.0:0.11603):0.00724,0.0
:0.12327):0.00134):0.00037,0.0:0.12498):0.00021,0.0:0.12519):0.00045,0.0:0.12564):0.0027
7,0.0:0.12841):0.00003,0.0:0.12843):0.00174,0.0:0.13018):0.00049,((0.0:0.1125,0.0:0.1125):
0.01473,0.0:0.12723):0.00344):0.00104,(((0.0:0.10859,0.0:0.10859):0.00133,(0.0:0.09218,0.
0:0.09218):0.01773):0.0082,0.0:0.11812):0.01359):0.00004,0.0:0.13174):0.00055,(((0.0:0.12
664,(0.0:0.10574,0.0:0.10574):0.0209):0.00316,(0.0:0.12228,0.0:0.12228):0.00752):0.00044,
0.0:0.13024):0.00205):0.00007,(0.0:0.08845,0.0:0.08845):0.0439):0.00015,(((0.0:0.10507,0.
0:0.10507):0.02344,0.0:0.12852):0.00274,(((((0.0:0.12282,((((((0.0:0.09173,0.0:0.09173):0.0
1309,0.0:0.10482):0.00136,0.0:0.10619):0.00937,0.0:0.11555):0.00277,0.0:0.11832):0.00297
,0.0:0.12129):0.00153):0.00011,0.0:0.12293):0.00084,((0.0:0.10328,0.0:0.10328):0.02003,(0.
0:0.10957,0.0:0.10957):0.01374):0.00045):0.00499,0.0:0.12876):0.00145,0.0:0.13021):0.001
04):0.00125):0.00009,0.0:0.13259):0.00066,(0.0:0.06204,0.0:0.06204):0.07121):0.00007,0.0:
0.13332):0.00027,0.0:0.13359):0.00027,(0.0:0.11596,0.0:0.11596):0.0179):0.0001,(((((0.0:0.
10673,0.0:0.10673):0.00493,0.0:0.11166):0.01013,(((0.0:0.06145,0.0:0.06145):0.01551,(0.0:
0.07528,0.0:0.07528):0.00168):0.03447,0.0:0.11142):0.01037):0.0066,(((0.0:0.09802,0.0:0.0
9802):0.01665,0.0:0.11467):0.00208,0.0:0.11675):0.01165):0.00189,0.0:0.13029):0.00368):0
.00017,0.0:0.13414):0.00008,0.0:0.13422):0.00024,(((0.0:0.11379,0.0:0.11379):0.00091,0.0:
0.1147):0.01783,0.0:0.13253):0.00193):0.00048,0.0:0.13494):0.0009,(0.0:0.13076,0.0:0.130
76):0.00508):0.00001,0.0:0.13585):0.00031,(((0.0:0.08817,0.0:0.08817):0.03042,0.0:0.11858
):0.01651,0.0:0.13509):0.00107):0.0005,(((0.0:0.07923,0.0:0.07923):0.04138,0.0:0.12061):0.
00565,((((0.0:0.08721,0.0:0.08721):0.01942,0.0:0.10663):0.0001,0.0:0.10673):0.01725,0.0:0.
12397):0.00228):0.01041):0.00013,0.0:0.13679):0.00076,0.0:0.13755):0.00081,(0.0:0.10619,
0.0:0.10619):0.03217):0.00001,(((0.0:0.06981,0.0:0.06981):0.00908,0.0:0.07889):0.0537,((0.
0:0.08438,0.0:0.08438):0.01431,0.0:0.0987):0.03389):0.00578):0.00029,0.0:0.13866):0.0009
3,0.0:0.13959):0.00034,0.0:0.13993):0.00033,0.0:0.14027):0.00009,0.0:0.14036):0.00173,0.0
:0.14209):0.00011,0.0:0.1422):0.00043,0.0:0.14264):0.00012,0.0:0.14276):0.00066,0.0:0.143
42):0.00045,((((0.0:0.11735,0.0:0.11735):0.00168,0.0:0.11903):0.01472,0.0:0.13375):0.0016
2,0.0:0.13537):0.0085):0.00036,(0.0:0.13414,0.0:0.13414):0.01008):0.00104,((0.0:0.11393,0.
0:0.11393):0.02814,0.0:0.14206):0.0032):0.00001,0.0:0.14527):0.00013,0.0:0.1454):0.00034
,0.0:0.14574):0.00083,0.0:0.14657):0.00008,0.0:0.14665):0.00112,0.0:0.14777):0.00024,0.0:
0.148):0.00014,(0.0:0.1213,0.0:0.1213):0.02685):0.00081,((0.0:0.10356,0.0:0.10356):0.0198
3,0.0:0.12339):0.02557):0.00014):0.00012,(0.0:0.14704,0.0:0.14704):0.00217):0.00048,0.0:0
.14969):0.00049,0.0:0.15019):0.00005,0.0:0.15024):0.00037,0.0:0.15061):0.00002,(0.0:0.14
482,0.0:0.14482):0.00581):0.00119,0.0:0.15182):0.00012,((0.0:0.13597,0.0:0.13597):0.0106
1,0.0:0.14659):0.00535):0.00002,0.0:0.15196):0.00033,0.0:0.15228):0.00001,0.0:0.15229):0.
0004,0.0:0.1527):0.00019,(0.0:0.13877,0.0:0.13877):0.01412):0.00027,0.0:0.15316):0.00027
,0.0:0.15343):0.00035,0.0:0.15379):0.00006,((0.0:0.12172,0.0:0.12172):0.01057,0.0:0.13229
):0.02156):0.00272,0.0:0.15657):0.00014,((0.0:0.12691,(0.0:0.11008,0.0:0.11008):0.01682):
0.02878,0.0:0.15569):0.00102):0.00107,((0.0:0.13458,0.0:0.13458):0.01064,0.0:0.14523):0.0
1255):0.00008,0.0:0.15786):0.00003,0.0:0.15789):0.00046,0.0:0.15834):0,0.0:0.15835):0.00
038,((((0.0:0.10571,0.0:0.10571):0.03476,0.0:0.14047):0.01089,0.0:0.15136):0.00143,0.0:0.1
5279):0.00594):0.00041,0.0:0.15914):0.0002,(0.0:0.15804,((0.0:0.1349,0.0:0.1349):0.00063,
0.0:0.13553):0.02251):0.00129):0.00044,0.0:0.15978):0.00045,0.0:0.16022):0.00002,0.0:0.1
6024):0.00017,0.0:0.16041):0.00009,0.0:0.1605):0.0005,0.0:0.161):0.00007,0.0:0.16107):0.0
0057,(((((0.0:0.12518,((0.0:0.10236,0.0:0.10236):0.01969,0.0:0.12204):0.00314):0.01249,0.0
:0.13767):0.00522,(0.0:0.11337,0.0:0.11337):0.02952):0.00949,0.0:0.15238):0.00424,0.0:0.1
5662):0.00502):0.00027,0.0:0.16192):0.00002,(0.0:0.12034,0.0:0.12034):0.0416):0.00191,0.
0:0.16384):0.0007,((0.0:0.15793,0.0:0.15793):0.00276,0.0:0.1607):0.00385):0.00088,(0.0:0.1
3835,0.0:0.13835):0.02707):0.00037,0.0:0.16579):0.00004,0.0:0.16583):0.00004,0.0:0.16588
):0.00065,0.0:0.16653):0.00004,0.0:0.16657):0.0024,0.0:0.16897):0.00087,0.0:0.16984):0.00
082,0.0:0.17066):0.00002,0.0:0.17068):0.00054,0.0:0.17121):0.00021,((0.0:0.06936,0.0:0.06
936):0.0717,0.0:0.14107):0.03036):0.00066,0.0:0.17208):0.00012,0.0:0.17221):0.00025,0.0:
0.17245):0.00017,(0.0:0.16416,0.0:0.16416):0.00845):0.00062,0.0:0.17324):0.00081,0.0:0.1
7405):0.00002,0.0:0.17407):0.00174,0.0:0.17582):0.00045,(0.0:0.15704,0.0:0.15704):0.0192
2):0.00028,0.0:0.17654):0.00001,((0.0:0.13907,0.0:0.13907):0.03228,0.0:0.17135):0.0052):0.
00025,0.0:0.17681):0.00058,(0.0:0.13461,0.0:0.13461):0.04278):0.00001,0.0:0.1774):0.0004
7,0.0:0.17787):0.00002,0.0:0.17789):0.00255,0.0:0.18044):0.00129,0.0:0.18174):0.00028,0.0
:0.18202):0.00006,(((0.0:0.1531,(0.0:0.13852,0.0:0.13852):0.01458):0.00765,0.0:0.16075):0.
01721,0.0:0.17796):0.00412):0.00226,0.0:0.18434):0.00085,0.0:0.1852):0.00033,0.0:0.18553
):0.00019,(0.0:0.13246,0.0:0.13246):0.05326):0.00088,0.0:0.18659):0.00099,(0.0:0.18585,(0.
0:0.13555,0.0:0.13555):0.0503):0.00173):0.00004,0.0:0.18762):0.0002,0.0:0.18782):0.0002,
0.0:0.18803):0.00046,((0.0:0.15073,0.0:0.15073):0.01972,0.0:0.17046):0.01803):0.00001,0.0
:0.1885):0.00028,0.0:0.18878):0.00158,(((0.0:0.13908,0.0:0.13908):0.00637,((0.0:0.10893,0.
0:0.10893):0.0035,0.0:0.11243):0.03302):0.02515,((0.0:0.14117,0.0:0.14117):0.00027,0.0:0.
14144):0.02917):0.01976):0.00024,0.0:0.19061):0.00002,((0.0:0.17797,0.0:0.17797):0.00341
,0.0:0.18138):0.00925):0.0005,((0.0:0.17008,0.0:0.17008):0.01892,0.0:0.18899):0.00213):0.0
0258,(0.0:0.19123,(0.0:0.13139,0.0:0.13139):0.05984):0.00248):0.00029,0.0:0.19399):0.000
84,0.0:0.19483):0.00095,0.0:0.19578):0.00069,0.0:0.19647):0.00041,(0.0:0.17397,0.0:0.1739
7):0.02292):0.00121,0.0:0.1981):0.0002,(0.0:0.15606,0.0:0.15606):0.04223):0.00024,0.0:0.1
9854):0.00024,0.0:0.19878):0.00094,0.0:0.19972):0.00007,0.0:0.19979):0.00162,0.0:0.20141
):0.0016,0.0:0.20301):0.00022,(0.0:0.19771,(0.0:0.18243,0.0:0.18243):0.01528):0.00551):0.0
003,(((0.0:0.16452,0.0:0.16452):0.03227,((((0.0:0.16675,0.0:0.16675):0.00187,0.0:0.16862):
0.00353,0.0:0.17215):0.00299,0.0:0.17514):0.02165):0.00126,0.0:0.19805):0.00548):0.0001
2,0.0:0.20365):0.00149,((0.0:0.12373,0.0:0.12373):0.03739,(0.0:0.08649,0.0:0.08649):0.074
63):0.04401):0.00211,0.0:0.20724):0.00029,0.0:0.20754):0.00055,0.0:0.20808):0.0009,0.0:0.
20899):0.00072,((0.0:0.15096,(0.0:0.1181,0.0:0.1181):0.03286):0.02257,0.0:0.17354):0.0361
7):0.00287,(0.0:0.1976,0.0:0.1976):0.01497):0.00039,0.0:0.21296):0.0007,0.0:0.21366):0.00
156,(0.0:0.18533,0.0:0.18533):0.02988):0.00044,0.0:0.21565):0.00135,0.0:0.21701):0.00074
,(0.0:0.21419,0.0:0.21419):0.00355):0.00023,0.0:0.21798):0.00004,0.0:0.21802):0.00008,(0.
0:0.21715,0.0:0.21715):0.00095):0.0003,0.0:0.2184):0.00064,0.0:0.21904):0.00082,((0.0:0.1
8749,0.0:0.18749):0.02331,(0.0:0.18397,0.0:0.18397):0.02684):0.00905):0.002,(0.0:0.20591,
0.0:0.20591):0.01595):0.00172,0.0:0.22359):0.00014,0.0:0.22372):0.00175,0.0:0.22547):0.0
0266,0.0:0.22813):0.0005,0.0:0.22864):0.00087,(0.0:0.2208,0.0:0.2208):0.00871):0.00339,0.
0:0.2329):0.00003,0.0:0.23293):0.00054,((((((((0.0:0.20075,0.0:0.20075):0.00119,0.0:0.2019
4):0.00451,0.0:0.20646):0.00792,(0.0:0.16895,0.0:0.16895):0.04542):0.00045,0.0:0.21483):0
.00646,0.0:0.22129):0.00185,0.0:0.22314):0.00732,0.0:0.23047):0.00301):0.00165,0.0:0.235
12):0.00172,0.0:0.23684):0.00043,0.0:0.23726):0.00011,0.0:0.23737):0.00081,(0.0:0.20574,
0.0:0.20574):0.03245):0.00012,0.0:0.23831):0.00217,0.0:0.24048):0.00023,0.0:0.2407):0.00
094,0.0:0.24164):0.00019,0.0:0.24184):0.00024,0.0:0.24208):0.00006,0.0:0.24214):0.00089,
0.0:0.24303):0.00278,0.0:0.24581):0.0008,0.0:0.24662):0.0013,0.0:0.24792):0.00187,(0.0:0.
09402,0.0:0.09402):0.15577):0.00068,0.0:0.25047):0.00108,0.0:0.25155):0.00669,0.0:0.2582
5):0.00237,(0.0:0.2179,0.0:0.2179):0.04271):0.00167,0.0:0.26228):0.01979,(0.0:0.19295,0.0:
0.19295):0.08912):0.00195,0.0:0.28402):0.00419,(0.0:0.2702,((0.0:0.19971,((0.0:0.12943,0.
0:0.12943):0.04095,0.0:0.17038):0.02933):0.01983,(((0.0:0.11078,0.0:0.11078):0.02956,0.0:
0.14033):0.01585,0.0:0.15618):0.06336):0.05066):0.01802):0.00131,(0.0:0.2518,0.0:0.2518):
0.03772):0.00227,0.0:0.29179):0.00139,0.0:0.29318):0.00039,0.0:0.29357):0.00252,0.0:0.29
609):0.00043,0.0:0.29652):0.00169,0.0:0.29821):0.00417,0.0:0.30239):0.00032,0.0:0.3027):
0.00099,0.0:0.30369):0.00631,0.0:0.31):0.00021,0.0:0.31021):0.00031,0.0:0.31053):0.01054,
(0.0:0.23031,(0.0:0.17882,0.0:0.17882):0.05149):0.09076):0.00359,(0.0:0.32171,0.0:0.32171
):0.00294):0.00615,0.0:0.3308):0.0008,0.0:0.3316):0.00067,0.0:0.33227):0.00144,((0.0:0.095
24,0.0:0.09524):0.15936,((0.0:0.08499,(0.0:0.0729,0.0:0.0729):0.01209):0.06072,0.0:0.1457
1):0.10889):0.07912):0.00578,0.0:0.3395):0.00038,0.0:0.33988):0.00881,(0.0:0.3111,0.0:0.3
111):0.03759):0.02823,0.0:0.37692):0.00744,0.0:0.38436):0.0335,0.0:0.41787):0.02278,0.0:
0.44065):0.00258,0.0:0.44323):0.00159,(0.0:0.43333,0.0:0.43333):0.01148):0.00063,0.0:0.4
4544):0.00154,0.0:0.44698):0.05166,0.0:0.49865):0.27751,0.0:0.77616)

Time taken to build model (full training data) : 2.18 seconds

=== Model and evaluation on training set ===

Clustered Instances
0 268 ( 35%)
1 500 ( 65%)
6. Aim:
To apply the DBSCAN algorithm on a numeric dataset and estimate cluster quality.

DBSCAN Algorithm:
DBSCAN is one of the most common clustering algorithms and among the most cited in the
scientific literature. It is a well-known density-based clustering algorithm that is commonly
used in data mining and machine learning. Given a set of points, DBSCAN requires two
parameters: eps (the neighbourhood radius) and minPoints (the minimum number of points
required to form a dense region).
Parameter estimation is a problem for every data mining task. To choose good
parameters we need to understand how they are used and have at least some prior
knowledge about the data set that will be used.

DBSCAN groups together points that are close to each other based on a distance
measure and a minimum number of points. It also marks as outliers the points that lie in
low-density regions. DBSCAN is useful for finding associations and structures in data that
are hard to find manually but that can be relevant for discovering patterns and predicting
trends.
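The grouping behaviour described above can be sketched as a minimal DBSCAN in Python. This is an illustrative implementation on toy 1-D data, not Weka's code; the function and variable names are assumptions:

```python
# Minimal DBSCAN sketch (illustrative, not the Weka implementation).
# eps is the neighbourhood radius; min_points is the density threshold.
def dbscan(points, eps, min_points):
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        return [j for j in range(len(points)) if abs(points[i] - points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_points:
            labels[i] = NOISE          # low-density point: mark as outlier
            continue
        labels[i] = cluster
        queue = list(seeds)
        while queue:                   # expand the cluster from core points
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_points:
                queue.extend(more)
        cluster += 1
    return labels

# Two dense groups plus one isolated outlier.
print(dbscan([1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0], eps=0.3, min_points=3))
# → [0, 0, 0, 1, 1, 1, -1]
```

The two dense groups receive cluster labels 0 and 1, while the isolated point at 9.0 has no eps-neighbourhood with min_points members and is labelled noise (-1).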

Procedure:

STEP 1: Choose the data file to be clustered.

STEP 2: Then select the Cluster option in the menu.
STEP 3: Choose the clustering algorithm.
STEP 4: Select the DBSCAN algorithm and then click Start.
STEP 5: The data file is then processed and the result is displayed in the cluster output.

Output:

=== Run information ===

Scheme: weka.clusterers.DBSCAN -E 0.9 -M 6 -A "weka.core.EuclideanDistance -R first-last"

Relation: pima_diabetes

Instances: 768

Attributes: 9

preg

plas

pres

skin

insu
mass

pedi

age

class

Test mode: evaluate on training data

=== Clustering model (full training set) ===

DBSCAN clustering results

========================================================================================

Clustered DataObjects: 768

Number of attributes: 9

Epsilon: 0.9; minPoints: 6

Distance-type:

Number of generated clusters: 2

Elapsed time: .27

( 0.) 6,148,72,35,0,33.6,0.627,50,tested_positive --> 0

( 1.) 1,85,66,29,0,26.6,0.351,31,tested_negative --> 1

( 2.) 8,183,64,0,0,23.3,0.672,32,tested_positive --> 0

( 3.) 1,89,66,23,94,28.1,0.167,21,tested_negative --> 1

( 4.) 0,137,40,35,168,43.1,2.288,33,tested_positive --> 0

( 5.) 5,116,74,0,0,25.6,0.201,30,tested_negative --> 1

( 6.) 3,78,50,32,88,31,0.248,26,tested_positive --> 0

( 7.) 10,115,0,0,0,35.3,0.134,29,tested_negative --> 1

( 8.) 2,197,70,45,543,30.5,0.158,53,tested_positive --> 0

( 9.) 8,125,96,0,0,0,0.232,54,tested_positive --> 0

( 10.) 4,110,92,0,0,37.6,0.191,30,tested_negative --> 1

( 11.) 10,168,74,0,0,38,0.537,34,tested_positive --> 0

( 12.) 10,139,80,0,0,27.1,1.441,57,tested_negative --> 1

( 13.) 1,189,60,23,846,30.1,0.398,59,tested_positive --> 0

( 14.) 5,166,72,19,175,25.8,0.587,51,tested_positive --> 0

( 15.) 7,100,0,0,0,30,0.484,32,tested_positive --> 0

( 16.) 0,118,84,47,230,45.8,0.551,31,tested_positive --> 0


( 17.) 7,107,74,0,0,29.6,0.254,31,tested_positive --> 0

( 18.) 1,103,30,38,83,43.3,0.183,33,tested_negative --> 1

( 19.) 1,115,70,30,96,34.6,0.529,32,tested_positive --> 0

( 20.) 3,126,88,41,235,39.3,0.704,27,tested_negative --> 1

( 21.) 8,99,84,0,0,35.4,0.388,50,tested_negative --> 1

( 22.) 7,196,90,0,0,39.8,0.451,41,tested_positive --> 0

( 23.) 9,119,80,35,0,29,0.263,29,tested_positive --> 0

( 24.) 11,143,94,33,146,36.6,0.254,51,tested_positive --> 0

( 25.) 10,125,70,26,115,31.1,0.205,41,tested_positive --> 0

( 26.) 7,147,76,0,0,39.4,0.257,43,tested_positive --> 0

( 27.) 1,97,66,15,140,23.2,0.487,22,tested_negative --> 1

( 28.) 13,145,82,19,110,22.2,0.245,57,tested_negative --> 1

( 29.) 5,117,92,0,0,34.1,0.337,38,tested_negative --> 1

( 30.) 5,109,75,26,0,36,0.546,60,tested_negative --> 1

( 31.) 3,158,76,36,245,31.6,0.851,28,tested_positive --> 0

( 32.) 3,88,58,11,54,24.8,0.267,22,tested_negative --> 1

( 33.) 6,92,92,0,0,19.9,0.188,28,tested_negative --> 1

( 34.) 10,122,78,31,0,27.6,0.512,45,tested_negative --> 1

( 35.) 4,103,60,33,192,24,0.966,33,tested_negative --> 1

( 36.) 11,138,76,0,0,33.2,0.42,35,tested_negative --> 1

( 37.) 9,102,76,37,0,32.9,0.665,46,tested_positive --> 0

( 38.) 2,90,68,42,0,38.2,0.503,27,tested_positive --> 0

( 39.) 4,111,72,47,207,37.1,1.39,56,tested_positive --> 0

( 40.) 3,180,64,25,70,34,0.271,26,tested_negative --> 1

( 41.) 7,133,84,0,0,40.2,0.696,37,tested_negative --> 1

( 42.) 7,106,92,18,0,22.7,0.235,48,tested_negative --> 1

( 43.) 9,171,110,24,240,45.4,0.721,54,tested_positive --> 0

( 44.) 7,159,64,0,0,27.4,0.294,40,tested_negative --> 1

( 45.) 0,180,66,39,0,42,1.893,25,tested_positive --> 0

( 46.) 1,146,56,0,0,29.7,0.564,29,tested_negative --> 1

( 47.) 2,71,70,27,0,28,0.586,22,tested_negative --> 1


( 48.) 7,103,66,32,0,39.1,0.344,31,tested_positive --> 0

( 49.) 7,105,0,0,0,0,0.305,24,tested_negative --> 1

( 50.) 1,103,80,11,82,19.4,0.491,22,tested_negative --> 1

( 51.) 1,101,50,15,36,24.2,0.526,26,tested_negative --> 1

( 52.) 5,88,66,21,23,24.4,0.342,30,tested_negative --> 1

( 53.) 8,176,90,34,300,33.7,0.467,58,tested_positive --> 0

( 54.) 7,150,66,42,342,34.7,0.718,42,tested_negative --> 1

( 55.) 1,73,50,10,0,23,0.248,21,tested_negative --> 1

( 56.) 7,187,68,39,304,37.7,0.254,41,tested_positive --> 0

( 57.) 0,100,88,60,110,46.8,0.962,31,tested_negative --> 1

( 58.) 0,146,82,0,0,40.5,1.781,44,tested_negative --> 1

( 59.) 0,105,64,41,142,41.5,0.173,22,tested_negative --> 1

( 60.) 2,84,0,0,0,0,0.304,21,tested_negative --> 1

( 61.) 8,133,72,0,0,32.9,0.27,39,tested_positive --> 0

( 62.) 5,44,62,0,0,25,0.587,36,tested_negative --> 1

( 63.) 2,141,58,34,128,25.4,0.699,24,tested_negative --> 1

( 64.) 7,114,66,0,0,32.8,0.258,42,tested_positive --> 0

( 65.) 5,99,74,27,0,29,0.203,32,tested_negative --> 1

( 66.) 0,109,88,30,0,32.5,0.855,38,tested_positive --> 0

( 67.) 2,109,92,0,0,42.7,0.845,54,tested_negative --> 1

( 68.) 1,95,66,13,38,19.6,0.334,25,tested_negative --> 1

( 69.) 4,146,85,27,100,28.9,0.189,27,tested_negative --> 1

( 70.) 2,100,66,20,90,32.9,0.867,28,tested_positive --> 0

( 71.) 5,139,64,35,140,28.6,0.411,26,tested_negative --> 1

( 72.) 13,126,90,0,0,43.4,0.583,42,tested_positive --> 0

( 73.) 4,129,86,20,270,35.1,0.231,23,tested_negative --> 1

( 74.) 1,79,75,30,0,32,0.396,22,tested_negative --> 1

( 75.) 1,0,48,20,0,24.7,0.14,22,tested_negative --> 1

( 76.) 7,62,78,0,0,32.6,0.391,41,tested_negative --> 1

( 77.) 5,95,72,33,0,37.7,0.37,27,tested_negative --> 1

( 78.) 0,131,0,0,0,43.2,0.27,26,tested_positive --> 0


( 79.) 2,112,66,22,0,25,0.307,24,tested_negative --> 1

( 80.) 3,113,44,13,0,22.4,0.14,22,tested_negative --> 1

( 81.) 2,74,0,0,0,0,0.102,22,tested_negative --> 1

( 82.) 7,83,78,26,71,29.3,0.767,36,tested_negative --> 1

( 83.) 0,101,65,28,0,24.6,0.237,22,tested_negative --> 1

( 84.) 5,137,108,0,0,48.8,0.227,37,tested_positive --> 0

( 85.) 2,110,74,29,125,32.4,0.698,27,tested_negative --> 1

( 86.) 13,106,72,54,0,36.6,0.178,45,tested_negative --> 1

( 87.) 2,100,68,25,71,38.5,0.324,26,tested_negative --> 1

( 88.) 15,136,70,32,110,37.1,0.153,43,tested_positive --> 0

( 89.) 1,107,68,19,0,26.5,0.165,24,tested_negative --> 1

( 90.) 1,80,55,0,0,19.1,0.258,21,tested_negative --> 1

( 91.) 4,123,80,15,176,32,0.443,34,tested_negative --> 1

( 92.) 7,81,78,40,48,46.7,0.261,42,tested_negative --> 1

( 93.) 4,134,72,0,0,23.8,0.277,60,tested_positive --> 0

( 94.) 2,142,82,18,64,24.7,0.761,21,tested_negative --> 1

( 95.) 6,144,72,27,228,33.9,0.255,40,tested_negative --> 1

( 96.) 2,92,62,28,0,31.6,0.13,24,tested_negative --> 1

( 97.) 1,71,48,18,76,20.4,0.323,22,tested_negative --> 1

( 98.) 6,93,50,30,64,28.7,0.356,23,tested_negative --> 1

( 99.) 1,122,90,51,220,49.7,0.325,31,tested_positive --> 0

(100.) 1,163,72,0,0,39,1.222,33,tested_positive --> 0

(101.) 1,151,60,0,0,26.1,0.179,22,tested_negative --> 1

(102.) 0,125,96,0,0,22.5,0.262,21,tested_negative --> 1

(103.) 1,81,72,18,40,26.6,0.283,24,tested_negative --> 1

(104.) 2,85,65,0,0,39.6,0.93,27,tested_negative --> 1

(105.) 1,126,56,29,152,28.7,0.801,21,tested_negative --> 1

(106.) 1,96,122,0,0,22.4,0.207,27,tested_negative --> 1

(107.) 4,144,58,28,140,29.5,0.287,37,tested_negative --> 1

(108.) 3,83,58,31,18,34.3,0.336,25,tested_negative --> 1

(109.) 0,95,85,25,36,37.4,0.247,24,tested_positive --> 0


(110.) 3,171,72,33,135,33.3,0.199,24,tested_positive --> 0

(111.) 8,155,62,26,495,34,0.543,46,tested_positive --> 0

(112.) 1,89,76,34,37,31.2,0.192,23,tested_negative --> 1

(113.) 4,76,62,0,0,34,0.391,25,tested_negative --> 1

(114.) 7,160,54,32,175,30.5,0.588,39,tested_positive --> 0

(115.) 4,146,92,0,0,31.2,0.539,61,tested_positive --> 0

(116.) 5,124,74,0,0,34,0.22,38,tested_positive --> 0

(117.) 5,78,48,0,0,33.7,0.654,25,tested_negative --> 1

(118.) 4,97,60,23,0,28.2,0.443,22,tested_negative --> 1

(119.) 4,99,76,15,51,23.2,0.223,21,tested_negative --> 1

(120.) 0,162,76,56,100,53.2,0.759,25,tested_positive --> 0

(121.) 6,111,64,39,0,34.2,0.26,24,tested_negative --> 1

(122.) 2,107,74,30,100,33.6,0.404,23,tested_negative --> 1

(123.) 5,132,80,0,0,26.8,0.186,69,tested_negative --> 1

(124.) 0,113,76,0,0,33.3,0.278,23,tested_positive --> 0

(125.) 1,88,30,42,99,55,0.496,26,tested_positive --> 0

(126.) 3,120,70,30,135,42.9,0.452,30,tested_negative --> 1

(127.) 1,118,58,36,94,33.3,0.261,23,tested_negative --> 1

(128.) 1,117,88,24,145,34.5,0.403,40,tested_positive --> 0

(129.) 0,105,84,0,0,27.9,0.741,62,tested_positive --> 0

(130.) 4,173,70,14,168,29.7,0.361,33,tested_positive --> 0

(131.) 9,122,56,0,0,33.3,1.114,33,tested_positive --> 0

(132.) 3,170,64,37,225,34.5,0.356,30,tested_positive --> 0

(133.) 8,84,74,31,0,38.3,0.457,39,tested_negative --> 1

(134.) 2,96,68,13,49,21.1,0.647,26,tested_negative --> 1

(135.) 2,125,60,20,140,33.8,0.088,31,tested_negative --> 1

(136.) 0,100,70,26,50,30.8,0.597,21,tested_negative --> 1

(137.) 0,93,60,25,92,28.7,0.532,22,tested_negative --> 1

(138.) 0,129,80,0,0,31.2,0.703,29,tested_negative --> 1

(139.) 5,105,72,29,325,36.9,0.159,28,tested_negative --> 1

(140.) 3,128,78,0,0,21.1,0.268,55,tested_negative --> 1


(141.) 5,106,82,30,0,39.5,0.286,38,tested_negative --> 1

(142.) 2,108,52,26,63,32.5,0.318,22,tested_negative --> 1

(143.) 10,108,66,0,0,32.4,0.272,42,tested_positive --> 0

(144.) 4,154,62,31,284,32.8,0.237,23,tested_negative --> 1

(145.) 0,102,75,23,0,0,0.572,21,tested_negative --> 1

(146.) 9,57,80,37,0,32.8,0.096,41,tested_negative --> 1

(147.) 2,106,64,35,119,30.5,1.4,34,tested_negative --> 1

(148.) 5,147,78,0,0,33.7,0.218,65,tested_negative --> 1

(149.) 2,90,70,17,0,27.3,0.085,22,tested_negative --> 1

(150.) 1,136,74,50,204,37.4,0.399,24,tested_negative --> 1

(151.) 4,114,65,0,0,21.9,0.432,37,tested_negative --> 1

(152.) 9,156,86,28,155,34.3,1.189,42,tested_positive --> 0

(153.) 1,153,82,42,485,40.6,0.687,23,tested_negative --> 1

(154.) 8,188,78,0,0,47.9,0.137,43,tested_positive --> 0

(155.) 7,152,88,44,0,50,0.337,36,tested_positive --> 0

(156.) 2,99,52,15,94,24.6,0.637,21,tested_negative --> 1

(157.) 1,109,56,21,135,25.2,0.833,23,tested_negative --> 1

(158.) 2,88,74,19,53,29,0.229,22,tested_negative --> 1

(159.) 17,163,72,41,114,40.9,0.817,47,tested_positive --> 0

(160.) 4,151,90,38,0,29.7,0.294,36,tested_negative --> 1

(161.) 7,102,74,40,105,37.2,0.204,45,tested_negative --> 1

(162.) 0,114,80,34,285,44.2,0.167,27,tested_negative --> 1

(163.) 2,100,64,23,0,29.7,0.368,21,tested_negative --> 1

(164.) 0,131,88,0,0,31.6,0.743,32,tested_positive --> 0

(165.) 6,104,74,18,156,29.9,0.722,41,tested_positive --> 0

(166.) 3,148,66,25,0,32.5,0.256,22,tested_negative --> 1

(167.) 4,120,68,0,0,29.6,0.709,34,tested_negative --> 1

(168.) 4,110,66,0,0,31.9,0.471,29,tested_negative --> 1

(169.) 3,111,90,12,78,28.4,0.495,29,tested_negative --> 1

(170.) 6,102,82,0,0,30.8,0.18,36,tested_positive --> 0

(171.) 6,134,70,23,130,35.4,0.542,29,tested_positive --> 0


(172.) 2,87,0,23,0,28.9,0.773,25,tested_negative --> 1

(173.) 1,79,60,42,48,43.5,0.678,23,tested_negative --> 1

(174.) 2,75,64,24,55,29.7,0.37,33,tested_negative --> 1

(175.) 8,179,72,42,130,32.7,0.719,36,tested_positive --> 0

(176.) 6,85,78,0,0,31.2,0.382,42,tested_negative --> 1

(177.) 0,129,110,46,130,67.1,0.319,26,tested_positive --> 0

(178.) 5,143,78,0,0,45,0.19,47,tested_negative --> 1

(179.) 5,130,82,0,0,39.1,0.956,37,tested_positive --> 0

(180.) 6,87,80,0,0,23.2,0.084,32,tested_negative --> 1

(181.) 0,119,64,18,92,34.9,0.725,23,tested_negative --> 1

(182.) 1,0,74,20,23,27.7,0.299,21,tested_negative --> 1

(183.) 5,73,60,0,0,26.8,0.268,27,tested_negative --> 1

(184.) 4,141,74,0,0,27.6,0.244,40,tested_negative --> 1

(185.) 7,194,68,28,0,35.9,0.745,41,tested_positive --> 0

(186.) 8,181,68,36,495,30.1,0.615,60,tested_positive --> 0

(187.) 1,128,98,41,58,32,1.321,33,tested_positive --> 0

(188.) 8,109,76,39,114,27.9,0.64,31,tested_positive --> 0

(189.) 5,139,80,35,160,31.6,0.361,25,tested_positive --> 0

(190.) 3,111,62,0,0,22.6,0.142,21,tested_negative --> 1

(191.) 9,123,70,44,94,33.1,0.374,40,tested_negative --> 1

(192.) 7,159,66,0,0,30.4,0.383,36,tested_positive --> 0

(193.) 11,135,0,0,0,52.3,0.578,40,tested_positive --> 0

(194.) 8,85,55,20,0,24.4,0.136,42,tested_negative --> 1

(195.) 5,158,84,41,210,39.4,0.395,29,tested_positive --> 0

(196.) 1,105,58,0,0,24.3,0.187,21,tested_negative --> 1

(197.) 3,107,62,13,48,22.9,0.678,23,tested_positive --> 0

(198.) 4,109,64,44,99,34.8,0.905,26,tested_positive --> 0

(199.) 4,148,60,27,318,30.9,0.15,29,tested_positive --> 0

(200.) 0,113,80,16,0,31,0.874,21,tested_negative --> 1

(201.) 1,138,82,0,0,40.1,0.236,28,tested_negative --> 1

(202.) 0,108,68,20,0,27.3,0.787,32,tested_negative --> 1


(203.) 2,99,70,16,44,20.4,0.235,27,tested_negative --> 1

(204.) 6,103,72,32,190,37.7,0.324,55,tested_negative --> 1

(205.) 5,111,72,28,0,23.9,0.407,27,tested_negative --> 1

(206.) 8,196,76,29,280,37.5,0.605,57,tested_positive --> 0

(207.) 5,162,104,0,0,37.7,0.151,52,tested_positive --> 0

(208.) 1,96,64,27,87,33.2,0.289,21,tested_negative --> 1

(209.) 7,184,84,33,0,35.5,0.355,41,tested_positive --> 0

(210.) 2,81,60,22,0,27.7,0.29,25,tested_negative --> 1

(211.) 0,147,85,54,0,42.8,0.375,24,tested_negative --> 1

(212.) 7,179,95,31,0,34.2,0.164,60,tested_negative --> 1

(213.) 0,140,65,26,130,42.6,0.431,24,tested_positive --> 0

(214.) 9,112,82,32,175,34.2,0.26,36,tested_positive --> 0

(215.) 12,151,70,40,271,41.8,0.742,38,tested_positive --> 0

(216.) 5,109,62,41,129,35.8,0.514,25,tested_positive --> 0

(217.) 6,125,68,30,120,30,0.464,32,tested_negative --> 1

(218.) 5,85,74,22,0,29,1.224,32,tested_positive --> 0

(219.) 5,112,66,0,0,37.8,0.261,41,tested_positive --> 0

(220.) 0,177,60,29,478,34.6,1.072,21,tested_positive --> 0

(221.) 2,158,90,0,0,31.6,0.805,66,tested_positive --> 0

(222.) 7,119,0,0,0,25.2,0.209,37,tested_negative --> 1

(223.) 7,142,60,33,190,28.8,0.687,61,tested_negative --> 1

(224.) 1,100,66,15,56,23.6,0.666,26,tested_negative --> 1

(225.) 1,87,78,27,32,34.6,0.101,22,tested_negative --> 1

(226.) 0,101,76,0,0,35.7,0.198,26,tested_negative --> 1

(227.) 3,162,52,38,0,37.2,0.652,24,tested_positive --> 0

(228.) 4,197,70,39,744,36.7,2.329,31,tested_negative --> 1

(229.) 0,117,80,31,53,45.2,0.089,24,tested_negative --> 1

(230.) 4,142,86,0,0,44,0.645,22,tested_positive --> 0

(231.) 6,134,80,37,370,46.2,0.238,46,tested_positive --> 0

(232.) 1,79,80,25,37,25.4,0.583,22,tested_negative --> 1

(233.) 4,122,68,0,0,35,0.394,29,tested_negative --> 1


(234.) 3,74,68,28,45,29.7,0.293,23,tested_negative --> 1

(235.) 4,171,72,0,0,43.6,0.479,26,tested_positive --> 0

(236.) 7,181,84,21,192,35.9,0.586,51,tested_positive --> 0

(237.) 0,179,90,27,0,44.1,0.686,23,tested_positive --> 0

(238.) 9,164,84,21,0,30.8,0.831,32,tested_positive --> 0

(239.) 0,104,76,0,0,18.4,0.582,27,tested_negative --> 1

(240.) 1,91,64,24,0,29.2,0.192,21,tested_negative --> 1

(241.) 4,91,70,32,88,33.1,0.446,22,tested_negative --> 1

(242.) 3,139,54,0,0,25.6,0.402,22,tested_positive --> 0

(243.) 6,119,50,22,176,27.1,1.318,33,tested_positive --> 0

(244.) 2,146,76,35,194,38.2,0.329,29,tested_negative --> 1

(245.) 9,184,85,15,0,30,1.213,49,tested_positive --> 0

(246.) 10,122,68,0,0,31.2,0.258,41,tested_negative --> 1

(247.) 0,165,90,33,680,52.3,0.427,23,tested_negative --> 1

(248.) 9,124,70,33,402,35.4,0.282,34,tested_negative --> 1

(249.) 1,111,86,19,0,30.1,0.143,23,tested_negative --> 1

(250.) 9,106,52,0,0,31.2,0.38,42,tested_negative --> 1

(251.) 2,129,84,0,0,28,0.284,27,tested_negative --> 1

(252.) 2,90,80,14,55,24.4,0.249,24,tested_negative --> 1

(253.) 0,86,68,32,0,35.8,0.238,25,tested_negative --> 1

(254.) 12,92,62,7,258,27.6,0.926,44,tested_positive --> 0

(255.) 1,113,64,35,0,33.6,0.543,21,tested_positive --> 0

(256.) 3,111,56,39,0,30.1,0.557,30,tested_negative --> 1

(257.) 2,114,68,22,0,28.7,0.092,25,tested_negative --> 1

(258.) 1,193,50,16,375,25.9,0.655,24,tested_negative --> 1

(259.) 11,155,76,28,150,33.3,1.353,51,tested_positive --> 0

(260.) 3,191,68,15,130,30.9,0.299,34,tested_negative --> 1

(261.) 3,141,0,0,0,30,0.761,27,tested_positive --> 0

(262.) 4,95,70,32,0,32.1,0.612,24,tested_negative --> 1

(263.) 3,142,80,15,0,32.4,0.2,63,tested_negative --> 1

(264.) 4,123,62,0,0,32,0.226,35,tested_positive --> 0


(265.) 5,96,74,18,67,33.6,0.997,43,tested_negative --> 1

(266.) 0,138,0,0,0,36.3,0.933,25,tested_positive --> 0

(267.) 2,128,64,42,0,40,1.101,24,tested_negative --> 1

(268.) 0,102,52,0,0,25.1,0.078,21,tested_negative --> 1

(269.) 2,146,0,0,0,27.5,0.24,28,tested_positive --> 0

(270.) 10,101,86,37,0,45.6,1.136,38,tested_positive --> 0

(271.) 2,108,62,32,56,25.2,0.128,21,tested_negative --> 1

(272.) 3,122,78,0,0,23,0.254,40,tested_negative --> 1

(273.) 1,71,78,50,45,33.2,0.422,21,tested_negative --> 1

(274.) 13,106,70,0,0,34.2,0.251,52,tested_negative --> 1

(275.) 2,100,70,52,57,40.5,0.677,25,tested_negative --> 1

(276.) 7,106,60,24,0,26.5,0.296,29,tested_positive --> 0

(277.) 0,104,64,23,116,27.8,0.454,23,tested_negative --> 1

(278.) 5,114,74,0,0,24.9,0.744,57,tested_negative --> 1

(279.) 2,108,62,10,278,25.3,0.881,22,tested_negative --> 1

(280.) 0,146,70,0,0,37.9,0.334,28,tested_positive --> 0

(281.) 10,129,76,28,122,35.9,0.28,39,tested_negative --> 1

(282.) 7,133,88,15,155,32.4,0.262,37,tested_negative --> 1

(283.) 7,161,86,0,0,30.4,0.165,47,tested_positive --> 0

(284.) 2,108,80,0,0,27,0.259,52,tested_positive --> 0

(285.) 7,136,74,26,135,26,0.647,51,tested_negative --> 1

(286.) 5,155,84,44,545,38.7,0.619,34,tested_negative --> 1

(287.) 1,119,86,39,220,45.6,0.808,29,tested_positive --> 0

(288.) 4,96,56,17,49,20.8,0.34,26,tested_negative --> 1

(289.) 5,108,72,43,75,36.1,0.263,33,tested_negative --> 1

(290.) 0,78,88,29,40,36.9,0.434,21,tested_negative --> 1

(291.) 0,107,62,30,74,36.6,0.757,25,tested_positive --> 0

(292.) 2,128,78,37,182,43.3,1.224,31,tested_positive --> 0

(293.) 1,128,48,45,194,40.5,0.613,24,tested_positive --> 0

(294.) 0,161,50,0,0,21.9,0.254,65,tested_negative --> 1

(295.) 6,151,62,31,120,35.5,0.692,28,tested_negative --> 1


(296.) 2,146,70,38,360,28,0.337,29,tested_positive --> 0

(297.) 0,126,84,29,215,30.7,0.52,24,tested_negative --> 1

(298.) 14,100,78,25,184,36.6,0.412,46,tested_positive --> 0

(299.) 8,112,72,0,0,23.6,0.84,58,tested_negative --> 1

(300.) 0,167,0,0,0,32.3,0.839,30,tested_positive --> 0

(301.) 2,144,58,33,135,31.6,0.422,25,tested_positive --> 0

(302.) 5,77,82,41,42,35.8,0.156,35,tested_negative --> 1

(303.) 5,115,98,0,0,52.9,0.209,28,tested_positive --> 0

(304.) 3,150,76,0,0,21,0.207,37,tested_negative --> 1

(305.) 2,120,76,37,105,39.7,0.215,29,tested_negative --> 1

(306.) 10,161,68,23,132,25.5,0.326,47,tested_positive --> 0

(307.) 0,137,68,14,148,24.8,0.143,21,tested_negative --> 1

(308.) 0,128,68,19,180,30.5,1.391,25,tested_positive --> 0

(309.) 2,124,68,28,205,32.9,0.875,30,tested_positive --> 0

(310.) 6,80,66,30,0,26.2,0.313,41,tested_negative --> 1

(311.) 0,106,70,37,148,39.4,0.605,22,tested_negative --> 1

(312.) 2,155,74,17,96,26.6,0.433,27,tested_positive --> 0

(313.) 3,113,50,10,85,29.5,0.626,25,tested_negative --> 1

(314.) 7,109,80,31,0,35.9,1.127,43,tested_positive --> 0

(315.) 2,112,68,22,94,34.1,0.315,26,tested_negative --> 1

(316.) 3,99,80,11,64,19.3,0.284,30,tested_negative --> 1

(317.) 3,182,74,0,0,30.5,0.345,29,tested_positive --> 0

(318.) 3,115,66,39,140,38.1,0.15,28,tested_negative --> 1

(319.) 6,194,78,0,0,23.5,0.129,59,tested_positive --> 0

(320.) 4,129,60,12,231,27.5,0.527,31,tested_negative --> 1

(321.) 3,112,74,30,0,31.6,0.197,25,tested_positive --> 0

(322.) 0,124,70,20,0,27.4,0.254,36,tested_positive --> 0

(323.) 13,152,90,33,29,26.8,0.731,43,tested_positive --> 0

(324.) 2,112,75,32,0,35.7,0.148,21,tested_negative --> 1

(325.) 1,157,72,21,168,25.6,0.123,24,tested_negative --> 1

(326.) 1,122,64,32,156,35.1,0.692,30,tested_positive --> 0


(327.) 10,179,70,0,0,35.1,0.2,37,tested_negative --> 1

(328.) 2,102,86,36,120,45.5,0.127,23,tested_positive --> 0

(329.) 6,105,70,32,68,30.8,0.122,37,tested_negative --> 1

(330.) 8,118,72,19,0,23.1,1.476,46,tested_negative --> 1

(331.) 2,87,58,16,52,32.7,0.166,25,tested_negative --> 1

(332.) 1,180,0,0,0,43.3,0.282,41,tested_positive --> 0

(333.) 12,106,80,0,0,23.6,0.137,44,tested_negative --> 1

(334.) 1,95,60,18,58,23.9,0.26,22,tested_negative --> 1

(335.) 0,165,76,43,255,47.9,0.259,26,tested_negative --> 1

(336.) 0,117,0,0,0,33.8,0.932,44,tested_negative --> 1

(337.) 5,115,76,0,0,31.2,0.343,44,tested_positive --> 0

(338.) 9,152,78,34,171,34.2,0.893,33,tested_positive --> 0

(339.) 7,178,84,0,0,39.9,0.331,41,tested_positive --> 0

(340.) 1,130,70,13,105,25.9,0.472,22,tested_negative --> 1

(341.) 1,95,74,21,73,25.9,0.673,36,tested_negative --> 1

(342.) 1,0,68,35,0,32,0.389,22,tested_negative --> 1

(343.) 5,122,86,0,0,34.7,0.29,33,tested_negative --> 1

(344.) 8,95,72,0,0,36.8,0.485,57,tested_negative --> 1

(345.) 8,126,88,36,108,38.5,0.349,49,tested_negative --> 1

(346.) 1,139,46,19,83,28.7,0.654,22,tested_negative --> 1

(347.) 3,116,0,0,0,23.5,0.187,23,tested_negative --> 1

(348.) 3,99,62,19,74,21.8,0.279,26,tested_negative --> 1

(349.) 5,0,80,32,0,41,0.346,37,tested_positive --> 0

(350.) 4,92,80,0,0,42.2,0.237,29,tested_negative --> 1

(351.) 4,137,84,0,0,31.2,0.252,30,tested_negative --> 1

(352.) 3,61,82,28,0,34.4,0.243,46,tested_negative --> 1

(353.) 1,90,62,12,43,27.2,0.58,24,tested_negative --> 1

(354.) 3,90,78,0,0,42.7,0.559,21,tested_negative --> 1

(355.) 9,165,88,0,0,30.4,0.302,49,tested_positive --> 0

(356.) 1,125,50,40,167,33.3,0.962,28,tested_positive --> 0

(357.) 13,129,0,30,0,39.9,0.569,44,tested_positive --> 0


(358.) 12,88,74,40,54,35.3,0.378,48,tested_negative --> 1

(359.) 1,196,76,36,249,36.5,0.875,29,tested_positive --> 0

(360.) 5,189,64,33,325,31.2,0.583,29,tested_positive --> 0

(361.) 5,158,70,0,0,29.8,0.207,63,tested_negative --> 1

(362.) 5,103,108,37,0,39.2,0.305,65,tested_negative --> 1

(363.) 4,146,78,0,0,38.5,0.52,67,tested_positive --> 0

(364.) 4,147,74,25,293,34.9,0.385,30,tested_negative --> 1

(365.) 5,99,54,28,83,34,0.499,30,tested_negative --> 1

(366.) 6,124,72,0,0,27.6,0.368,29,tested_positive --> 0

(367.) 0,101,64,17,0,21,0.252,21,tested_negative --> 1

(368.) 3,81,86,16,66,27.5,0.306,22,tested_negative --> 1

(369.) 1,133,102,28,140,32.8,0.234,45,tested_positive --> 0

(370.) 3,173,82,48,465,38.4,2.137,25,tested_positive --> 0

(371.) 0,118,64,23,89,0,1.731,21,tested_negative --> 1

(372.) 0,84,64,22,66,35.8,0.545,21,tested_negative --> 1

(373.) 2,105,58,40,94,34.9,0.225,25,tested_negative --> 1

(374.) 2,122,52,43,158,36.2,0.816,28,tested_negative --> 1

(375.) 12,140,82,43,325,39.2,0.528,58,tested_positive --> 0

(376.) 0,98,82,15,84,25.2,0.299,22,tested_negative --> 1

(377.) 1,87,60,37,75,37.2,0.509,22,tested_negative --> 1

(378.) 4,156,75,0,0,48.3,0.238,32,tested_positive --> 0

(379.) 0,93,100,39,72,43.4,1.021,35,tested_negative --> 1

(380.) 1,107,72,30,82,30.8,0.821,24,tested_negative --> 1

(381.) 0,105,68,22,0,20,0.236,22,tested_negative --> 1

(382.) 1,109,60,8,182,25.4,0.947,21,tested_negative --> 1

(383.) 1,90,62,18,59,25.1,1.268,25,tested_negative --> 1

(384.) 1,125,70,24,110,24.3,0.221,25,tested_negative --> 1

(385.) 1,119,54,13,50,22.3,0.205,24,tested_negative --> 1

(386.) 5,116,74,29,0,32.3,0.66,35,tested_positive --> 0

(387.) 8,105,100,36,0,43.3,0.239,45,tested_positive --> 0

(388.) 5,144,82,26,285,32,0.452,58,tested_positive --> 0


(389.) 3,100,68,23,81,31.6,0.949,28,tested_negative --> 1

(390.) 1,100,66,29,196,32,0.444,42,tested_negative --> 1

(391.) 5,166,76,0,0,45.7,0.34,27,tested_positive --> 0

(392.) 1,131,64,14,415,23.7,0.389,21,tested_negative --> 1

(393.) 4,116,72,12,87,22.1,0.463,37,tested_negative --> 1

(394.) 4,158,78,0,0,32.9,0.803,31,tested_positive --> 0

(395.) 2,127,58,24,275,27.7,1.6,25,tested_negative --> 1

(396.) 3,96,56,34,115,24.7,0.944,39,tested_negative --> 1

(397.) 0,131,66,40,0,34.3,0.196,22,tested_positive --> 0

(398.) 3,82,70,0,0,21.1,0.389,25,tested_negative --> 1

(399.) 3,193,70,31,0,34.9,0.241,25,tested_positive --> 0

(400.) 4,95,64,0,0,32,0.161,31,tested_positive --> 0

(401.) 6,137,61,0,0,24.2,0.151,55,tested_negative --> 1

(402.) 5,136,84,41,88,35,0.286,35,tested_positive --> 0

(403.) 9,72,78,25,0,31.6,0.28,38,tested_negative --> 1

(404.) 5,168,64,0,0,32.9,0.135,41,tested_positive --> 0

(405.) 2,123,48,32,165,42.1,0.52,26,tested_negative --> 1

(406.) 4,115,72,0,0,28.9,0.376,46,tested_positive --> 0

(407.) 0,101,62,0,0,21.9,0.336,25,tested_negative --> 1

(408.) 8,197,74,0,0,25.9,1.191,39,tested_positive --> 0

(409.) 1,172,68,49,579,42.4,0.702,28,tested_positive --> 0

(410.) 6,102,90,39,0,35.7,0.674,28,tested_negative --> 1

(411.) 1,112,72,30,176,34.4,0.528,25,tested_negative --> 1

(412.) 1,143,84,23,310,42.4,1.076,22,tested_negative --> 1

(413.) 1,143,74,22,61,26.2,0.256,21,tested_negative --> 1

(414.) 0,138,60,35,167,34.6,0.534,21,tested_positive --> 0

(415.) 3,173,84,33,474,35.7,0.258,22,tested_positive --> 0

(416.) 1,97,68,21,0,27.2,1.095,22,tested_negative --> 1

(417.) 4,144,82,32,0,38.5,0.554,37,tested_positive --> 0

(418.) 1,83,68,0,0,18.2,0.624,27,tested_negative --> 1

(419.) 3,129,64,29,115,26.4,0.219,28,tested_positive --> 0


(420.) 1,119,88,41,170,45.3,0.507,26,tested_negative --> 1

(421.) 2,94,68,18,76,26,0.561,21,tested_negative --> 1

(422.) 0,102,64,46,78,40.6,0.496,21,tested_negative --> 1

(423.) 2,115,64,22,0,30.8,0.421,21,tested_negative --> 1

(424.) 8,151,78,32,210,42.9,0.516,36,tested_positive --> 0

(425.) 4,184,78,39,277,37,0.264,31,tested_positive --> 0

(426.) 0,94,0,0,0,0,0.256,25,tested_negative --> 1

(427.) 1,181,64,30,180,34.1,0.328,38,tested_positive --> 0

(428.) 0,135,94,46,145,40.6,0.284,26,tested_negative --> 1

(429.) 1,95,82,25,180,35,0.233,43,tested_positive --> 0

(430.) 2,99,0,0,0,22.2,0.108,23,tested_negative --> 1

(431.) 3,89,74,16,85,30.4,0.551,38,tested_negative --> 1

(432.) 1,80,74,11,60,30,0.527,22,tested_negative --> 1

(433.) 2,139,75,0,0,25.6,0.167,29,tested_negative --> 1

(434.) 1,90,68,8,0,24.5,1.138,36,tested_negative --> 1

(435.) 0,141,0,0,0,42.4,0.205,29,tested_positive --> 0

(436.) 12,140,85,33,0,37.4,0.244,41,tested_negative --> 1

(437.) 5,147,75,0,0,29.9,0.434,28,tested_negative --> 1

(438.) 1,97,70,15,0,18.2,0.147,21,tested_negative --> 1

(439.) 6,107,88,0,0,36.8,0.727,31,tested_negative --> 1

(440.) 0,189,104,25,0,34.3,0.435,41,tested_positive --> 0

(441.) 2,83,66,23,50,32.2,0.497,22,tested_negative --> 1

(442.) 4,117,64,27,120,33.2,0.23,24,tested_negative --> 1

(443.) 8,108,70,0,0,30.5,0.955,33,tested_positive --> 0

(444.) 4,117,62,12,0,29.7,0.38,30,tested_positive --> 0

(445.) 0,180,78,63,14,59.4,2.42,25,tested_positive --> 0

(446.) 1,100,72,12,70,25.3,0.658,28,tested_negative --> 1

(447.) 0,95,80,45,92,36.5,0.33,26,tested_negative --> 1

(448.) 0,104,64,37,64,33.6,0.51,22,tested_positive --> 0

(449.) 0,120,74,18,63,30.5,0.285,26,tested_negative --> 1

(450.) 1,82,64,13,95,21.2,0.415,23,tested_negative --> 1


(451.) 2,134,70,0,0,28.9,0.542,23,tested_positive --> 0

(452.) 0,91,68,32,210,39.9,0.381,25,tested_negative --> 1

(453.) 2,119,0,0,0,19.6,0.832,72,tested_negative --> 1

(454.) 2,100,54,28,105,37.8,0.498,24,tested_negative --> 1

(455.) 14,175,62,30,0,33.6,0.212,38,tested_positive --> 0

(456.) 1,135,54,0,0,26.7,0.687,62,tested_negative --> 1

(457.) 5,86,68,28,71,30.2,0.364,24,tested_negative --> 1

(458.) 10,148,84,48,237,37.6,1.001,51,tested_positive --> 0

(459.) 9,134,74,33,60,25.9,0.46,81,tested_negative --> 1

(460.) 9,120,72,22,56,20.8,0.733,48,tested_negative --> 1

(461.) 1,71,62,0,0,21.8,0.416,26,tested_negative --> 1

(462.) 8,74,70,40,49,35.3,0.705,39,tested_negative --> 1

(463.) 5,88,78,30,0,27.6,0.258,37,tested_negative --> 1

(464.) 10,115,98,0,0,24,1.022,34,tested_negative --> 1

(465.) 0,124,56,13,105,21.8,0.452,21,tested_negative --> 1

(466.) 0,74,52,10,36,27.8,0.269,22,tested_negative --> 1

(467.) 0,97,64,36,100,36.8,0.6,25,tested_negative --> 1

(468.) 8,120,0,0,0,30,0.183,38,tested_positive --> 0

(469.) 6,154,78,41,140,46.1,0.571,27,tested_negative --> 1

(470.) 1,144,82,40,0,41.3,0.607,28,tested_negative --> 1

(471.) 0,137,70,38,0,33.2,0.17,22,tested_negative --> 1

(472.) 0,119,66,27,0,38.8,0.259,22,tested_negative --> 1

(473.) 7,136,90,0,0,29.9,0.21,50,tested_negative --> 1

(474.) 4,114,64,0,0,28.9,0.126,24,tested_negative --> 1

(475.) 0,137,84,27,0,27.3,0.231,59,tested_negative --> 1

(476.) 2,105,80,45,191,33.7,0.711,29,tested_positive --> 0

(477.) 7,114,76,17,110,23.8,0.466,31,tested_negative --> 1

(478.) 8,126,74,38,75,25.9,0.162,39,tested_negative --> 1

(479.) 4,132,86,31,0,28,0.419,63,tested_negative --> 1

(480.) 3,158,70,30,328,35.5,0.344,35,tested_positive --> 0

(481.) 0,123,88,37,0,35.2,0.197,29,tested_negative --> 1


(482.) 4,85,58,22,49,27.8,0.306,28,tested_negative --> 1

(483.) 0,84,82,31,125,38.2,0.233,23,tested_negative --> 1

(484.) 0,145,0,0,0,44.2,0.63,31,tested_positive --> 0

(485.) 0,135,68,42,250,42.3,0.365,24,tested_positive --> 0

(486.) 1,139,62,41,480,40.7,0.536,21,tested_negative --> 1

(487.) 0,173,78,32,265,46.5,1.159,58,tested_negative --> 1

(488.) 4,99,72,17,0,25.6,0.294,28,tested_negative --> 1

(489.) 8,194,80,0,0,26.1,0.551,67,tested_negative --> 1

(490.) 2,83,65,28,66,36.8,0.629,24,tested_negative --> 1

(491.) 2,89,90,30,0,33.5,0.292,42,tested_negative --> 1

(492.) 4,99,68,38,0,32.8,0.145,33,tested_negative --> 1

(493.) 4,125,70,18,122,28.9,1.144,45,tested_positive --> 0

(494.) 3,80,0,0,0,0,0.174,22,tested_negative --> 1

(495.) 6,166,74,0,0,26.6,0.304,66,tested_negative --> 1

(496.) 5,110,68,0,0,26,0.292,30,tested_negative --> 1

(497.) 2,81,72,15,76,30.1,0.547,25,tested_negative --> 1

(498.) 7,195,70,33,145,25.1,0.163,55,tested_positive --> 0

(499.) 6,154,74,32,193,29.3,0.839,39,tested_negative --> 1

(500.) 2,117,90,19,71,25.2,0.313,21,tested_negative --> 1

(501.) 3,84,72,32,0,37.2,0.267,28,tested_negative --> 1

(502.) 6,0,68,41,0,39,0.727,41,tested_positive --> 0

(503.) 7,94,64,25,79,33.3,0.738,41,tested_negative --> 1

(504.) 3,96,78,39,0,37.3,0.238,40,tested_negative --> 1

(505.) 10,75,82,0,0,33.3,0.263,38,tested_negative --> 1

(506.) 0,180,90,26,90,36.5,0.314,35,tested_positive --> 0

(507.) 1,130,60,23,170,28.6,0.692,21,tested_negative --> 1

(508.) 2,84,50,23,76,30.4,0.968,21,tested_negative --> 1

(509.) 8,120,78,0,0,25,0.409,64,tested_negative --> 1

(510.) 12,84,72,31,0,29.7,0.297,46,tested_positive --> 0

(511.) 0,139,62,17,210,22.1,0.207,21,tested_negative --> 1

(512.) 9,91,68,0,0,24.2,0.2,58,tested_negative --> 1


(513.) 2,91,62,0,0,27.3,0.525,22,tested_negative --> 1

(514.) 3,99,54,19,86,25.6,0.154,24,tested_negative --> 1

(515.) 3,163,70,18,105,31.6,0.268,28,tested_positive --> 0

(516.) 9,145,88,34,165,30.3,0.771,53,tested_positive --> 0

(517.) 7,125,86,0,0,37.6,0.304,51,tested_negative --> 1

(518.) 13,76,60,0,0,32.8,0.18,41,tested_negative --> 1

(519.) 6,129,90,7,326,19.6,0.582,60,tested_negative --> 1

(520.) 2,68,70,32,66,25,0.187,25,tested_negative --> 1

(521.) 3,124,80,33,130,33.2,0.305,26,tested_negative --> 1

(522.) 6,114,0,0,0,0,0.189,26,tested_negative --> 1

(523.) 9,130,70,0,0,34.2,0.652,45,tested_positive --> 0

(524.) 3,125,58,0,0,31.6,0.151,24,tested_negative --> 1

(525.) 3,87,60,18,0,21.8,0.444,21,tested_negative --> 1

(526.) 1,97,64,19,82,18.2,0.299,21,tested_negative --> 1

(527.) 3,116,74,15,105,26.3,0.107,24,tested_negative --> 1

(528.) 0,117,66,31,188,30.8,0.493,22,tested_negative --> 1

(529.) 0,111,65,0,0,24.6,0.66,31,tested_negative --> 1

(530.) 2,122,60,18,106,29.8,0.717,22,tested_negative --> 1

(531.) 0,107,76,0,0,45.3,0.686,24,tested_negative --> 1

(532.) 1,86,66,52,65,41.3,0.917,29,tested_negative --> 1

(533.) 6,91,0,0,0,29.8,0.501,31,tested_negative --> 1

(534.) 1,77,56,30,56,33.3,1.251,24,tested_negative --> 1

(535.) 4,132,0,0,0,32.9,0.302,23,tested_positive --> 0

(536.) 0,105,90,0,0,29.6,0.197,46,tested_negative --> 1

(537.) 0,57,60,0,0,21.7,0.735,67,tested_negative --> 1

(538.) 0,127,80,37,210,36.3,0.804,23,tested_negative --> 1

(539.) 3,129,92,49,155,36.4,0.968,32,tested_positive --> 0

(540.) 8,100,74,40,215,39.4,0.661,43,tested_positive --> 0

(541.) 3,128,72,25,190,32.4,0.549,27,tested_positive --> 0

(542.) 10,90,85,32,0,34.9,0.825,56,tested_positive --> 0

(543.) 4,84,90,23,56,39.5,0.159,25,tested_negative --> 1


(544.) 1,88,78,29,76,32,0.365,29,tested_negative --> 1

(545.) 8,186,90,35,225,34.5,0.423,37,tested_positive --> 0

(546.) 5,187,76,27,207,43.6,1.034,53,tested_positive --> 0

(547.) 4,131,68,21,166,33.1,0.16,28,tested_negative --> 1

(548.) 1,164,82,43,67,32.8,0.341,50,tested_negative --> 1

(549.) 4,189,110,31,0,28.5,0.68,37,tested_negative --> 1

(550.) 1,116,70,28,0,27.4,0.204,21,tested_negative --> 1

(551.) 3,84,68,30,106,31.9,0.591,25,tested_negative --> 1

(552.) 6,114,88,0,0,27.8,0.247,66,tested_negative --> 1

(553.) 1,88,62,24,44,29.9,0.422,23,tested_negative --> 1

(554.) 1,84,64,23,115,36.9,0.471,28,tested_negative --> 1

(555.) 7,124,70,33,215,25.5,0.161,37,tested_negative --> 1

(556.) 1,97,70,40,0,38.1,0.218,30,tested_negative --> 1

(557.) 8,110,76,0,0,27.8,0.237,58,tested_negative --> 1

(558.) 11,103,68,40,0,46.2,0.126,42,tested_negative --> 1

(559.) 11,85,74,0,0,30.1,0.3,35,tested_negative --> 1

(560.) 6,125,76,0,0,33.8,0.121,54,tested_positive --> 0

(561.) 0,198,66,32,274,41.3,0.502,28,tested_positive --> 0

(562.) 1,87,68,34,77,37.6,0.401,24,tested_negative --> 1

(563.) 6,99,60,19,54,26.9,0.497,32,tested_negative --> 1

(564.) 0,91,80,0,0,32.4,0.601,27,tested_negative --> 1

(565.) 2,95,54,14,88,26.1,0.748,22,tested_negative --> 1

(566.) 1,99,72,30,18,38.6,0.412,21,tested_negative --> 1

(567.) 6,92,62,32,126,32,0.085,46,tested_negative --> 1

(568.) 4,154,72,29,126,31.3,0.338,37,tested_negative --> 1

(569.) 0,121,66,30,165,34.3,0.203,33,tested_positive --> 0

(570.) 3,78,70,0,0,32.5,0.27,39,tested_negative --> 1

(571.) 2,130,96,0,0,22.6,0.268,21,tested_negative --> 1

(572.) 3,111,58,31,44,29.5,0.43,22,tested_negative --> 1

(573.) 2,98,60,17,120,34.7,0.198,22,tested_negative --> 1

(574.) 1,143,86,30,330,30.1,0.892,23,tested_negative --> 1


(575.) 1,119,44,47,63,35.5,0.28,25,tested_negative --> 1

(576.) 6,108,44,20,130,24,0.813,35,tested_negative --> 1

(577.) 2,118,80,0,0,42.9,0.693,21,tested_positive --> 0

(578.) 10,133,68,0,0,27,0.245,36,tested_negative --> 1

(579.) 2,197,70,99,0,34.7,0.575,62,tested_positive --> 0

(580.) 0,151,90,46,0,42.1,0.371,21,tested_positive --> 0

(581.) 6,109,60,27,0,25,0.206,27,tested_negative --> 1

(582.) 12,121,78,17,0,26.5,0.259,62,tested_negative --> 1

(583.) 8,100,76,0,0,38.7,0.19,42,tested_negative --> 1

(584.) 8,124,76,24,600,28.7,0.687,52,tested_positive --> 0

(585.) 1,93,56,11,0,22.5,0.417,22,tested_negative --> 1

(586.) 8,143,66,0,0,34.9,0.129,41,tested_positive --> 0

(587.) 6,103,66,0,0,24.3,0.249,29,tested_negative --> 1

(588.) 3,176,86,27,156,33.3,1.154,52,tested_positive --> 0

(589.) 0,73,0,0,0,21.1,0.342,25,tested_negative --> 1

(590.) 11,111,84,40,0,46.8,0.925,45,tested_positive --> 0

(591.) 2,112,78,50,140,39.4,0.175,24,tested_negative --> 1

(592.) 3,132,80,0,0,34.4,0.402,44,tested_positive --> 0

(593.) 2,82,52,22,115,28.5,1.699,25,tested_negative --> 1

(594.) 6,123,72,45,230,33.6,0.733,34,tested_negative --> 1

(595.) 0,188,82,14,185,32,0.682,22,tested_positive --> 0

(596.) 0,67,76,0,0,45.3,0.194,46,tested_negative --> 1

(597.) 1,89,24,19,25,27.8,0.559,21,tested_negative --> 1

(598.) 1,173,74,0,0,36.8,0.088,38,tested_positive --> 0

(599.) 1,109,38,18,120,23.1,0.407,26,tested_negative --> 1

(600.) 1,108,88,19,0,27.1,0.4,24,tested_negative --> 1

(601.) 6,96,0,0,0,23.7,0.19,28,tested_negative --> 1

(602.) 1,124,74,36,0,27.8,0.1,30,tested_negative --> 1

(603.) 7,150,78,29,126,35.2,0.692,54,tested_positive --> 0

(604.) 4,183,0,0,0,28.4,0.212,36,tested_positive --> 0

(605.) 1,124,60,32,0,35.8,0.514,21,tested_negative --> 1


(606.) 1,181,78,42,293,40,1.258,22,tested_positive --> 0

(607.) 1,92,62,25,41,19.5,0.482,25,tested_negative --> 1

(608.) 0,152,82,39,272,41.5,0.27,27,tested_negative --> 1

(609.) 1,111,62,13,182,24,0.138,23,tested_negative --> 1

(610.) 3,106,54,21,158,30.9,0.292,24,tested_negative --> 1

(611.) 3,174,58,22,194,32.9,0.593,36,tested_positive --> 0

(612.) 7,168,88,42,321,38.2,0.787,40,tested_positive --> 0

(613.) 6,105,80,28,0,32.5,0.878,26,tested_negative --> 1

(614.) 11,138,74,26,144,36.1,0.557,50,tested_positive --> 0

(615.) 3,106,72,0,0,25.8,0.207,27,tested_negative --> 1

(616.) 6,117,96,0,0,28.7,0.157,30,tested_negative --> 1

(617.) 2,68,62,13,15,20.1,0.257,23,tested_negative --> 1

(618.) 9,112,82,24,0,28.2,1.282,50,tested_positive --> 0

(619.) 0,119,0,0,0,32.4,0.141,24,tested_positive --> 0

(620.) 2,112,86,42,160,38.4,0.246,28,tested_negative --> 1

(621.) 2,92,76,20,0,24.2,1.698,28,tested_negative --> 1

(622.) 6,183,94,0,0,40.8,1.461,45,tested_negative --> 1

(623.) 0,94,70,27,115,43.5,0.347,21,tested_negative --> 1

(624.) 2,108,64,0,0,30.8,0.158,21,tested_negative --> 1

(625.) 4,90,88,47,54,37.7,0.362,29,tested_negative --> 1

(626.) 0,125,68,0,0,24.7,0.206,21,tested_negative --> 1

(627.) 0,132,78,0,0,32.4,0.393,21,tested_negative --> 1

(628.) 5,128,80,0,0,34.6,0.144,45,tested_negative --> 1

(629.) 4,94,65,22,0,24.7,0.148,21,tested_negative --> 1

(630.) 7,114,64,0,0,27.4,0.732,34,tested_positive --> 0

(631.) 0,102,78,40,90,34.5,0.238,24,tested_negative --> 1

(632.) 2,111,60,0,0,26.2,0.343,23,tested_negative --> 1

(633.) 1,128,82,17,183,27.5,0.115,22,tested_negative --> 1

(634.) 10,92,62,0,0,25.9,0.167,31,tested_negative --> 1

(635.) 13,104,72,0,0,31.2,0.465,38,tested_positive --> 0

(636.) 5,104,74,0,0,28.8,0.153,48,tested_negative --> 1


(637.) 2,94,76,18,66,31.6,0.649,23,tested_negative --> 1

(638.) 7,97,76,32,91,40.9,0.871,32,tested_positive --> 0

(639.) 1,100,74,12,46,19.5,0.149,28,tested_negative --> 1

(640.) 0,102,86,17,105,29.3,0.695,27,tested_negative --> 1

(641.) 4,128,70,0,0,34.3,0.303,24,tested_negative --> 1

(642.) 6,147,80,0,0,29.5,0.178,50,tested_positive --> 0

(643.) 4,90,0,0,0,28,0.61,31,tested_negative --> 1

(644.) 3,103,72,30,152,27.6,0.73,27,tested_negative --> 1

(645.) 2,157,74,35,440,39.4,0.134,30,tested_negative --> 1

(646.) 1,167,74,17,144,23.4,0.447,33,tested_positive --> 0

(647.) 0,179,50,36,159,37.8,0.455,22,tested_positive --> 0

(648.) 11,136,84,35,130,28.3,0.26,42,tested_positive --> 0

(649.) 0,107,60,25,0,26.4,0.133,23,tested_negative --> 1

(650.) 1,91,54,25,100,25.2,0.234,23,tested_negative --> 1

(651.) 1,117,60,23,106,33.8,0.466,27,tested_negative --> 1

(652.) 5,123,74,40,77,34.1,0.269,28,tested_negative --> 1

(653.) 2,120,54,0,0,26.8,0.455,27,tested_negative --> 1

(654.) 1,106,70,28,135,34.2,0.142,22,tested_negative --> 1

(655.) 2,155,52,27,540,38.7,0.24,25,tested_positive --> 0

(656.) 2,101,58,35,90,21.8,0.155,22,tested_negative --> 1

(657.) 1,120,80,48,200,38.9,1.162,41,tested_negative --> 1

(658.) 11,127,106,0,0,39,0.19,51,tested_negative --> 1

(659.) 3,80,82,31,70,34.2,1.292,27,tested_positive --> 0

(660.) 10,162,84,0,0,27.7,0.182,54,tested_negative --> 1

(661.) 1,199,76,43,0,42.9,1.394,22,tested_positive --> 0

(662.) 8,167,106,46,231,37.6,0.165,43,tested_positive --> 0

(663.) 9,145,80,46,130,37.9,0.637,40,tested_positive --> 0

(664.) 6,115,60,39,0,33.7,0.245,40,tested_positive --> 0

(665.) 1,112,80,45,132,34.8,0.217,24,tested_negative --> 1

(666.) 4,145,82,18,0,32.5,0.235,70,tested_positive --> 0

(667.) 10,111,70,27,0,27.5,0.141,40,tested_positive --> 0


(668.) 6,98,58,33,190,34,0.43,43,tested_negative --> 1

(669.) 9,154,78,30,100,30.9,0.164,45,tested_negative --> 1

(670.) 6,165,68,26,168,33.6,0.631,49,tested_negative --> 1

(671.) 1,99,58,10,0,25.4,0.551,21,tested_negative --> 1

(672.) 10,68,106,23,49,35.5,0.285,47,tested_negative --> 1

(673.) 3,123,100,35,240,57.3,0.88,22,tested_negative --> 1

(674.) 8,91,82,0,0,35.6,0.587,68,tested_negative --> 1

(675.) 6,195,70,0,0,30.9,0.328,31,tested_positive --> 0

(676.) 9,156,86,0,0,24.8,0.23,53,tested_positive --> 0

(677.) 0,93,60,0,0,35.3,0.263,25,tested_negative --> 1

(678.) 3,121,52,0,0,36,0.127,25,tested_positive --> 0

(679.) 2,101,58,17,265,24.2,0.614,23,tested_negative --> 1

(680.) 2,56,56,28,45,24.2,0.332,22,tested_negative --> 1

(681.) 0,162,76,36,0,49.6,0.364,26,tested_positive --> 0

(682.) 0,95,64,39,105,44.6,0.366,22,tested_negative --> 1

(683.) 4,125,80,0,0,32.3,0.536,27,tested_positive --> 0

(684.) 5,136,82,0,0,0,0.64,69,tested_negative --> 1

(685.) 2,129,74,26,205,33.2,0.591,25,tested_negative --> 1

(686.) 3,130,64,0,0,23.1,0.314,22,tested_negative --> 1

(687.) 1,107,50,19,0,28.3,0.181,29,tested_negative --> 1

(688.) 1,140,74,26,180,24.1,0.828,23,tested_negative --> 1

(689.) 1,144,82,46,180,46.1,0.335,46,tested_positive --> 0

(690.) 8,107,80,0,0,24.6,0.856,34,tested_negative --> 1

(691.) 13,158,114,0,0,42.3,0.257,44,tested_positive --> 0

(692.) 2,121,70,32,95,39.1,0.886,23,tested_negative --> 1

(693.) 7,129,68,49,125,38.5,0.439,43,tested_positive --> 0

(694.) 2,90,60,0,0,23.5,0.191,25,tested_negative --> 1

(695.) 7,142,90,24,480,30.4,0.128,43,tested_positive --> 0

(696.) 3,169,74,19,125,29.9,0.268,31,tested_positive --> 0

(697.) 0,99,0,0,0,25,0.253,22,tested_negative --> 1

(698.) 4,127,88,11,155,34.5,0.598,28,tested_negative --> 1


(699.) 4,118,70,0,0,44.5,0.904,26,tested_negative --> 1

(700.) 2,122,76,27,200,35.9,0.483,26,tested_negative --> 1

(701.) 6,125,78,31,0,27.6,0.565,49,tested_positive --> 0

(702.) 1,168,88,29,0,35,0.905,52,tested_positive --> 0

(703.) 2,129,0,0,0,38.5,0.304,41,tested_negative --> 1

(704.) 4,110,76,20,100,28.4,0.118,27,tested_negative --> 1

(705.) 6,80,80,36,0,39.8,0.177,28,tested_negative --> 1

(706.) 10,115,0,0,0,0,0.261,30,tested_positive --> 0

(707.) 2,127,46,21,335,34.4,0.176,22,tested_negative --> 1

(708.) 9,164,78,0,0,32.8,0.148,45,tested_positive --> 0

(709.) 2,93,64,32,160,38,0.674,23,tested_positive --> 0

(710.) 3,158,64,13,387,31.2,0.295,24,tested_negative --> 1

(711.) 5,126,78,27,22,29.6,0.439,40,tested_negative --> 1

(712.) 10,129,62,36,0,41.2,0.441,38,tested_positive --> 0

(713.) 0,134,58,20,291,26.4,0.352,21,tested_negative --> 1

(714.) 3,102,74,0,0,29.5,0.121,32,tested_negative --> 1

(715.) 7,187,50,33,392,33.9,0.826,34,tested_positive --> 0

(716.) 3,173,78,39,185,33.8,0.97,31,tested_positive --> 0

(717.) 10,94,72,18,0,23.1,0.595,56,tested_negative --> 1

(718.) 1,108,60,46,178,35.5,0.415,24,tested_negative --> 1

(719.) 5,97,76,27,0,35.6,0.378,52,tested_positive --> 0

(720.) 4,83,86,19,0,29.3,0.317,34,tested_negative --> 1

(721.) 1,114,66,36,200,38.1,0.289,21,tested_negative --> 1

(722.) 1,149,68,29,127,29.3,0.349,42,tested_positive --> 0

(723.) 5,117,86,30,105,39.1,0.251,42,tested_negative --> 1

(724.) 1,111,94,0,0,32.8,0.265,45,tested_negative --> 1

(725.) 4,112,78,40,0,39.4,0.236,38,tested_negative --> 1

(726.) 1,116,78,29,180,36.1,0.496,25,tested_negative --> 1

(727.) 0,141,84,26,0,32.4,0.433,22,tested_negative --> 1

(728.) 2,175,88,0,0,22.9,0.326,22,tested_negative --> 1

(729.) 2,92,52,0,0,30.1,0.141,22,tested_negative --> 1


(730.) 3,130,78,23,79,28.4,0.323,34,tested_positive --> 0

(731.) 8,120,86,0,0,28.4,0.259,22,tested_positive --> 0

(732.) 2,174,88,37,120,44.5,0.646,24,tested_positive --> 0

(733.) 2,106,56,27,165,29,0.426,22,tested_negative --> 1

(734.) 2,105,75,0,0,23.3,0.56,53,tested_negative --> 1

(735.) 4,95,60,32,0,35.4,0.284,28,tested_negative --> 1

(736.) 0,126,86,27,120,27.4,0.515,21,tested_negative --> 1

(737.) 8,65,72,23,0,32,0.6,42,tested_negative --> 1

(738.) 2,99,60,17,160,36.6,0.453,21,tested_negative --> 1

(739.) 1,102,74,0,0,39.5,0.293,42,tested_positive --> 0

(740.) 11,120,80,37,150,42.3,0.785,48,tested_positive --> 0

(741.) 3,102,44,20,94,30.8,0.4,26,tested_negative --> 1

(742.) 1,109,58,18,116,28.5,0.219,22,tested_negative --> 1

(743.) 9,140,94,0,0,32.7,0.734,45,tested_positive --> 0

(744.) 13,153,88,37,140,40.6,1.174,39,tested_negative --> 1

(745.) 12,100,84,33,105,30,0.488,46,tested_negative --> 1

(746.) 1,147,94,41,0,49.3,0.358,27,tested_positive --> 0

(747.) 1,81,74,41,57,46.3,1.096,32,tested_negative --> 1

(748.) 3,187,70,22,200,36.4,0.408,36,tested_positive --> 0

(749.) 6,162,62,0,0,24.3,0.178,50,tested_positive --> 0

(750.) 4,136,70,0,0,31.2,1.182,22,tested_positive --> 0

(751.) 1,121,78,39,74,39,0.261,28,tested_negative --> 1

(752.) 3,108,62,24,0,26,0.223,25,tested_negative --> 1

(753.) 0,181,88,44,510,43.3,0.222,26,tested_positive --> 0

(754.) 8,154,78,32,0,32.4,0.443,45,tested_positive --> 0

(755.) 1,128,88,39,110,36.5,1.057,37,tested_positive --> 0

(756.) 7,137,90,41,0,32,0.391,39,tested_negative --> 1

(757.) 0,123,72,0,0,36.3,0.258,52,tested_positive --> 0

(758.) 1,106,76,0,0,37.5,0.197,26,tested_negative --> 1

(759.) 6,190,92,0,0,35.5,0.278,66,tested_positive --> 0

(760.) 2,88,58,26,16,28.4,0.766,22,tested_negative --> 1


(761.) 9,170,74,31,0,44,0.403,43,tested_positive --> 0

(762.) 9,89,62,0,0,22.5,0.142,33,tested_negative --> 1

(763.) 10,101,76,48,180,32.9,0.171,63,tested_negative --> 1

(764.) 2,122,70,27,0,36.8,0.34,27,tested_negative --> 1

(765.) 5,121,72,23,112,26.2,0.245,30,tested_negative --> 1

(766.) 1,126,60,0,0,30.1,0.349,47,tested_positive --> 0

(767.) 1,93,70,31,0,30.4,0.315,23,tested_negative --> 1

Time taken to build model (full training data) : 0.27 seconds

=== Model and evaluation on training set ===

Clustered Instances

0 268 ( 35%)

1 500 ( 65%)
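The "Clustered Instances" summary above is simply the per-cluster counts over the 768 rows of the diabetes dataset; the percentages Weka prints can be verified directly:

```python
# Cluster sizes from the Weka summary above (diabetes dataset, 768 rows)
counts = {0: 268, 1: 500}
total = sum(counts.values())          # 268 + 500 = 768 instances in all
for cluster, size in counts.items():
    # round(100 * size / total) reproduces Weka's percentage column
    print(f"{cluster} {size} ({round(100 * size / total)}%)")
```

This prints `0 268 (35%)` and `1 500 (65%)`, matching the summary.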
6. Aim:
To demonstrate association rule mining on the supermarket dataset using the APRIORI
algorithm with different support and confidence thresholds.

APRIORI Algorithm:
With the quick growth of e-commerce applications, vast quantities of data accumulate in
months rather than years. Data Mining, also known as Knowledge Discovery in
Databases (KDD), is used to find anomalies, correlations, patterns, and trends in such data
and to predict outcomes.
The Apriori algorithm is a classical algorithm in data mining. It is used for mining frequent
itemsets and the association rules derived from them. It is designed to operate on a database
containing a large number of transactions, for instance, the items bought by customers in a store.
It is very important for effective Market Basket Analysis: it helps customers purchase
their items with more ease, which increases the sales of the markets. It has also
been used in the field of healthcare for the detection of adverse drug reactions (ADRs),
where it produces association rules that indicate which combinations of medications and
patient characteristics lead to ADRs.
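A minimal sketch of Apriori's first pass (support counting for single items) can be written in base R; the toy transaction list and the 0.5 support threshold below are invented for illustration and are not part of the Weka experiment:

```r
# Toy transaction database (hypothetical): each element is one market basket
transactions <- list(
  c("bread", "milk"),
  c("bread", "biscuits", "milk"),
  c("milk", "fruit"),
  c("bread", "milk", "fruit")
)

# Support of an itemset = fraction of transactions containing all of its items
support <- function(itemset, db) {
  mean(sapply(db, function(t) all(itemset %in% t)))
}

# Frequent 1-itemsets at a minimum support of 0.5 (Apriori's first pass)
items <- unique(unlist(transactions))
frequent1 <- items[sapply(items, function(i) support(i, transactions)) >= 0.5]
```

Apriori then builds candidate k-itemsets only from the frequent (k-1)-itemsets, which is what keeps the search over item combinations tractable.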

Procedure:
STEP 1: Choose the data file required for association.
STEP 2: Then select association option in the menu.
STEP 3: Choose the association algorithm.
STEP 4: Select the APRIORI algorithm and then click Start.
STEP 5: Then the data file is processed and the result is displayed in the association output.

Output:
=== Run information ===

Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1


Relation: supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
=== Associator model (full training set) ===

Apriori
=======
Minimum support: 0.15 (694 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 44


Size of set of large itemsets L(2): 380
Size of set of large itemsets L(3): 910
Size of set of large itemsets L(4): 633
Size of set of large itemsets L(5): 105
Size of set of large itemsets L(6): 1

Best rules found:

1. biscuits=t frozen foods=t fruit=t total=high 788 ==> bread and cake=t 723 <conf:(0.92)> lift:(1.27) lev:(0.03) [155] conv:(3.35)
2. baking needs=t biscuits=t fruit=t total=high 760 ==> bread and cake=t 696 <conf:(0.92)> lift:(1.27) lev:(0.03) [149] conv:(3.28)
3. baking needs=t frozen foods=t fruit=t total=high 770 ==> bread and cake=t 705 <conf:(0.92)> lift:(1.27) lev:(0.03) [150] conv:(3.27)
4. biscuits=t fruit=t vegetables=t total=high 815 ==> bread and cake=t 746 <conf:(0.92)> lift:(1.27) lev:(0.03) [159] conv:(3.26)
5. party snack foods=t fruit=t total=high 854 ==> bread and cake=t 779 <conf:(0.91)> lift:(1.27) lev:(0.04) [164] conv:(3.15)
6. biscuits=t frozen foods=t vegetables=t total=high 797 ==> bread and cake=t 725 <conf:(0.91)> lift:(1.26) lev:(0.03) [151] conv:(3.06)
7. baking needs=t biscuits=t vegetables=t total=high 772 ==> bread and cake=t 701 <conf:(0.91)> lift:(1.26) lev:(0.03) [145] conv:(3.01)
8. biscuits=t fruit=t total=high 954 ==> bread and cake=t 866 <conf:(0.91)> lift:(1.26) lev:(0.04) [179] conv:(3)
9. frozen foods=t fruit=t vegetables=t total=high 834 ==> bread and cake=t 757 <conf:(0.91)> lift:(1.26) lev:(0.03) [156] conv:(3)
10. frozen foods=t fruit=t total=high 969 ==> bread and cake=t 877 <conf:(0.91)> lift:(1.26) lev:(0.04) [179] conv:(2.92)
Aim:
To demonstrate association rule mining on the supermarket dataset using the FP-GROWTH
Algorithm with different support and confidence thresholds.

FP-GROWTH Algorithm:
The FP-Growth Algorithm is an alternative way to find frequent itemsets without candidate
generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The
core of this method is a special data structure called the frequent-pattern tree (FP-tree),
which retains the itemset association information.

In simple words, the algorithm works as follows: first it compresses the input database,
creating an FP-tree instance to represent the frequent items. It then divides this compressed
database into a set of conditional databases, each one associated with one frequent pattern.
Finally, each such database is mined separately. Using this strategy, FP-Growth reduces
search costs by looking for short patterns recursively and then concatenating them into
longer frequent patterns, offering good selectivity.

In large databases, it is not possible to hold the FP-tree in main memory. A strategy to
cope with this problem is to first partition the database into a set of smaller databases
(called projected databases), and then construct an FP-tree from each of these smaller
databases.
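The projection step described above can be sketched in base R; the tiny transaction list here is invented, and a real FP-Growth implementation would operate on the compressed FP-tree rather than on raw lists:

```r
# Toy transactions (hypothetical)
db <- list(c("a", "b", "c"), c("b", "c"), c("a", "c"), c("b"))

# Conditional (projected) database for one item: keep only the transactions
# containing the item, then drop the item itself. Patterns mined in this
# smaller database are exactly the patterns that co-occur with the item.
project <- function(db, item) {
  kept <- Filter(function(t) item %in% t, db)
  lapply(kept, function(t) setdiff(t, item))
}

proj_c <- project(db, "c")   # conditional database for item "c"
```

Mining each projected database separately is what lets the FP-tree for each piece fit in memory.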

Procedure:
STEP 1: Choose the data file required for association.
STEP 2: Then select association option in the menu.
STEP 3: Choose the association algorithm.
STEP 4: Select the FP-GROWTH algorithm and then click Start.
STEP 5: Then the data file is processed and the result is displayed in the association output.

Output:
Run information
Scheme: weka.associations.FPGrowth -P 2 -I -1 -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1
Relation: supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
Associator model (full training set)
FPGrowth found 16 rules (displaying top 10)
1. [fruit=t, frozen foods=t, biscuits=t, total=high]: 788 ==> [bread and cake=t]: 723
<conf:(0.92)> lift:(1.27) lev:(0.03) conv:(3.35)
2. [fruit=t, baking needs=t, biscuits=t, total=high]: 760 ==> [bread and cake=t]: 696
<conf:(0.92)> lift:(1.27) lev:(0.03) conv:(3.28)
3. [fruit=t, baking needs=t, frozen foods=t, total=high]: 770 ==> [bread and cake=t]: 705
<conf:(0.92)> lift:(1.27) lev:(0.03) conv:(3.27)
4. [fruit=t, vegetables=t, biscuits=t, total=high]: 815 ==> [bread and cake=t]: 746
<conf:(0.92)> lift:(1.27) lev:(0.03) conv:(3.26)
5. [fruit=t, party snack foods=t, total=high]: 854 ==> [bread and cake=t]: 779 <conf:(0.91)>
lift:(1.27) lev:(0.04) conv:(3.15)
6. [vegetables=t, frozen foods=t, biscuits=t, total=high]: 797 ==> [bread and cake=t]: 725
<conf:(0.91)> lift:(1.26) lev:(0.03) conv:(3.06)
7. [vegetables=t, baking needs=t, biscuits=t, total=high]: 772 ==> [bread and cake=t]: 701
<conf:(0.91)> lift:(1.26) lev:(0.03) conv:(3.01)
8. [fruit=t, biscuits=t, total=high]: 954 ==> [bread and cake=t]: 866 <conf:(0.91)>
lift:(1.26) lev:(0.04) conv:(3)
9. [fruit=t, vegetables=t, frozen foods=t, total=high]: 834 ==> [bread and cake=t]: 757
<conf:(0.91)> lift:(1.26) lev:(0.03) conv:(3)
10. [fruit=t, frozen foods=t, total=high]: 969 ==> [bread and cake=t]: 877 <conf:(0.91)>
lift:(1.26) lev:(0.04) conv:(2.92)
INTRODUCTION TO “R”:
R is an integrated suite of software facilities for data manipulation, calculation and
graphical display. Among other things it has
● An effective data handling and storage facility.
● A suite of operators for calculations on arrays, in particular matrices,
● A large, coherent, integrated collection of intermediate tools for data analysis
● A well-developed, simple and effective programming language (called ‘S’) which
includes conditionals, loops, user-defined recursive functions and input and output
facilities.
The term “environment” is intended to characterize it as a fully planned and coherent
system, rather than an incremental accretion of very specific and inflexible tools, as is
frequently the case with other data analysis software.
R is very much a vehicle for newly developing methods of interactive data analysis. It has
developed rapidly, and has been extended by a large collection of packages. However,
most programs written in R are essentially ephemeral, written for a single piece of data
analysis.
You will type R commands into the R console in order to carry out analysis in R. In the R
console you will see:
>
This is the R prompt. We type the commands needed for a particular task after this
prompt. The command is carried out after you hit the Return key.
Once you have started R, you can start typing in commands, and the results will be
calculated immediately, for example:
>2*3
[1] 6
>10-3
[1] 7
All variables (scalars, vectors, matrices, etc.) created by R are called objects. In R, we
assign values to variables using an arrow. For example, we can assign the value 2*3 to the
variable x using the command:
>x<-2*3
To view the contents of any R object, just type its name, and the contents of that R object
will be displayed:
>x
[1] 6
There are several possible different types of objects in R, including scalars, vectors,
matrices, arrays, data frames, tables and lists. The scalar variable x above is one example
of an R object. While a scalar variable such as x has just one element, a vector consists of
several elements. The elements in a vector are all of the same type (for example, numeric
or character), while lists may include elements such as characters as well as numeric
quantities.
To create a vector, we can use the c() (combine) function. For example, to create a vector
called myvector with values 8, 6, 9, 10 and 5, we type:
>myvector<-c(8, 6, 9, 10, 5)
To see the contents of the variable myvector, we can just type its name:
>myvector
[1] 8 6 9 10 5
We can extract a single element of the vector by giving its index in square brackets:
>myvector[4]
[1] 10
In contrast to a vector, a list can contain elements of different types, for example, both
numeric and character elements. A list can also include other variables such as a vector.
The list() function is used to create a list. For example, we could create a list mylist by
typing:
>mylist<-list(name=”Fred”, wife=”Mary”, myvector)
We can then print out the contents of the list mylist by typing its name:
>mylist
$name
[1] “Fred”
$wife
[1] “Mary”
[[3]]
[1] 8 6 9 10 5
The elements in the list are numbered, and can be referred to using indices. We can
extract an element of a list by typing the list name with the index of the element given in
double square brackets (in contrast to a vector, where we only use single square brackets).
Thus, we can extract the second and third elements from mylist by typing:
>mylist[[2]]
[1] “Mary”
>mylist[[3]]
[1] 8 6 9 10 5
Elements of lists may also be named, and in this case the elements may be referred to by
giving the list name, followed by “$”, followed by the element name. For example,
mylist$name is the same as mylist[[1]] and mylist$wife is the same as mylist[[2]]:
>mylist$wife
[1] “Mary”
We can find out the names of the named elements in a list by using the attributes()
function, for example:
>attributes(mylist)
$names
[1] “name” ”wife” “”
When you use the attributes() function to find the named elements of a list variable, the
named elements are always listed under a heading “$names”. Therefore, we see that the
named elements of the list variable mylist are called “name” and “wife”, and we can
retrieve their values by typing mylist$name and mylist$wife, respectively.
>mynames<-c(“Mary”, “John”, “Ann”, “Sinead”, “Joe”, “Mary”, “Jim”, “John”,
“Simon”)
>table(mynames)
mynames
Ann Jim Joe John Mary Simon Sinead
1 1 1 2 2 1 1
CRAN REPOSITORY:
The capabilities of R are extended through user-created packages, which allow
specialized statistical techniques, graphical devices, import/export capabilities, reporting
tools (knitr, Sweave), etc. These packages are developed primarily in R, and sometimes
in Java, C, C++ and Fortran. The R packaging system is also used by researchers to create
compendia to organize research data, code and report files in a systematic way for sharing
and public archiving.
CRAN task views aim to provide some guidance on which packages on CRAN are
relevant for tasks related to a certain topic. They give a brief overview of the included
packages and can be automatically installed using the ctv package. The views are
intended to have a sharp focus, so that it is sufficiently clear which packages should be
included – they are not meant to endorse the “best” packages for a given task.
● To automatically install the views, the ctv package needs to be installed, e.g.,
install.packages(“ctv”)
and then the views can be installed with install.views or updated with update.views, e.g.,
ctv::install.views(“Economics”)
ctv::update.views(“Economics”)
INSTALLATION OF R:
Installing R on a Windows PC
To install R on your Windows computer, follow these steps:
1. Go to http://ftp.heanet.ie/mirrors/cran.r-project.org.
2. Under “Download and Install R”, click on the “Windows” link.
3. Under “Subdirectories”, click on the “base” link.
4. On the next page, you should see a link saying something like “Download R 2.10.1 for Windows”
(or R X.X.X, where X.X.X gives the version of R, eg. R 2.11.1). Click on this link.
5. You may be asked if you want to save or run a file “R-2.10.1-win32.exe”. Choose “Save” and
save the file on the Desktop. Then double-click on the icon for the file to run it.
6. You will be asked what language to install it in - choose English.
7. The R Setup Wizard will appear in a window. Click “Next” at the bottom of the R Setup wizard
window.
8. The next page says “Information” at the top. Click “Next” again.
9. The next page says “Information” at the top. Click “Next” again.
10. The next page says “Select Destination Location” at the top. By default, it will suggest to install R
in “C:\Program Files” on your computer.
11. Click “Next” at the bottom of the R Setup wizard window.
12. The next page says “Select components” at the top. Click “Next” again.
13. The next page says “Startup options” at the top. Click “Next” again.
14. The next page says “Select start menu folder” at the top. Click “Next” again.
15. The next page says “Select additional tasks” at the top. Click “Next” again.
16. R should now be installed. This will take about a minute. When R has finished, you will see
“Completing the R for Windows Setup Wizard” appear. Click “Finish”.
17. To start R, you can either follow step 18, or 19:
18. Check if there is an “R” icon on the desktop of the computer that you are using. If so, double-click
on the “R” icon to start R. If you cannot find an “R” icon, try step 19 instead.
19. Click on the “Start” button at the bottom left of your computer screen, and then choose “All
programs”, and start R by selecting “R” (or R X.X.X, where X.X.X gives the version of R, eg. R
2.10.0) from the menu of programs.
20. The R console (a rectangle) should pop up:
How to install R on non-Windows computers (eg. Macintosh or Linux computers)
The instructions above are for installing R on a Windows PC. If you want to install R on a computer that
has a non-Windows operating system (for example, a Macintosh or a computer running Linux), you should
download the appropriate R installer for that operating system at http://ftp.heanet.ie/mirrors/cran.r-
project.org and follow the R installation instructions for the appropriate operating system
at http://ftp.heanet.ie/mirrors/cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-R-be-installed_003f.

Installing R packages
R comes with some standard packages that are installed when you install R. However, in this booklet I will
also tell you how to use some additional R packages that are useful, for example, the “rmeta” package.
These additional packages do not come with the standard installation of R, so you need to install them
yourself.

How to install an R package


Once you have installed R on a Windows computer (following the steps above), you can install an
additional package by following the steps below:
1. To start R, follow either step 2 or 3:
2. Check if there is an “R” icon on the desktop of the computer that you are using. If so, double-click
on the “R” icon to start R. If you cannot find an “R” icon, try step 3 instead.
3. Click on the “Start” button at the bottom left of your computer screen, and then choose “All
programs”, and start R by selecting “R” (or R X.X.X, where X.X.X gives the version of R, eg. R
2.10.0) from the menu of programs.
4. The R console (a rectangle) should pop up.
5. Once you have started R, you can now install an R package (eg. the “rmeta” package) by choosing
“Install package(s)” from the “Packages” menu at the top of the R console. This will ask you what
website you want to download the package from, you should choose “Ireland” (or another country,
if you prefer). It will also bring up a list of available packages that you can install, and you should
choose the package that you want to install from that list (eg. “rmeta”).
6. This will install the “rmeta” package.
7. The “rmeta” package is now installed. Whenever you want to use the “rmeta” package after this,
after starting R, you first have to load the package by typing into the R console:
>library("rmeta")

MENU BAR OF R CONSOLE:


Anytime you use RStudio, you will see the menu bar across the top of the screen. Composed
of eleven drop-down menus, the Menu Bar will assist you in executing the proper commands
in RStudio. To learn more about the specific functions of each drop-down menu, read the
following sections.

File
The RStudio File menu contains many of the standard functions of a File menu in any other
software or program – New File, Open File, Save, Print. One important feature to mention in
the RStudio File menu is the command Knit Document. Knitting a Document converts your
RStudio file into an HTML file, a PDF document, or a Microsoft Word document. This
makes the work that you have done easy to read and share in a variety of settings. The
File menu additionally allows users to Import Datasets from outside software or programs.

Edit
The Edit menu will likely be a frequently used menu for RStudio users. Here users
can Cut, Copy, Paste, Undo, and Redo. The Edit menu is also very helpful in locating code or
commands previously used. Featuring Go to Line…, Find…, and Replace and Find, users can
quickly edit or replace RStudio code. The Edit menu additionally has the command Clear
Console, which allows users to wipe the Console clean. Clear Console does not
affect the Source, Environment/History, or Miscellaneous tabs.

Code
The Code menu has commands used for working with code. Similar to the Edit menu, Code
offers Jump To… for quick access to a specific piece of code. Here you can rework the
appearance of your code with Reformat Code, and functions and variables may be extracted
with Extract Function and Extract Variable. The Code menu also provides commands for
running the code with Run Selected Line(s), Re-Run Previous, and Run Region. RStudio code
may be run directly in the Source tab by selecting either the green arrow with each code
chunk or with the Run button at the top right of the Source tab, or users may specify which
code they wish to run with one of the commands from the Code menu.

View
The View menu is focused on how the user sees their RStudio workspace. This menu allows
you to choose which tab you want to view, as well as where the mouse should be
focused. Panes focuses on zoom abilities of RStudio and allows users to zoom in on a
specific tab.

Plots
The Plots menu works specifically with plots that you have made in RStudio. This menu
allows you to quickly switch between plots and zoom on the plot to enhance clarity. This is
where you can choose to save your plot as either an image or a PDF, as well as where you
may delete unwanted plots with Remove Plot….

Session
The Session menu allows users to open a New Session… and Quit Session…. Here you may
also Terminate R… if you are completely finished with using RStudio, or Restart R if an
update needs to occur or the program needs to be refreshed.

Build
The Build menu features only one command: Configure Build Tools. Build Tools may only
be configured inside of an RStudio project and are a way to package and distribute R code.

Debug
The Debug menu is what ensures your code is working properly. When there is an error in
your coding or commands, the Debug menu will point out the errors and allow you to decide
if you wish to keep working with your RStudio code. You may choose how you wish to be
notified of an error by using the option On Error. If you need additional help fixing an error,
select Debugging Help.

Profile
The Profile menu allows users to better understand what exactly RStudio is doing. Profiling
provides a graphical interface so that users may see what RStudio is working on in moments
when you are waiting for an output to appear. By profiling, users can learn what is slowing
their code down so that they may tweak parts in order to make the code run faster.

Tools
The Tools menu provides information on the current version of RStudio being run, as well as
being the location where Packages and Addins may be installed. Tools also assists RStudio
users with Keyboard Shortcuts, and allows users to Modify Keyboard Shortcuts to cater to
individual’s needs and preferences.

Help
The Help menu provides information to help users maximize their RStudio proficiency.
Direct links to RStudio Help are provided, as well as Cheatsheets made by RStudio
professionals and a section on Diagnostics to allow the user to see what is occurring in
RStudio.

BASIC PLOTTING:
HISTOGRAM:

It is a diagram consisting of rectangles whose area is proportional to the frequency of
a variable and whose width is equal to the class interval.
In R: To plot a histogram of the data, use the ‘hist’ command:
>hist(w1$vals)
where w1 is a data frame

BAR PLOTS:

A bar chart represents data in rectangular bars with the length of each bar proportional to
the value of the variable. R uses the function barplot() to create bar charts.
Syntax: barplot(H, xlab, ylab, main, names.arg, col)

DOT PLOTS:

A dot plot is a statistical chart consisting of data points plotted on a fairly simple
scale, typically using filled-in circles.
Syntax: dotchart(x, labels, groups, gcolor, color)

SCATTER PLOTS:

Scatter plots show many points plotted in the Cartesian plane. Each point represents
the values of two variables. One variable is chosen on the horizontal axis and the other on
the vertical axis.
We use plot() function in R.
Syntax: plot(x, y, main, xlab, ylab, xlim, ylim, axes)

BOX PLOTS:
Box plots are a measure of how well distributed the data in a data set is. A box plot divides
the data set into three quartiles and represents the minimum, maximum, median, first quartile
and third quartile of the data set.
We use boxplot() function in R.
Syntax: boxplot(x, data, notch, varwidth, names, main)

LINE GRAPHS:

A line graph is a graph that connects a series of points by drawing line segments
between them. These points are ordered by one of their coordinate values. Line charts are
usually used to identify trends in the data.
We use plot() function in R.
Syntax: plot(v, type, col, xlab, ylab)
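The chart commands above can be tried together on a small invented vector; plotting to a temporary PNG file (a choice made here only so the sketch runs in a non-interactive R session) avoids needing a display:

```r
v <- c(7, 12, 28, 3, 41)          # invented sample data

f <- tempfile(fileext = ".png")   # write all plots to a temporary file
png(f)
par(mfrow = c(2, 3))              # 2x3 grid, one panel per chart type
hist(v, main = "Histogram")
barplot(v, main = "Bar plot")
dotchart(v, main = "Dot plot")
plot(v, seq_along(v), main = "Scatter plot")
boxplot(v, main = "Box plot")
plot(v, type = "o", main = "Line graph")
dev.off()
```

In an interactive session you would simply omit the png()/dev.off() calls and the plots would appear in the graphics window one panel at a time.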

IRIS DATA SET

The Iris Dataset contains four features (length and width of sepals and petals) for 50 samples
of each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). These measures
were used to create a linear discriminant model to classify the species. The dataset is often
used in data mining, classification and clustering examples and to test algorithms.
Attribute Information:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
1)Aim:
Load the 'iris.csv' file and display the name and type of each column. Find statistics
such as min, max, range, mean, median, variance, and standard deviation for each column of
data.
i)Loading CSV file :
The sample data can also be in comma separated values (CSV) format. Each cell inside
such data file is separated by a special character, which usually is a comma, although other
characters can be used as well.
The first row of the data file should contain the column names instead of the actual data. Here
is a sample of the expected format.
Col1,Col2,Col3
100,a1,b1
200,a2,b2
300,a3,b3
Syntax :mydata = read.csv("mydata.csv")

ii)head() function :
Returns the first part of a vector, matrix, table, data frame or function.
Syntax:head(x, n=6)
▪ x – A matrix, data frame, or vector.

▪ n – The first n rows (or values if x is a vector) will be returned.

iii)names() function :
Functions to get or set the names of an object.
Syntax: names(x)
names(x) <- value
● x-An Object
● value-a character vector of same length as x or NULL
iv)View() function:
Invoke a spreadsheet-style data viewer on a matrix-like R object.
Syntax:View(x, title)
v)Mean:
It is calculated by taking the sum of the values and dividing by the number of values in a data
series.

The function mean() is used to calculate this in R.


Syntax:mean(x, trim = 0, na.rm = FALSE, ...)
● x is the input vector.

● trim is used to drop some observations from both end of the sorted vector.

● na.rm is used to remove the missing values from the input vector.

vi)Median:
The middlemost value in a data series is called the median. The median() function is used in
R to calculate this value.
Syntax:median(x, na.rm = FALSE)
● x is the input vector.

● na.rm is used to remove the missing values from the input vector.

vii)Range :
Returns a vector containing the minimum and maximum of all the given arguments.
Syntax: range(…, na.rm = FALSE, finite = FALSE)
● … : any numeric or character objects.
● na.rm : logical, indicating if NAs should be omitted.
● finite : logical, indicating if all non-finite elements should be omitted.

viii)Max and Min :


max() function computes the maximum value of a vector. min() function computes the
minimum value of a vector.
Syntax:
max(x,na.rm=FALSE)
min(x,na.rm=FALSE)

• x: number vector
• na.rm: whether NA should be removed, if not, NA will be returned
ix)Variance :
The variance is a numerical measure of how the data values are dispersed around the mean.
Syntax:var(x)

x)Standard Deviation :
A measure that is used to quantify the amount of variation or dispersion of a set of data
values.
Syntax :sqrt(var(x))
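Since var() uses the sample (n − 1) denominator, sqrt(var(x)) matches R's built-in sd() function exactly; a quick check on an invented vector:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # invented sample data

# sd() is the built-in standard deviation; it equals sqrt(var(x))
all.equal(sqrt(var(x)), sd(x))   # TRUE
```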

xi)nrow and ncol :


nrow and ncol return the number of rows or columns present in x. NCOL and NROW do the
same, treating a vector as a 1-column matrix.
Syntax:nrow(x) , ncol(x)

x-a vector, array or data frame


#step1:Load the iris dataset

>data("iris")

head(iris) ## This displays the first six rows


>head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

# step-2 finding out dimensions of a dataset

dim(iris)
>dim(iris)
[1] 150 5
# step-3 Display names of each column in dataset.

names(iris)

[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"


# step-4 displaying our original dataset

View(iris)

#step-5

#' Get basic statistics of mean and standard deviation,min,max,range,

#' variance for all columns,do it yourself example:

mean(iris$Sepal.Length)
>mean(iris$Sepal.Length)
[1] 5.843333

sd(iris$Sepal.Length)
>sd(iris$Sepal.Length)
[1] 0.8280661

min(iris$Sepal.Length)
>min(iris$Sepal.Length)
[1] 4.3

min(iris$Sepal.Width)
>min(iris$Sepal.Width)
[1] 2

range(iris$Sepal.Length)
>range(iris$Sepal.Length)
[1] 4.3 7.9

max(iris$Sepal.Width)
>max(iris$Sepal.Width)
[1] 4.4

mean(iris$Petal.Length)
>mean(iris$Petal.Length)
[1] 3.758

sd(iris$Petal.Length)
>sd(iris$Petal.Length)
[1] 1.765298

var(iris$Sepal.Length)
>var(iris$Sepal.Length)
[1] 0.6856935

nrow(iris)
>nrow(iris)
[1] 150

ncol(iris)
>ncol(iris)
[1] 5

colnames(iris)
>colnames(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

sapply(iris,typeof)
>sapply(iris,typeof)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
    "double"     "double"     "double"     "double"    "integer"

summary(iris)
2)Aim:
To write R program to normalize the variables into 0 (zero) to 1 (one) scale using
min-max normalization.

Normalization:
Normalization is a database design technique which organizes tables in a manner that reduces
redundancy and dependency of data. It divides larger tables into smaller tables and links them
using relationships.

Use of Normalization:
● Database design is critical to the successful implementation of a database
management system that meets the data requirements of an enterprise system.
● Normalization helps produce database systems that are cost-effective and have better
security models.
● Functional dependencies are a very important component of the data normalization
process.
● Most database systems are normalized up to the third normal form.
● A primary key uniquely identifies a record in a table and cannot be null.
● A foreign key helps connect tables and references a primary key.

Min-Max Normalization:
Zi = (xi − min(x)) / (max(x) − min(x))

Min-Max is a technique used to normalize the data. It scales the data between 0 and 1,
where xi is the ith data point and min(x) and max(x) are the minimum and maximum of the
data, so each xi is converted to Zi.

Syntax:
1. lapply(X, FUN, …) :
lapply returns a list of the same length as X, each element of which is the
result of applying FUN to the corresponding element of X, where X is a
vector or an expression object and FUN is the function to be applied to each
element of X.
2. as.data.frame :
Checks whether an object is a data frame, or coerces it if possible.
as.data.frame returns a data frame, normally with all rows.

Program:
min_max_normalizer<-function(x)
{
num<- x-min(x)
denom<-max(x)-min(x)
return(num/denom)
}
normalized_iris<-as.data.frame(lapply(iris[1:4],min_max_normalizer))
summary(normalized_iris)
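As a sanity check (not part of the original program), each rescaled column should now span exactly 0 to 1; the sketch below repeats the normalizer so it runs on its own:

```r
data(iris)   # built-in iris dataset

# Same min-max normalizer as in the program above
min_max_normalizer <- function(x) (x - min(x)) / (max(x) - min(x))
normalized_iris <- as.data.frame(lapply(iris[1:4], min_max_normalizer))

# Every normalized column should have minimum 0 and maximum 1
sapply(normalized_iris, range)
```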

Output:
3)Aim:
Generate histograms for any one variable (sepal length/sepal width/petal length/petal width)
and generate scatter plots for every pair of variables showing each species in different colour.

Histogram :
A histogram is a display of statistical information that uses rectangles to show the frequency
of data items in successive numerical intervals of equal size.
Syntax:
hist(v,main,xlab,xlim,ylim,breaks,col,border)

Following is the description of the parameters used −


● v is a vector containing numeric values used in histogram.
● main indicates title of the chart.
● col is used to set color of the bars.
● border is used to set border color of each bar.
● xlab is used to give description of x-axis.
● xlim is used to specify the range of values on the x-axis.
● ylim is used to specify the range of values on the y-axis.
● breaks is used to specify the breakpoints between the bars (and hence the number and width of the bars).

Scatter Plot :
A graph of plotted points that show the relationship between two sets of data.
Syntax:
plot(x, y = NULL, type = "p", xlim = NULL, ylim = NULL, log = "", main = NULL,
sub = NULL, xlab = NULL, ylab = NULL, ann = par("ann"), axes = TRUE,
frame.plot = axes, panel.first = NULL, panel.last = NULL, asp = NA,..)

There are many arguments supported by the Scatter Plot in R programming language,
and following are some of the arguments in real-time:

● x, y: Please specify the data sets you want to compare. Here, you can use two
separate vectors, or a Matrix with columns, or lists.
● type: Please specify, what type of plot you want to draw.
● To draw Points use type = “p”
● To draw Lines use type = “l”
● Use type = “h” for Histograms
● Use type = “s” for stair steps
● To draw over-plotted use type = “o”
● sub: You can provide the subtitle (if any) for your scatter plot.
● log: You have to specify a character string of three options: if the X-Axis is to be
logarithmic then “x”, if the Y-Axis is to be logarithmic then “y”, and if both the X-Axis
and Y-Axis are to be logarithmic then specify either “xy” or “yx”.
● axes: It is a Boolean argument. If it is TRUE, the axes are drawn.
● frame.plot: It is a Boolean argument that specifies whether a box should be drawn
around the plot or not.
● panel.first: Please specify an expression that is evaluated after the axes are drawn but
before the points are plotted.
● panel.last: Please specify an expression that is evaluated after the points are plotted.
● asp: Please specify the aspect ratio of the plot (as y/x).

ggplot2:
The ggplot2 package, created by Hadley Wickham, offers a powerful graphics language for
creating elegant and complex plots.

Advantages of ggplot2:
● consistent underlying grammar of graphics (Wilkinson, 2005)
● plot specification at a high level of abstraction
● very flexible
● theme system for polishing plot appearance
● mature and complete graphics system
● many users, active mailing list

ggplot():
It initializes a ggplot object. It can be used to declare the input data frame for a graphic and to
specify the set of plot aesthetics intended to be common throughout all subsequent layers
unless specifically overridden.
ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())

Arguments

data
Default dataset to use for the plot. If not already a data.frame, it will be converted to one
by fortify. If not specified, it must be supplied in each layer added to the plot.
mapping
Default list of aesthetic mappings to use for the plot. If not specified, it must be supplied in
each layer added to the plot.
...
Other arguments passed on to methods. Not currently used.
environment
If a variable defined in the aesthetic mapping is not found in the data, ggplot will
look for it in this environment. It defaults to the environment in
which ggplot() is called.
Program:
hist(iris$Sepal.Length, col="green")

hist(iris$Sepal.Width, col="red")

hist(iris$Petal.Length, col="blue")
hist(iris$Petal.Width, col="green")

hist(iris$Petal.Length)
hist(iris$Petal.Length,xlim=c(0,8),ylim=c(0,20))

hist(iris$Petal.Length,xlim=c(0,8),ylim=c(0,50))
hist(iris$Petal.Length,xlim=c(0,8),ylim=c(0,20))

install.packages("ggplot2")
library(ggplot2)
ggplot(iris,aes(x=Petal.Width,y=Petal.Length,shape=Species,colour=Species)) +
geom_point() + xlab("Petal Width (cm)") + ylab("Petal Length (cm)") + theme_bw() +
ggtitle("Flower Characteristics in Iris\n")

ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,shape=Species,colour=Species)) +
geom_point() + xlab("Sepal Width (cm)") + ylab("Sepal Length (cm)") + theme_bw() +
ggtitle("Flower Characteristics in Iris\n")
plot(iris$Sepal.Length, iris$Sepal.Width, xlab = "Length", ylab = "Width", main = "Sepal")

pairs(iris[,1:4])
pairs(iris[,1:4],col=iris[,5])
4) Aim:
Generate boxplots for each of the numerical attributes. Identify the attributes with
highest variance.

i) subset() function:
The subset() function is available in base R and can be used to return subsets of a vector,
matrix, or data frame which meet a particular condition.
Syntax: subset(object, logical expression, select, drop = FALSE, ...)

ii) par() function:
par() can be used to set or query graphical parameters. Parameters can be set by specifying
them as arguments to par in tag=value form, or by passing them as a list of tagged values.
Syntax: par(arguments in tag=value form, no.readonly = FALSE)
▪ no.readonly – logical; if TRUE, and there are no other arguments, only
parameters are returned which can be set by a subsequent par() call on the same
device

iii) boxplot() function:
A boxplot is a measure of how well distributed the data in a data set is. It divides the data set
into quartiles, and the graph represents the minimum, maximum, median, first quartile and
third quartile of the data set. It is also useful for comparing the distribution of data across data
sets by drawing a boxplot for each of them.
Syntax: boxplot(x, data, notch, varwidth, names, main)
▪ x is a vector or a formula.

▪ data is the data frame.

▪ notch is a logical value. Set as TRUE to draw a notch.

▪ varwidth is a logical value. Set as TRUE to draw the width of each box proportionate to
the sample size.

▪ names are the group labels which will be printed under each boxplot.

▪ main is used to give a title to the graph.

iv)Variance in a Box Plot


Variability in a data set that is described by the five-number summary is measured by
the interquartile range (IQR). The IQR is equal to Q3 – Q1, the difference between the 75th
percentile and the 25th percentile (the distance covering the middle 50% of the data). The
larger the IQR, the more variable the data set is.

Program:
>par(mar=c(7,5,1,1)) # more space for labels
>boxplot(iris,las=2) # boxplot for all attributes

#This gives us a rough estimate of the distribution of the values for each attribute. But maybe
it makes more sense to see the distribution of the values considering each class, since we
have labels for each class.
>irisVer<- subset(iris, Species == "versicolor")
>irisSet<- subset(iris, Species == "setosa")
>irisVir<- subset(iris, Species == "virginica")
>par(mfrow=c(1,3))
>boxplot(irisVer[,1:4], main="Versicolor",ylim = c(0,8),las=2)
>boxplot(irisSet[,1:4], main="Setosa",ylim = c(0,8),las=2)
>boxplot(irisVir[,1:4], main="Virginica",ylim = c(0,8),las=2)
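To answer the second part of the aim — identifying the attribute with the highest variance — the per-column variances can also be computed numerically rather than judged by eye from the boxplots (a small sketch on the same iris data):

```r
# Variance of each numeric attribute in iris
vars <- sapply(iris[, 1:4], var)
print(vars)

# The attribute with the highest variance
names(vars)[which.max(vars)]  # "Petal.Length"
```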
5)Aim:
Study of homogeneous and heterogeneous data structures such as vector, matrix, array, list,
data frame in R.
Data Structures:
➢ Vector:
The c() function can be used to create vectors of objects by concatenating things
together. Elements of a vector are accessed using indexing. The [ ] brackets are used
for indexing. Indexing starts with position 1.

Eg: x <- c("one", "two", "three")

➢ Matrices:
This is a two-dimensional structure (like a data frame),but one where all values are
of the same type (like a vector).

● A matrix can be created directly using the matrix() function


● The following code creates a matrix from 6 values, with 3 columns and
two rows; the values are used column-first.

Eg: > matrix(1:6, ncol=3)

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


➢ Array:
The array data structure extends the idea of a matrix to more than two dimensions.
For example, a three-dimensional array corresponds to a data cube.

The array() function can be used to create an array.


In the following code, a two-by-two-by-two, three-dimensional array is created.

Eg: x<- array(1:8, dim=c(2, 2, 2))

Accessing elements in an array (all three indices must be supplied):

x[, , 1] # gives the first 2-by-2 slice of the array

x[1, 2, 2] # gives the element in row 1, column 2 of the second slice

➢ Data Frames:
Most data sets consist of more than just one variable,so to store a complete data set
we need a different data structure. In R, several variables can be stored together in an
object called a dataframe.
Creating a data frame:

Eg: d <- c(1, 2, 3, 4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata<- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") # variable names

➢ List:
A list is a generic vector containing other objects. The components of a list can be
simple vectors similar to a data frame, but with each component allowed to have a
different length.

Creating a list:

n = c(2, 3, 5)
s = c("aa", "bb", "cc", "dd", "ee")
b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
x = list(n, s, b, 3) # x contains copies of n, s, b

Accessing elements in a list:

Eg: x[[2]]


[1] "aa" "bb" "cc" "dd" "ee"

1.Studying vector:

Here we have created a character vector and accessed an element in that vector
2. Studying Matrix:
Here we have created matrices using numbers from 1 to 6 with varying number of columns
3. Studying Arrays:
Creation of a three-dimensional array is shown below

4. Data frame:
5. List
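The screenshots referred to above are not reproduced here; an equivalent console session (a sketch) covering all five structures might look like:

```r
# Vector: character vector and element access
x <- c("one", "two", "three")
x[2]                               # "two"

# Matrix: 6 values laid out column-first into 3 columns
m <- matrix(1:6, ncol = 3)
m[2, 3]                            # 6

# Array: a 2 x 2 x 2 data cube
a <- array(1:8, dim = c(2, 2, 2))
a[1, 2, 2]                         # 7

# Data frame: heterogeneous columns of equal length
mydata <- data.frame(ID = c(1, 2), Color = c("red", "white"), Passed = c(TRUE, FALSE))
mydata$Color[1]                    # "red"

# List: components of different types and lengths
lst <- list(n = c(2, 3, 5), s = c("aa", "bb"))
lst$s                              # "aa" "bb"
```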
6)Aim:
To write R program using ‘apply’ group of functions to create and apply
normalization function on each of the numeric variables/columns of the iris dataset to
transform them into a value around 0 with Z-score normalization

Z-score:
Z-scores are linearly transformed data values having a mean of zero and a standard deviation
of 1. Z-scores are also known as standardized scores; they are scores (or data values) that have
been given a common standard. This standard is a mean of zero and a standard deviation of 1.

Standardize Data (Z-score)


Standardize the scale of features to make them comparable. For each column the mean is
subtracted (centering) and it is divided by the standard deviation (scaling). Now most values
should be in [-3, 3].

Program:
znorm = function(x)
{
z = (x-mean(x))/sd(x)
return(z)
}
zscore<- as.data.frame(lapply(iris[1:4], znorm))
zscore
View(zscore)
summary(zscore)
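As a quick sanity check (a sketch) that the transformation worked, each standardized column should have a mean of effectively 0 and a standard deviation of exactly 1:

```r
# Z-score normalization of the numeric iris columns, then a sanity check
znorm <- function(x) (x - mean(x)) / sd(x)
zscore <- as.data.frame(lapply(iris[1:4], znorm))

round(sapply(zscore, mean), 10)  # all effectively 0
sapply(zscore, sd)               # all exactly 1
```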

Output:
7.a) Aim :
Use R to apply Linear regression to predict evaporation coefficient in terms of air
velocity using data given below:
air velocity evaporation
20 0.18
60 0.37
100 0.35
140 0.78
180 0.56
220 0.75
260 1.18
300 1.36
340 1.17
380 1.65

i)Linear Regression:
Linear regression is used to predict the value of an outcome variable Y based on one or more
input predictor variables X. The aim is to establish a linear relationship (a mathematical
formula) between the predictor variable(s) and the response variable, so that we can use this
formula to estimate the value of the response Y when only the predictor (X) values are
known.
The aim of linear regression is to model a continuous variable Y as a mathematical function
of one or more X variable(s), so that we can use this regression model to predict the Y when
only the X is known. This mathematical equation can be generalized as follows:

Y = β1 + β2X + ϵ
where, β1 is the intercept and β2 is the slope. Collectively, they are called regression
coefficients. ϵ is the error term, the part of Y the regression model is unable to explain.
The aim of this exercise is to build a simple regression model that we can use to predict
Evaporation by establishing a statistically significant linear relationship with air velocity.

ii) str() :
Compactly Display The Structure Of An Arbitrary R Object
Compactly display the internal structure of an R object, a diagnostic function and an
alternative to summary (and to some extent, dput). Ideally, only one line for each ‘basic’
structure is displayed. It is especially well suited to compactly display the (abbreviated)
contents of (possibly nested) lists. The idea is to give reasonable output for any R object. It
calls args for (non-primitive) function objects.

Usage :
str(object, …)
# S3 method for data.frame
str(object, …)
# S3 method for default
str(object, max.level = NA,
vec.len = strO$vec.len, digits.d = strO$digits.d,
nchar.max = 128, give.attr = TRUE,
drop.deparse.attr = strO$drop.deparse.attr,
give.head = TRUE, give.length = give.head,
width = getOption("width"), nest.lev = 0,
indent.str = paste(rep.int(" ", max(0, nest.lev + 1)),
collapse = ".."),
comp.str = "$ ", no.list = FALSE, envir = baseenv(),
strict.width = strO$strict.width,
formatNum = strO$formatNum, list.len = 99, …)
strOptions(strict.width = "no", digits.d = 3, vec.len = 4,
drop.deparse.attr = TRUE,
formatNum = function(x, ...)
format(x, trim = TRUE, drop0trailing = TRUE, ...))
Arguments
object
any R object about which you want to have some information.
max.level
maximal level of nesting which is applied for displaying nested structures, e.g., a list
containing sub lists. Default NA: Display all nesting levels.
vec.len
numeric (>= 0) indicating how many ‘first few’ elements are displayed of each vector. The
number is multiplied by different factors (from .5 to 3) depending on the kind of vector.
Defaults to the vec.len component of option "str" (see options) which defaults to 4.
digits.d
number of digits for numerical components (as for print). Defaults to the digits.d component
of option "str" which defaults to 3.
nchar.max
maximal number of characters to show for character strings. Longer strings are truncated,
see longch example below.
give.attr
logical; if TRUE (default), show attributes as sub structures.
drop.deparse.attr
logical; if TRUE (default), deparse(control = <S>) will not have "showAttributes" in <S>.
Used to be hard coded to FALSE and hence can be set via strOptions() for back
compatibility.
give.length
logical; if TRUE (default), indicate length (as [1:…]).
give.head
logical; if TRUE (default), give (possibly abbreviated) mode/class and length
(as <type>[1:…]).
width
the page width to be used. The default is the currently active options("width"); note that this
has only a weak effect, unless strict.width is not "no".
nest.lev
current nesting level in the recursive calls to str.
indent.str
the indentation string to use.
comp.str
string to be used for separating list components.
no.list
logical; if true, no ‘list of …’ nor the class are printed.
envir
the environment to be used for promise (see delayedAssign) objects only.
strict.width
string indicating if the width argument's specification should be followed strictly, one of the
values c("no", "cut", "wrap"), which can be abbreviated. Defaults to
the strict.width component of option "str" (see options) which defaults to "no" for back
compatibility reasons; "wrap" uses strwrap(*, width = width) whereas "cut" cuts directly
to width. Note that a small vec.len may be better than setting strict.width = "wrap".
formatNum
a function such as format for formatting numeric vectors. It defaults to
the formatNum component of option "str", see “Usage” of strOptions() above, which is
almost back compatible to R <= 2.7.x, however, using formatC may be slightly better.
list.len
numeric; maximum number of list elements to display within a level.
…
potential further arguments (required for Method/Generic reasons).
Value
str does not return anything, for efficiency reasons. The obvious side effect is output to the
terminal.

iii)lm () :
Fitting Linear Models
lm is used to fit linear models. It can be used to carry out regression, single stratum analysis
of variance and analysis of covariance
Usage
lm(formula, data, subset, weights, na.action,
method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, …)

-Arguments:
formula
an object of class "formula" (or one that can be coerced to that class): a symbolic description
of the model to be fitted. The details of model specification are given under ‘Details’.
data
an optional data frame, list or environment (or object coercible by as.data.frame to a data
frame) containing the variables in the model. If not found in data, the variables are taken
from environment(formula), typically the environment from which lm is called.
subset
an optional vector specifying a subset of observations to be used in the fitting process.
weights
an optional vector of weights to be used in the fitting process. Should be NULL or a numeric
vector. If non-NULL, weighted least squares is used with weights weights (that is,
minimizing sum(w*e^2)); otherwise ordinary least squares is used. See also ‘Details’.
na.action
a function which indicates what should happen when the data contain NAs. The default is set
by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default
is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful.
method
the method to be used; for fitting, currently only method = "qr" is supported; method =
"model.frame" returns the model frame (the same as with model = TRUE, see below).
model, x, y, qr
logicals. If TRUE the corresponding components of the fit (the model frame, the model
matrix, the response, the QR decomposition) are returned.
singular.ok
logical. If FALSE (the default in S but not in R) a singular fit is an error.
contrasts
an optional list. See the contrasts.arg of model.matrix.default.
offset
this can be used to specify an a priori known component to be included in the linear predictor
during fitting. This should be NULL or a numeric vector of length equal to the number of
cases. One or more offset terms can be included in the formula instead or as well, and if more
than one are specified their sum is used. See model.offset.

PROGRAM :
#linear regression with correlation analysis
Step 1: Import the dataset from where you have saved it; the file path should be checked.
➔ library(readxl)
mydata<- read_excel("C:\\Users\\shweta\\Programs\\airvelocity2.xlsx")
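If the Excel file is not available, the same ten observations from the table in the aim can be entered directly (a sketch; the column names match those used in the rest of the program):

```r
# Air velocity / evaporation data entered directly, avoiding the Excel import
mydata <- data.frame(
  airvelo     = seq(20, 380, by = 40),
  evaporation = c(0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65)
)
```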

➔ View(mydata)
➔ names(mydata)[1]<-"airvelo" #changing names of columns

➔ names(mydata) #checking new column names


names(mydata)
[1] "airvelo" "evaporation"

➔ typeof(mydata) #checking the type of the dataset


typeof(mydata)
[1] "list"

➔ View(mydata) #displaying dataset


-> plot(mydata$airvelo,mydata$evaporation,main = "scatterplot")
➔ plot(mydata,main="scatterplot") #scatterplot for dataset

➔ str(mydata) # findingout structure of dataset


str(mydata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 2 variables:
$ airvelo : num 20 60 100 140 180 220 260 300 340 380
$ evaporation: num 0.18 0.37 0.35 0.78 0.56 0.75 1.18 1.36 1.17 1.65

➔ summary(mydata) #checking summary of dataset


➔ summary(mydata)
➔ airvelo evaporation
➔ Min. : 20 Min. :0.1800
➔ 1st Qu.:110 1st Qu.:0.4175
➔ Median :200 Median :0.7650
➔ Mean :200 Mean :0.8350
➔ 3rd Qu.:290 3rd Qu.:1.1775
➔ Max. :380 Max. :1.6500

➔ Step 2: Apply linear regression using the lm() function (note that this first fit models air velocity as a function of evaporation; Step 3 reverses it to match the aim)


->airvelo.mod1 = lm(airvelo ~evaporation , data = mydata)

airvelo.mod1 = lm(airvelo ~evaporation , data = mydata)


> summary(airvelo.mod1)

Call:
lm(formula = airvelo ~ evaporation, data = mydata)

Residuals:
Min 1Q Median 3Q Max
-46.99 -24.88 -17.14 33.74 60.79

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.564 25.804 0.099 0.923
evaporation 236.450 27.035 8.746 2.29e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 39.53 on 8 degrees of freedom


Multiple R-squared: 0.9053, Adjusted R-squared: 0.8935
F-statistic: 76.49 on 1 and 8 DF, p-value: 2.286e-05

➔ Step 3: Predict the evaporation coefficient in terms of air velocity
->airvelo.mod1 = lm(evaporation~ airvelo , data = mydata)
->summary(airvelo.mod1)
airvelo.mod1 = lm(evaporation~ airvelo , data = mydata)
> summary(airvelo.mod1)

Call:
lm(formula = evaporation ~ airvelo, data = mydata)

Residuals:
Min 1Q Median 3Q Max
-0.20103 -0.14671 0.05261 0.12318 0.17473

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0692424 0.1009737 0.686 0.512
airvelo 0.0038288 0.0004378 8.746 2.29e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1591 on 8 degrees of freedom


Multiple R-squared: 0.9053, Adjusted R-squared: 0.8935
F-statistic: 76.49 on 1 and 8 DF, p-value: 2.286e-05
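With the model from Step 3, evaporation at new air velocities can be predicted via predict() (a sketch; the data frame re-enters the ten observations from the table so the snippet runs on its own):

```r
# Refit evaporation ~ air velocity and predict evaporation at a new air velocity
mydata <- data.frame(
  airvelo     = seq(20, 380, by = 40),
  evaporation = c(0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65)
)
airvelo.mod1 <- lm(evaporation ~ airvelo, data = mydata)

# Predicted evaporation at air velocity 250, i.e. 0.0692 + 0.0038288 * 250
predict(airvelo.mod1, newdata = data.frame(airvelo = 250))
```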
7)b).Analyze the significance of residual standard-error value, R-squared
value, F statistic. Find the correlation coefficient for this data and analyze
the significance of the correlation value.

Residual Standard error value:


In statistics and optimization, errors and residuals are two closely related and easily
confused measures of the deviation of an observed value of an element of a statistical
sample from its "theoretical value". The error (or disturbance) of an observed value is the
deviation of the observed value from the (unobservable) true value of a quantity of interest
(for example, a population mean), and the residual of an observed value is the difference
between the observed value and the estimated value of the quantity of interest (for example,
a sample mean).

R-Squared value:
R-squared is a statistical measure of how close the data are to the fitted regression line. It is
also known as the coefficient of determination, or the coefficient of multiple determination
for multiple regression.

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%:

 0% indicates that the model explains none of the variability of the response data
around its mean.
 100% indicates that the model explains all the variability of the response data around
its mean.

In general, the higher the R-squared, the better the model fits your data.

F-statistic:
F-tests are named after their test statistic, F, which was named in honor of Sir Ronald Fisher.
The F-statistic is simply a ratio of two variances. Variances are a measure of dispersion, or
how far the data are scattered from the mean. Larger values represent greater dispersion.

F-statistics are based on the ratio of mean squares. The term “mean squares” may sound
confusing, but it is simply an estimate of population variance that accounts for the degrees of
freedom (DF) used to calculate that estimate.

Correlation Coefficient:

A correlation coefficient is a numerical measure of some type of correlation, meaning a
statistical relationship between two variables. The variables may be two columns of a
given data set of observations, often called a sample, or two components of a multivariate
random variable with a known distribution.
Several types of correlation coefficients exist, each with its own definition and its own range of
usability and characteristics. They all assume values in the range from −1 to +1, where +1
indicates the strongest possible agreement and −1 the strongest possible disagreement.

Finding the coefficients and drawing scatter plot:

Using abline function we have added a line in the scatter plot:


After correlation test:

Kendall's tau statistic:
Spearman's rho statistic:
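The screenshots are not reproduced here; an equivalent session (a sketch, re-entering the data so it runs on its own) that draws the fitted line and runs the correlation tests might look like:

```r
# Recreate the data, draw the scatter plot with the fitted regression line,
# and compute correlation coefficients (Pearson, Kendall's tau, Spearman's rho)
mydata <- data.frame(
  airvelo     = seq(20, 380, by = 40),
  evaporation = c(0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65)
)
plot(mydata$airvelo, mydata$evaporation, main = "scatterplot")
abline(lm(evaporation ~ airvelo, data = mydata))   # add the regression line

cor(mydata$airvelo, mydata$evaporation)            # Pearson, about 0.95
cor.test(mydata$airvelo, mydata$evaporation, method = "kendall")
cor.test(mydata$airvelo, mydata$evaporation, method = "spearman")
```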

Aim: Perform a log transformation on the ‘Air velocity’ column, perform linear
regression again, and analyse all the relevant values.
Log transformation :
The log transformation can be used to make highly skewed distributions less skewed.
This can be valuable both for making patterns in the data more interpretable and for helping
to meet the assumptions of inferential statistics. The log transformation is a relatively strong
transformation. Because certain measurements in nature are naturally log-normal, it is often
a successful transformation for certain data sets. While the transformed data here does not
follow a normal distribution very well, it is probably about as close as we can get with these
particular data.

Syntax (note: this logTransform() function comes from the Bioconductor flowCore package;
the program below simply uses base R's log()):
logTransform(transformationId="defaultLogTransform", logbase=10, r=1, d=1)

Arguments:

transformationId
character string to identify the transformation
logbase
positive double that corresponds to the base of the logarithm.
r
positive double that corresponds to a scale factor.
d
positive double that corresponds to a scale factor.

Value:

Returns an object of class transform.

Program:
T_log = log(mydata$airvelo)
install.packages("rcompanion")
library(rcompanion)
plotNormalHistogram(T_log)
str(T_log)

View(T_log)

summary(T_log)
plot(T_log,main="scatterplot")

linearMod<- lm(evaporation ~ T_log, data = mydata)  # regress evaporation on log(air velocity)


linearMod

summary(linearMod)
attributes(linearMod)

linearMod$coefficients
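The relevant values (residual standard error, R-squared, F statistic) can also be extracted programmatically from the fitted object rather than read off the printed summary (a sketch that re-enters the data so it runs on its own; the model regresses evaporation on the log-transformed air velocity, per the aim):

```r
# Rebuild the log-transformed model and pull out its fit statistics
mydata <- data.frame(
  airvelo     = seq(20, 380, by = 40),
  evaporation = c(0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65)
)
T_log <- log(mydata$airvelo)                         # log transformation
linearMod <- lm(evaporation ~ T_log, data = mydata)

s <- summary(linearMod)
s$sigma           # residual standard error
s$r.squared       # R-squared
s$adj.r.squared   # adjusted R-squared
s$fstatistic      # F statistic with its degrees of freedom
```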
