
Data Mining Assignment #1

Submitted By : Rahul Kumar

Roll No : 160BTCCSE010

Class : CSE A, 3rd year

Let's start by importing the libraries that we will be using for analysis and visualization of data

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

The first step is to read the data from the CSV file

I will be using the first row of the CSV file as the column names

In [2]:

dataframe = pd.read_csv('assign.csv', header=0)

Let's see the rows and columns in our dataframe

In [3]:

dataframe.shape

Out[3]:

(1353, 10)

Cool. Now let's look at the first 5 rows to find out the type of data we are dealing with

In [4]:

dataframe.head()

Out[4]:

   SFH  popUpWindow  SSLfinalState  requestURL  URLOfAnchor  webTraffic  URLLength  ...
0    1           -1              1          -1           -1           1          1  ...
1   -@           -1             -1          -1           -1           0          1  ...
2    1           -1              0           0           -1           0         -1  ...
3    1            0              1          -1           -1           0          1  ...
4   -1           -1              1          -1            0           0         -1  ...

Looks like the values belong to {1, 0, -1} for the most part, just as described in the data file.

The dataframe also has some bad values like @, % etc.
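Before converting anything, a quick sketch like the following could pinpoint exactly which raw entries are non-numeric (this re-reads the CSV; the names raw and bad_mask are just illustrative):

# Re-read the raw file and flag entries that are non-null but fail numeric conversion
raw = pd.read_csv('assign.csv', header=0)
bad_mask = raw.apply(pd.to_numeric, errors='coerce').isna() & raw.notna()
raw[bad_mask.any(axis=1)]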

Let us look at the types of values our dataframe has

In [5]:

dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1353 entries, 0 to 1352
Data columns (total 10 columns):
SFH 1353 non-null object
popUpWindow 1353 non-null int64
SSLfinalState 1353 non-null object
requestURL 1353 non-null int64
URLOfAnchor 1353 non-null object
webTraffic 1353 non-null object
URLLength 1353 non-null int64
ageOfDomain 1353 non-null int64
havingIPAddress 1353 non-null int64
Result 1352 non-null object
dtypes: int64(5), object(5)
memory usage: 105.8+ KB

Hmm... it has mixed data types, some of which are non-numeric.

So let's convert the dataframe to numeric first; we will be replacing any non-numeric values with NaN

In [6]:

dataframe = dataframe.apply(pd.to_numeric, errors='coerce')

Now let's check the types again

In [7]:

dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1353 entries, 0 to 1352
Data columns (total 10 columns):
SFH 1352 non-null float64
popUpWindow 1353 non-null int64
SSLfinalState 1352 non-null float64
requestURL 1353 non-null int64
URLOfAnchor 1351 non-null float64
webTraffic 1352 non-null float64
URLLength 1353 non-null int64
ageOfDomain 1353 non-null int64
havingIPAddress 1353 non-null int64
Result 1350 non-null float64
dtypes: float64(5), int64(5)
memory usage: 105.8 KB

Nice, all values are numeric now, which is easier to work with.
So let's start the analysis. First, check for duplicate tuples

In [8]:

dataframe.duplicated().sum()

Out[8]:

626

WHAAAT? 626 duplicate rows!

That is almost half of our data... let's look at the duplicate rows first

In [9]:

dataframe.loc[dataframe.duplicated(), :].head()

Out[9]:

    SFH  popUpWindow  SSLfinalState  requestURL  URLOfAnchor  webTraffic  URLLength  ...
46  0.0           -1           -1.0           0          1.0         1.0          0  ...
55  1.0            0            1.0           1          1.0        -1.0          1  ...
56  1.0           -1            0.0           1          0.0         0.0         -1  ...
63  1.0            0            1.0           1          1.0        -1.0          1  ...
72  1.0            0            1.0           0         -1.0         1.0          0  ...

Hmm, these look normal, nothing out of the ordinary, so let's drop the duplicate rows

In [10]:

dataframe = dataframe.drop_duplicates(keep='first')

Let's see the shape now

In [11]:

dataframe.shape

Out[11]:

(727, 10)

Next step: Finding NaN values


In [12]:

dataframe.isnull().sum()

Out[12]:

SFH 1
popUpWindow 0
SSLfinalState 1
requestURL 0
URLOfAnchor 2
webTraffic 1
URLLength 0
ageOfDomain 0
havingIPAddress 0
Result 3
dtype: int64

Not many NaN values, but there are a few, with a fairly even spread across columns.

Let's see if any tuple has all of its values as NaN and drop those tuples (none does, but let's do it anyway)

In [13]:

dataframe.dropna(how='all').shape

Out[13]:

(727, 10)

Okay, so as expected, still the same shape. Let's see what happens if we drop rows which have any NaN value

In [14]:

dataframe.dropna().shape

Out[14]:

(720, 10)

We lose 7 tuples!

Let's calculate the percentage of NaN values

In [15]:

nanValues = dataframe.isnull().sum().sum()
totalValues = dataframe.shape[0] * dataframe.shape[1]
nanValuePer = (nanValues / totalValues) * 100

nanValuePer

Out[15]:

0.11004126547455295

0.11%! That's quite small! :D


Let's see these tuples with NaN values

In [16]:

dataframe.loc[dataframe.isnull().any(axis=1)]

Out[16]:

     SFH  popUpWindow  SSLfinalState  requestURL  URLOfAnchor  webTraffic  URLLength  ...
1    NaN           -1           -1.0          -1         -1.0         0.0          1  ...
2    1.0           -1            0.0           0         -1.0         0.0         -1  ...
27   1.0            0           -1.0           1          NaN         0.0         -1  ...
34   1.0           -1            1.0          -1          NaN        -1.0         -1  ...
37   0.0            0            1.0           0          0.0         NaN         -1  ...
78   1.0            0            1.0           1          0.0         0.0         -1  ...
286  1.0            0            NaN           0         -1.0         0.0          0  ...

Okay, let's see if we can somehow fill the values without messing up the dataframe; it's always better to fill a value than to drop it.

Let's start by looking at the distribution of values in the columns that have NaNs...

In [17]:

tempList = []
for column in dataframe.columns[dataframe.isna().any()]:
    temp = dataframe[column].value_counts()
    tempList.append(temp)

for _ in tempList:
    print(_)

1.0 388
-1.0 253
0.0 85
Name: SFH, dtype: int64
1.0 367
0.0 188
-1.0 171
Name: SSLfinalState, dtype: int64
-1.0 309
1.0 291
0.0 125
Name: URLOfAnchor, dtype: int64
0.0 263
1.0 259
-1.0 204
Name: webTraffic, dtype: int64
-1.0 354
1.0 309
0.0 61
Name: Result, dtype: int64

Okay, a lot of info here, but from what we can see, the values are mostly evenly distributed between two or three categories, so we can't fill the NaN values with the most frequent one.
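As a quick numeric check (a minimal sketch), we can print the share held by each affected column's most frequent value; top_share is just an illustrative name:

# Share of the most frequent value in each column that contains NaNs
for column in dataframe.columns[dataframe.isna().any()]:
    top_share = dataframe[column].value_counts(normalize=True).iloc[0]
    print(column, round(top_share, 2))

To take a better look, let's plot the values of these columns as histograms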

In [18]:

dataframe[dataframe.columns[dataframe.isna().any()]].plot(kind='hist', subplots=True, layout=(3, 2))

Out[18]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000143B4C99B00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000143B6D1DE80>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000143B6D642E8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000143B6D77630>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000143B6DC4048>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000143B6DC4080>]],
      dtype=object)

As we can see, even though some columns lean heavily toward the extremes of -1 and 1 with few 0s, there isn't a single value that dominates.

Hence we can't fill the NaNs with the most dominant/frequent value.

And since this data is categorical, filling with the mean would only mess up the data: a mean like 0.18 (SFH's) is not a valid category.

So our only option is to drop the rows with NaN values. Yes, we lose 7 tuples, but at least the rest of the data will be of good quality.

In [19]:

dataframe = dataframe.dropna(how='any')
dataframe.shape

Out[19]:

(720, 10)

Okay, so finally we have a dataframe with 720 rows and 10 columns.

Next we need to see if our data has any noise/outliers.

Let's use the .describe() method for that

In [20]:

dataframe.describe()

Out[20]:

              SFH  popUpWindow  SSLfinalState  requestURL  URLOfAnchor  webTraffic  ...
count  720.000000   720.000000     720.000000  720.000000   720.000000  720.000000  ...
mean     0.180556    -0.283333       0.270833   -0.187500    -0.020833    0.077778  ...
std      0.922993     0.713666       0.817224    0.794354     0.910980    0.797810  ...
min     -1.000000    -1.000000      -1.000000   -1.000000    -1.000000   -1.000000  ...
25%     -1.000000    -1.000000       0.000000   -1.000000    -1.000000   -1.000000  ...
50%      1.000000     0.000000       1.000000    0.000000     0.000000    0.000000  ...
75%      1.000000     0.000000       1.000000    0.000000     1.000000    1.000000  ...
max      1.000000     1.000000       1.000000    1.000000     1.000000    1.000000  ...

The values are in the range -1 to 1, so there aren't any outliers. This also shows that our data is already normalized.
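As a quick sanity check (a one-line sketch), we can confirm that every remaining value really is one of the three categories:

# True only if every value in every column is one of the three expected categories
dataframe.isin([-1, 0, 1]).all().all()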

Let's draw a boxplot to see the distribution of the data visually

In [21]:

dataframe.plot(kind='box', figsize=(15, 5))

Out[21]:

<matplotlib.axes._subplots.AxesSubplot at 0x143b6fbaef0>

Interesting: most of the columns have a relatively high spread, except for the havingIPAddress column. This graph agrees with the stats just above it.

In the case of havingIPAddress, we can see most of the values are 0 except for a small minority.
If we had some NaN values in the havingIPAddress column, we could have replaced them with 0, but since that is not the case, let's move on.
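For reference, a minimal sketch of what that hypothetical fill would look like (not actually needed here, since havingIPAddress has no missing entries):

# Hypothetical: fill NaNs in a heavily skewed column with its most frequent value
mode_value = dataframe['havingIPAddress'].mode()[0]
dataframe['havingIPAddress'] = dataframe['havingIPAddress'].fillna(mode_value)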

Now let's find out the correlation between columns to see whether we should drop some of them

In [22]:

dataframe.corr()

Out[22]:

                      SFH  popUpWindow  SSLfinalState  requestURL  URLOfAnchor  webTraffic  ...
SFH              1.000000     0.299472       0.171096    0.165747     0.232747   -0.079537  ...
popUpWindow      0.299472     1.000000       0.129370    0.070534     0.102151   -0.056509  ...
SSLfinalState    0.171096     0.129370       1.000000   -0.043787    -0.018565   -0.079284  ...
requestURL       0.165747     0.070534      -0.043787    1.000000     0.292501   -0.029627  ...
URLOfAnchor      0.232747     0.102151      -0.018565    0.292501     1.000000   -0.022645  ...
webTraffic      -0.079537    -0.056509      -0.079284   -0.029627    -0.022645    1.000000  ...
URLLength        0.045743     0.104208      -0.006754    0.001097     0.015626   -0.074040  ...
ageOfDomain      0.048322    -0.009622       0.123301    0.014654     0.027233   -0.614813  ...
havingIPAddress  0.039435     0.139066       0.156398    0.000591     0.014253   -0.113783  ...
Result          -0.567642    -0.464606      -0.354180   -0.193243    -0.170900    0.126826  ...

As we can see, there isn't much correlation between most columns, with a few exceptions: SFH and popUpWindow are negatively correlated with Result, and webTraffic and ageOfDomain are fairly strongly negatively correlated (about -0.61).

The negative correlations with Result mean that when SFH and popUpWindow are present, it is more likely that the website is a phishing website.

Which makes sense, as an SFH can be used to send the user's information back to a server, and pop-up windows can be used to submit information.
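To make these relationships easier to scan, here is a quick heatmap sketch using the matplotlib we imported earlier (the figure size and colormap are arbitrary choices):

# Visualize the correlation matrix as a heatmap
corr = dataframe.corr()
plt.figure(figsize=(8, 6))
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(label='correlation')
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.show()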

Okay, that is enough analysis; we have processed the data and derived some insight from it.

Now let's build a model to fit to this data

First I'll import some classes and functions to make our lives easier in building the model and splitting the data

In [23]:

from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split

Next, let's split the data into train and test sets.

I will be using an 80:20 split between train and test data


In [24]:

train_x, test_x, train_y, test_y = train_test_split(dataframe.iloc[:, :-1], dataframe.iloc[:, -1], test_size=0.2)
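Note that without a fixed seed this split, and hence the final accuracy, will vary from run to run. A reproducible variant would look like this sketch (the seed value is arbitrary):

# Same split, but with a fixed seed so results are reproducible
train_x, test_x, train_y, test_y = train_test_split(
    dataframe.iloc[:, :-1],  # features: every column except Result
    dataframe.iloc[:, -1],   # target: the Result column
    test_size=0.2,           # the 80:20 split described above
    random_state=42,         # arbitrary seed, for illustration only
)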

Let's pick our model. Since Result takes three values (-1, 0, 1), this is a multiclass classification problem, and I will pick the simplest model for classification:

LogisticRegression

In [25]:

clf_logistic = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg')

Let's fit the model to our training data!

In [26]:

clf_logistic = clf_logistic.fit(train_x, train_y)

OKAY, MOMENT OF TRUTH. Let's see how our model performs!!

In [27]:

metrics.accuracy_score(test_y, clf_logistic.predict(test_x))

Out[27]:

0.7777777777777778

Hmm, not too bad but not too great either; this suggests we are lacking features and data.

We would need many more features/attributes to explain whether a site is phishing or not.
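For a finer-grained view than a single accuracy number, a short sketch with the metrics module we already imported can show per-class performance (predictions is just a local name):

# Per-class breakdown of the classifier's performance on the test set
predictions = clf_logistic.predict(test_x)
print(metrics.confusion_matrix(test_y, predictions))
print(metrics.classification_report(test_y, predictions))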

Anyway, this concludes the assignment.
