
Data Mining Assignment #1

Submitted By : Rahul Kumar

Roll No : 160BTCCSE010

Class : CSE A, 3rd year

Let's start by importing the libraries that we will be using for analysis and visualization of data

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

The first step is to read the data from the CSV file

I will be using the first row of the CSV file as the column names

In [2]:

dataframe = pd.read_csv('assign.csv', header=0)

Let's see the rows and columns in our dataframe

In [3]:

dataframe.shape

Out[3]:

(1353, 10)

Cool. Now let's look at the first 5 rows to find out the type of data we are dealing with

In [4]:

dataframe.head()

Out[4]:

   SFH  popUpWindow  SSLfinalState  requestURL  URLOfAnchor  webTraffic  URLLength  ...
0    1           -1              1          -1           -1           1          1  ...
1   -@           -1             -1          -1           -1           0          1  ...
2    1           -1              0           0           -1           0         -1  ...
3    1            0              1          -1           -1           0          1  ...
4   -1           -1              1          -1            0           0         -1  ...

Looks like the values belong to {1, 0, -1} for the most part, just as described in the data file.

The dataframe also has some bad values like @, % etc.
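Before converting anything, a quick sketch like the following could pinpoint exactly which raw entries are non-numeric (this re-reads the CSV; the names raw and bad_mask are just illustrative):

# Re-read the raw file and flag entries that are non-null but fail numeric conversion
raw = pd.read_csv('assign.csv', header=0)
bad_mask = raw.apply(pd.to_numeric, errors='coerce').isna() & raw.notna()
raw[bad_mask.any(axis=1)]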

Let us look at the types of values our dataframe has

In [5]:

dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1353 entries, 0 to 1352
Data columns (total 10 columns):
SFH 1353 non-null object
popUpWindow 1353 non-null int64
SSLfinalState 1353 non-null object
requestURL 1353 non-null int64
URLOfAnchor 1353 non-null object
webTraffic 1353 non-null object
URLLength 1353 non-null int64
ageOfDomain 1353 non-null int64
havingIPAddress 1353 non-null int64
Result 1352 non-null object
dtypes: int64(5), object(5)
memory usage: 105.8+ KB

Hmm... it has mixed data types, some of which are non-numeric.

So let's convert the dataframe to numeric first; we will be replacing any non-numeric values with NaN

In [6]:

dataframe = dataframe.apply(pd.to_numeric, errors='coerce')

Now let's check the types again

In [7]:

dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1353 entries, 0 to 1352
Data columns (total 10 columns):
SFH 1352 non-null float64
popUpWindow 1353 non-null int64
SSLfinalState 1352 non-null float64
requestURL 1353 non-null int64
URLOfAnchor 1351 non-null float64
webTraffic 1352 non-null float64
URLLength 1353 non-null int64
ageOfDomain 1353 non-null int64
havingIPAddress 1353 non-null int64
Result 1350 non-null float64
dtypes: float64(5), int64(5)
memory usage: 105.8 KB

Nice, all values are numeric now, which is easier to work with.
So let's start the analysis. First, check for duplicate tuples

In [8]:

dataframe.duplicated().sum()

Out[8]:

626

WHAAAT? 626 duplicate rows!

That is almost half of our data... let's look at the duplicate rows first

In [9]:

dataframe.loc[dataframe.duplicated(), :].head()

Out[9]:

    SFH  popUpWindow  SSLfinalState  requestURL  URLOfAnchor  webTraffic  URLLength  ...
46  0.0           -1           -1.0           0          1.0         1.0          0  ...
55  1.0            0            1.0           1          1.0        -1.0          1  ...
56  1.0           -1            0.0           1          0.0         0.0         -1  ...
63  1.0            0            1.0           1          1.0        -1.0          1  ...
72  1.0            0            1.0           0         -1.0         1.0          0  ...

Hmm, these look normal, nothing out of the ordinary, so let's drop the duplicate rows

In [10]:

dataframe = dataframe.drop_duplicates(keep='first')

Let's see the shape now

In [11]:

dataframe.shape

Out[11]:

(727, 10)

Next step: Finding NaN values


In [12]:

dataframe.isnull().sum()

Out[12]:

SFH 1
popUpWindow 0
SSLfinalState 1
requestURL 0
URLOfAnchor 2
webTraffic 1
URLLength 0
ageOfDomain 0
havingIPAddress 0
Result 3
dtype: int64

Not many NaN values, but there are a few, with a fairly even spread across columns.

Let's see if any tuple has all of its values as NaN and drop those tuples (none does, but let's do it anyway)

In [13]:

dataframe.dropna(how='all').shape

Out[13]:

(727, 10)

Okay, so as expected, still the same shape. Let's see what happens if we drop rows which have any NaN value

In [14]:

dataframe.dropna().shape

Out[14]:

(720, 10)

We lose 7 tuples!

Let's calculate the percentage of NaN values

In [15]:

nanValues = dataframe.isnull().sum().sum()
totalValues = dataframe.shape[0] * dataframe.shape[1]
nanValuePer = (nanValues / totalValues) * 100

nanValuePer

Out[15]:

0.11004126547455295

0.11%! That's quite small! :D


Let's see these tuples with NaN values

In [16]:

dataframe.loc[dataframe.isnull().any(axis=1)]

Out[16]:

     SFH  popUpWindow  SSLfinalState  requestURL  URLOfAnchor  webTraffic  URLLength  ...
1    NaN           -1           -1.0          -1         -1.0         0.0          1  ...
2    1.0           -1            0.0           0         -1.0         0.0         -1  ...
27   1.0            0           -1.0           1          NaN         0.0         -1  ...
34   1.0           -1            1.0          -1          NaN        -1.0         -1  ...
37   0.0            0            1.0           0          0.0         NaN         -1  ...
78   1.0            0            1.0           1          0.0         0.0         -1  ...
286  1.0            0            NaN           0         -1.0         0.0          0  ...

Okay, let's see if we can somehow fill the values without messing up the dataframe; it's always better to fill a value than to drop it.

Let's start by looking at the distribution of values in the columns that have NaNs...

In [17]:

tempList = []
for column in dataframe.columns[dataframe.isna().any()]:
    temp = dataframe[column].value_counts()
    tempList.append(temp)

for _ in tempList:
    print(_)

1.0 388
-1.0 253
0.0 85
Name: SFH, dtype: int64
1.0 367
0.0 188
-1.0 171
Name: SSLfinalState, dtype: int64
-1.0 309
1.0 291
0.0 125
Name: URLOfAnchor, dtype: int64
0.0 263
1.0 259
-1.0 204
Name: webTraffic, dtype: int64
-1.0 354
1.0 309
0.0 61
Name: Result, dtype: int64

Okay, a lot of info here, but from what we can see, the values are mostly evenly distributed between two or three categories, so we can't fill the NaN values with the most frequent one.
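As a quick numeric check (a minimal sketch), we can print the share held by each affected column's most frequent value; top_share is just an illustrative name:

# Share of the most frequent value in each column that contains NaNs
for column in dataframe.columns[dataframe.isna().any()]:
    top_share = dataframe[column].value_counts(normalize=True).iloc[0]
    print(column, round(top_share, 2))

To take a better look, let's plot the values of these columns as histograms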

In [18]:

dataframe[dataframe.columns[dataframe.isna().any()]].plot(kind='hist', subplots=True, layout=(3, 2))

Out[18]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000143B4C99B00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000143B6D1DE80>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000143B6D642E8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000143B6D77630>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000143B6DC4048>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000143B6DC4080>]],
      dtype=object)

As we can see, even though some columns lean heavily toward the extremes of -1 and 1 with few 0s, there isn't a single value that dominates.

Hence we can't fill the NaNs with the most dominant/frequent value.

And since this data is categorical, filling with the mean would only mess up the data: a mean like 0.18 (SFH's) is not a valid category.

So our only option is to drop the rows with NaN values. Yes, we lose 7 tuples, but at least the rest of the data will be of good quality.

In [19]:

dataframe = dataframe.dropna(how='any')
dataframe.shape

Out[19]:

(720, 10)

Okay, so finally we have a dataframe with 720 rows and 10 columns.

Next we need to see if our data has any noise/outliers.

Let's use the .describe() method for that

In [20]:

dataframe.describe()

Out[20]:

              SFH  popUpWindow  SSLfinalState  requestURL  URLOfAnchor  webTraffic  ...
count  720.000000   720.000000     720.000000  720.000000   720.000000  720.000000  ...
mean     0.180556    -0.283333       0.270833   -0.187500    -0.020833    0.077778  ...
std      0.922993     0.713666       0.817224    0.794354     0.910980    0.797810  ...
min     -1.000000    -1.000000      -1.000000   -1.000000    -1.000000   -1.000000  ...
25%     -1.000000    -1.000000       0.000000   -1.000000    -1.000000   -1.000000  ...
50%      1.000000     0.000000       1.000000    0.000000     0.000000    0.000000  ...
75%      1.000000     0.000000       1.000000    0.000000     1.000000    1.000000  ...
max      1.000000     1.000000       1.000000    1.000000     1.000000    1.000000  ...

The values are in the range -1 to 1, so there aren't any outliers. This also shows that our data is already normalized.
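As a quick sanity check (a one-line sketch), we can confirm that every remaining value really is one of the three categories:

# True only if every value in every column is one of the three expected categories
dataframe.isin([-1, 0, 1]).all().all()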

Let's draw a boxplot to see the distribution of the data visually

In [21]:

dataframe.plot(kind='box', figsize=(15, 5))

Out[21]:

<matplotlib.axes._subplots.AxesSubplot at 0x143b6fbaef0>

Interesting: most of the columns have a relatively high spread, except for the havingIPAddress column. This graph agrees with the stats just above it.

In the case of havingIPAddress, we can see most of the values are 0 except for a small minority.
If we had some NaN values in the havingIPAddress column, we could have replaced them with 0, but since that is not the case, let's move on.
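For reference, a minimal sketch of what that hypothetical fill would look like (not actually needed here, since havingIPAddress has no missing entries):

# Hypothetical: fill NaNs in a heavily skewed column with its most frequent value
mode_value = dataframe['havingIPAddress'].mode()[0]
dataframe['havingIPAddress'] = dataframe['havingIPAddress'].fillna(mode_value)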

Now let's find out the correlation between columns to see whether we should drop some of them

In [22]:

dataframe.corr()

Out[22]:

                      SFH  popUpWindow  SSLfinalState  requestURL  URLOfAnchor  webTraffic  ...
SFH              1.000000     0.299472       0.171096    0.165747     0.232747   -0.079537  ...
popUpWindow      0.299472     1.000000       0.129370    0.070534     0.102151   -0.056509  ...
SSLfinalState    0.171096     0.129370       1.000000   -0.043787    -0.018565   -0.079284  ...
requestURL       0.165747     0.070534      -0.043787    1.000000     0.292501   -0.029627  ...
URLOfAnchor      0.232747     0.102151      -0.018565    0.292501     1.000000   -0.022645  ...
webTraffic      -0.079537    -0.056509      -0.079284   -0.029627    -0.022645    1.000000  ...
URLLength        0.045743     0.104208      -0.006754    0.001097     0.015626   -0.074040  ...
ageOfDomain      0.048322    -0.009622       0.123301    0.014654     0.027233   -0.614813  ...
havingIPAddress  0.039435     0.139066       0.156398    0.000591     0.014253   -0.113783  ...
Result          -0.567642    -0.464606      -0.354180   -0.193243    -0.170900    0.126826  ...

As we can see, there isn't much correlation between most columns, with a few exceptions: SFH and popUpWindow are negatively correlated with Result, and webTraffic and ageOfDomain are fairly strongly negatively correlated (about -0.61).

The negative correlations with Result mean that when SFH and popUpWindow are present, it is more likely that the website is a phishing website.

Which makes sense, as an SFH can be used to send the user's information back to a server, and pop-up windows can be used to submit information.
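To make these relationships easier to scan, here is a quick heatmap sketch using the matplotlib we imported earlier (the figure size and colormap are arbitrary choices):

# Visualize the correlation matrix as a heatmap
corr = dataframe.corr()
plt.figure(figsize=(8, 6))
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(label='correlation')
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.show()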

Okay, that is enough analysis; we have processed the data and derived some insight from it.

Now let's build a model to fit to this data

First I'll import some classes and functions to make our lives easier in building the model and splitting the data

In [23]:

from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split

Next, let's split the data into train and test sets.

I will be using an 80:20 split between train and test data


In [24]:

train_x, test_x, train_y, test_y = train_test_split(dataframe.iloc[:, :-1], dataframe.iloc[:, -1], test_size=0.2)
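Note that without a fixed seed this split, and hence the final accuracy, will vary from run to run. A reproducible variant would look like this sketch (the seed value is arbitrary):

# Same split, but with a fixed seed so results are reproducible
train_x, test_x, train_y, test_y = train_test_split(
    dataframe.iloc[:, :-1],  # features: every column except Result
    dataframe.iloc[:, -1],   # target: the Result column
    test_size=0.2,           # the 80:20 split described above
    random_state=42,         # arbitrary seed, for illustration only
)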

Let's pick our model. Since Result takes three values (-1, 0, 1), this is a multiclass classification problem, and I will pick the simplest model for classification:

LogisticRegression

In [25]:

clf_logistic = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg')

Let's fit the model to our training data!

In [26]:

clf_logistic = clf_logistic.fit(train_x, train_y)

OKAY, MOMENT OF TRUTH. Let's see how our model performs!!

In [27]:

metrics.accuracy_score(test_y, clf_logistic.predict(test_x))

Out[27]:

0.7777777777777778

Hmm, not too bad but not too great either; this suggests we are lacking features and data.

We would need many more features/attributes to explain whether a site is phishing or not.
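For a finer-grained view than a single accuracy number, a short sketch with the metrics module we already imported can show per-class performance (predictions is just a local name):

# Per-class breakdown of the classifier's performance on the test set
predictions = clf_logistic.predict(test_x)
print(metrics.confusion_matrix(test_y, predictions))
print(metrics.classification_report(test_y, predictions))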

Anyway, this concludes the assignment.
