Roll No : 160BTCCSE010
Let's start by importing the libraries we will be using for data analysis and visualization.
In [1]:
1 import numpy as np
2 import pandas as pd
3 import matplotlib.pyplot as plt
4 %matplotlib inline
I will be using the first row of the CSV file as the column names.
In [2]:
1 dataframe = pd.read_csv('assign.csv',header=0)
In [3]:
1 dataframe.shape
Out[3]:
(1353, 10)
Cool. Now let's look at the first 5 rows to see what kind of data we are dealing with.
In [4]:
1 dataframe.head()
Out[4]:
[head() output: column headers and some columns were lost in this printout; the visible values are mostly in {-1, 0, 1}, with stray non-numeric entries such as '@']
Looks like the values belong to {1, 0, -1} for the most part, just as described in the data file.
The dataframe also has some bad values, such as '@' and '%'.
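Before coercing everything to numeric, it can help to pinpoint exactly which cells are non-numeric. A quick sketch (using a tiny toy frame in place of the real assign.csv, so the column contents here are illustrative):

```python
import pandas as pd

# Toy stand-in for the raw dataframe; the bad entries mimic '@' and '%'
df = pd.DataFrame({'SFH': ['1', '-@', '0'],
                   'popUpWindow': ['-1', '1', '%']})

# Entries that fail numeric conversion become NaN after coercion;
# comparing the coerced NaN mask with the original non-null mask
# pinpoints exactly which cells were bad
coerced = df.apply(pd.to_numeric, errors='coerce')
bad_cells = coerced.isna() & df.notna()
print(bad_cells.sum())  # count of bad entries per column
```

This is only a diagnostic; the coercion step below handles the replacement itself.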
In [5]:
1 dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1353 entries, 0 to 1352
Data columns (total 10 columns):
SFH 1353 non-null object
popUpWindow 1353 non-null int64
SSLfinalState 1353 non-null object
requestURL 1353 non-null int64
URLOfAnchor 1353 non-null object
webTraffic 1353 non-null object
URLLength 1353 non-null int64
ageOfDomain 1353 non-null int64
havingIPAddress 1353 non-null int64
Result 1352 non-null object
dtypes: int64(5), object(5)
memory usage: 105.8+ KB
Hmm... it has mixed data types, some of which are non-numeric.
So let's convert the dataframe to numeric first; we will replace any non-numeric values with NaN.
In [6]:
1 dataframe = dataframe.apply(pd.to_numeric,errors='coerce')
In [7]:
1 dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1353 entries, 0 to 1352
Data columns (total 10 columns):
SFH 1352 non-null float64
popUpWindow 1353 non-null int64
SSLfinalState 1352 non-null float64
requestURL 1353 non-null int64
URLOfAnchor 1351 non-null float64
webTraffic 1352 non-null float64
URLLength 1353 non-null int64
ageOfDomain 1353 non-null int64
havingIPAddress 1353 non-null int64
Result 1350 non-null float64
dtypes: float64(5), int64(5)
memory usage: 105.8 KB
Nice, all values are numeric now, which is easier to work with.
So let's start the analysis. First, check for duplicate tuples.
In [8]:
1 dataframe.duplicated().sum()
Out[8]:
626
That is almost half of our data! Let's look at the duplicate rows first.
In [9]:
1 dataframe.loc[dataframe.duplicated(),:].head()
Out[9]:
Hmm, these look normal, nothing out of the ordinary, so let's drop the duplicates.
In [10]:
1 dataframe = dataframe.drop_duplicates(keep='first')
In [11]:
1 dataframe.shape
Out[11]:
(727, 10)
In [12]:
1 dataframe.isnull().sum()
Out[12]:
SFH 1
popUpWindow 0
SSLfinalState 1
requestURL 0
URLOfAnchor 2
webTraffic 1
URLLength 0
ageOfDomain 0
havingIPAddress 0
Result 3
dtype: int64
Not many NaN values, but there are a few, with a fairly even spread across columns.
Let's check whether any tuple has all of its values as NaN and drop those tuples (none do, but let's do it anyway).
In [13]:
1 dataframe.dropna(how='all').shape
Out[13]:
(727, 10)
Okay, so as expected, the shape is unchanged. Let's see what happens if we drop every row that has any NaN value.
In [14]:
1 dataframe.dropna().shape
Out[14]:
(720, 10)
We lose 7 tuples!
In [15]:
1 nanValues = dataframe.isnull().sum().sum()
2 totalValues = dataframe.shape[0]*dataframe.shape[1]
3 nanValuePer = (nanValues / totalValues) * 100
4
5 nanValuePer
Out[15]:
0.11004126547455295
In [16]:
1 dataframe.loc[dataframe.isnull().any(axis=1)]
Out[16]:
Okay, let's see if we can fill in the missing values without distorting the dataframe; it is always better to fill a value than to drop a row.
Let's start by looking at the distribution of values in the columns that have NaNs.
In [17]:
1 tempList = []
2 for column in dataframe.columns[dataframe.isna().any()]:
3     temp = dataframe[column].value_counts()
4     tempList.append(temp)
5
6 for _ in tempList:
7     print(_)
1.0 388
-1.0 253
0.0 85
Name: SFH, dtype: int64
1.0 367
0.0 188
-1.0 171
Name: SSLfinalState, dtype: int64
-1.0 309
1.0 291
0.0 125
Name: URLOfAnchor, dtype: int64
0.0 263
1.0 259
-1.0 204
Name: webTraffic, dtype: int64
-1.0 354
1.0 309
0.0 61
Name: Result, dtype: int64
Okay, a lot of info here, but from what we can see, the values are mostly evenly distributed between 2 or 3 categories.
So we can't just fill the NaN values with the most frequent one. To take a better look, let's plot the column values as histograms.
In [18]:
1 dataframe[dataframe.columns[dataframe.isna().any()]].plot(kind='hist', subplots=True, layout=(3, 2), figsize=(15, 10))
Out[18]:
As we can see, even though some columns lean heavily toward the extremes of -1 and 1 with few 0s, no single value dominates.
Since this data is categorical, filling with the mean would only corrupt it.
So our only option is to drop the rows with NaN values; yes, we lose 7 tuples, but at least the remaining data will be of good quality.
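For contrast, here is what mode-filling would have looked like had one category clearly dominated a column. The toy series below is illustrative only, not from our dataset:

```python
import numpy as np
import pandas as pd

# Toy column where one category (1.0) clearly dominates -- unlike our data,
# where the categories are roughly evenly split
s = pd.Series([1.0, 1.0, 1.0, 1.0, -1.0, np.nan])

# Filling with the mode is only defensible in this skewed situation
filled = s.fillna(s.mode()[0])
print(filled)
```

Since none of our NaN-bearing columns are skewed like this, dropping remains the safer choice.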
In [19]:
1 dataframe = dataframe.dropna(how='any')
2 dataframe.shape
Out[19]:
(720, 10)
Okay, so we finally have a dataframe with 720 rows and 10 columns.
In [20]:
1 dataframe.describe()
Out[20]:
The values all lie in the range -1 to 1, so there are no outliers. This also shows that our data is already normalized.
In [21]:
1 dataframe.plot(kind='box',figsize=(15,5))
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x143b6fbaef0>
Interesting: most of the columns have a relatively high spread, except for the havingIPAddress column; this plot agrees with the stats just above it.
In the case of havingIPAddress, most of the values are 0, apart from a small minority.
If we had NaN values in the havingIPAddress column, we could have replaced them with 0, but since that is not the case, let's move on.
Now let's compute the correlation between columns to see whether we should drop any of them.
In [22]:
1 dataframe.corr()
Out[22]:
As we can see, there isn't much correlation between the columns, except in a few cases, such as SFH and popUpWindow being negatively correlated with Result.
This means that whenever SFH or popUpWindow is present, the website is more likely to be a phishing website.
Which makes sense: an SFH can be used to send the user's information back to a server, and pop-up windows can be used to solicit information.
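The raw corr() table is easier to scan as a heatmap. A sketch using matplotlib's matshow, shown here on a toy frame (column names are borrowed from the dataset, but the data is random, so the pattern is illustrative only):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Toy frame with a few of the dataset's columns; values drawn at random
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.choice([-1, 0, 1], size=(100, 4)),
                    columns=['SFH', 'popUpWindow', 'webTraffic', 'Result'])
corr = demo.corr()

# Render the correlation matrix as a colored grid, fixed to [-1, 1]
fig, ax = plt.subplots()
im = ax.matshow(corr, vmin=-1, vmax=1, cmap='coolwarm')
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
```

On the real dataframe the same pattern would make the SFH/popUpWindow vs. Result cells stand out visually.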
Okay, that is enough analysis; we have processed the data and derived some insight from it.
First I'll import some classes and functions to make our lives easier for building the model and splitting the data.
In [23]:
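The contents of this cell (and of the train/test split that must follow it, since train_x and train_y appear below) did not survive in this printout. A plausible reconstruction, sketched on a toy stand-in frame so the snippet runs standalone; the 80/20 split and random_state are assumptions, though 20% of 720 rows gives a 144-tuple test set, which is consistent with the reported accuracy being a multiple of 1/144:

```python
import numpy as np
import pandas as pd
from sklearn import linear_model, metrics   # used by the cells below
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned dataframe (720 rows x 10 columns in the notebook)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.choice([-1, 0, 1], size=(720, 10)),
                   columns=['SFH', 'popUpWindow', 'SSLfinalState', 'requestURL',
                            'URLOfAnchor', 'webTraffic', 'URLLength',
                            'ageOfDomain', 'havingIPAddress', 'Result'])

# Result is the label; everything else is a feature
features = toy.drop('Result', axis=1)
target = toy['Result']

# Assumed 80/20 split -- leaves 144 test tuples out of 720
train_x, test_x, train_y, test_y = train_test_split(
    features, target, test_size=0.2, random_state=42)
```

In the notebook, `toy` would of course be the cleaned `dataframe`.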
Let's pick our model. Since this is a multiclass classification problem, I will pick the simplest classification model:
LogisticRegression.
In [25]:
1 clf_logistic = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg')
In [26]:
1 clf_logistic = clf_logistic.fit(train_x,train_y)
In [27]:
1 metrics.accuracy_score(test_y,clf_logistic.predict(test_x))
Out[27]:
0.7777777777777778
Hmm, not too bad but not too great either; this suggests we are short on features and data.
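A single accuracy number also hides which class the model struggles with; a confusion matrix is a cheap next diagnostic. A sketch on toy labels (the arrays below are illustrative, not the notebook's actual predictions):

```python
from sklearn import metrics

# Toy true/predicted labels over the same {-1, 0, 1} classes as Result
y_true = [-1, -1, 0, 1, 1, 1]
y_pred = [-1, 1, 0, 1, 1, -1]

# Rows are true classes, columns are predicted classes, in the order given
cm = metrics.confusion_matrix(y_true, y_pred, labels=[-1, 0, 1])
print(cm)
print(metrics.accuracy_score(y_true, y_pred))
```

On the real test set, off-diagonal counts would show whether the ~22% of errors concentrate in one class, which in turn hints at which features to add.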