
1. Description of dataset
The dataset contains sale price information for 21613 sales records in King County, USA. It
contains the price (response variable) and many predictors (e.g. bathrooms, bedrooms, zipcode,
sqft_living). The data is stored in a CSV file, which was imported into SQL Server using its
import functionality.
The data was obtained from the Kaggle website, where it can be downloaded.

2. Description of columns
There are 21 variables in the dataset. Three additional columns were created using SQL:
UniqueID as the primary key, ageHouse to hold the age of the house at the time of sale, and
HouseType to categorize houses based on sqft_living area. The columns are described below.

No.  Name           Datatype     Manually created?  Comments                                     Additional comments
1    UniqueID       int          YES                Primary key, identity
2    id             float        NO                 ID for individual house                      Multiple records for the same house, as houses are sold more than once
3    date           date         NO                 Date house was sold                          Converted from nvarchar(255) to date using SQL
4    price          float        NO                 Response variable
5    bedrooms       float        NO                 Number of bedrooms/house
6    bathrooms      float        NO                 Number of bathrooms/bedrooms
7    sqft_living    float        NO                 Square footage of the home
8    sqft_lot       float        NO                 Square footage of the lot
9    floors         float        NO                 Total floors (levels) in house
10   waterfront     float        NO                 House which has a view to a waterfront
11   view           float        NO                 Has been viewed
12   condition      float        NO                 How good the condition is (overall)
13   grade          float        NO                 Overall grade given to the housing unit
14   sqft_above     float        NO                 Square footage of house apart from basement
15   sqft_basement  float        NO                 Square footage of the basement
16   yr_built       float        NO                 Year built
17   yr_renovated   float        NO                 Year when house was renovated
18   zipcode        float        NO                 Zip
19   lat            float        NO                 Latitude coordinate
20   long           float        NO                 Longitude coordinate
21   sqft_living15  float        NO                 Living room area in 2015
22   sqft_lot15     float        NO                 Lot size area in 2015
23   ageHouse       float        YES                Age of house at time of sale                 Computed as year(date) - yr_built
24   HouseType      varchar(6)   YES                House type based on area                     Computed based on sqft_living

3. Data Normalization
The data is not normalized: all values are contained in a single table, and the columns have
high cardinality. Still, the table represents one entity, and separating it (normalizing) would
not be very beneficial.
Procedure to normalize the data:
If the data volume increases and many houses are sold again and again, the columns whose values
never change over the years should be moved to a separate table. The tables can be connected
through the id column, which is the individual house id.
Columns to be separated:
waterfront, yr_built, zipcode, long, lat
The table can be normalized using E. F. Codd's normal forms (removing redundancy, removing
unwanted dependencies, etc.).
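
The split described above can be sketched as follows. This is only an illustration in Python (the original work was done in SQL Server); the sample rows, the function name split_tables and the consistency check are assumptions, not part of the original analysis.

```python
# Hypothetical sample rows; the values are illustrative, not taken from the dataset.
flat_rows = [
    {"id": 1001, "date": "2014-05-02", "price": 300000.0, "waterfront": 0,
     "yr_built": 1990, "zipcode": 98001, "lat": 47.31, "long": -122.21},
    {"id": 1001, "date": "2015-03-10", "price": 340000.0, "waterfront": 0,
     "yr_built": 1990, "zipcode": 98001, "lat": 47.31, "long": -122.21},
    {"id": 1002, "date": "2014-07-21", "price": 900000.0, "waterfront": 1,
     "yr_built": 2005, "zipcode": 98004, "lat": 47.62, "long": -122.20},
]

# Columns the report proposes to move into a separate per-house table.
STATIC_COLUMNS = ("waterfront", "yr_built", "zipcode", "long", "lat")


def split_tables(rows):
    """Split flat sale records into a house table (keyed by id) and a sales table."""
    houses = {}
    sales = []
    for row in rows:
        static = {col: row[col] for col in STATIC_COLUMNS}
        # Store the static attributes once per house id.
        stored = houses.setdefault(row["id"], static)
        if stored != static:
            # The split only makes sense if these attributes never change.
            raise ValueError(f"static attributes changed for house {row['id']}")
        # Each sale keeps id (the link to the house table) plus the per-sale columns.
        sales.append({k: v for k, v in row.items() if k not in STATIC_COLUMNS})
    return houses, sales


houses, sales = split_tables(flat_rows)
```

Here houses holds one entry per house id, while sales keeps one row per sale and refers back to the house through id.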

4. Problems/Issues with data:
While importing the data from Excel into SQL Server, the date column was imported as
nvarchar(255). Also, there was no primary key. Although the house id identifies each house
separately, it is not unique, as many houses were sold more than once. I altered the date
column from nvarchar to date and also created a primary key column named UniqueID. The code
is included in the .sql file.
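
A minimal sketch of these two fixes, written in Python rather than the SQL used in the original .sql file. The raw date format 'YYYYMMDDTHHMMSS' is an assumption about how the text column looked, and the sample rows and the helper name parse_sale_date are hypothetical.

```python
from datetime import date, datetime


def parse_sale_date(raw: str) -> date:
    """Convert a text date to a real date (assumed raw format: 'YYYYMMDDTHHMMSS')."""
    return datetime.strptime(raw, "%Y%m%dT%H%M%S").date()


# Hypothetical raw rows: the house id repeats, so it cannot serve as a primary key.
raw_rows = [
    {"id": 1001, "date": "20141013T000000"},
    {"id": 1001, "date": "20150225T000000"},
    {"id": 1002, "date": "20141209T000000"},
]

# enumerate(start=1) mimics an identity column that numbers rows from 1.
clean_rows = [
    {"UniqueID": n, "id": row["id"], "date": parse_sale_date(row["date"])}
    for n, row in enumerate(raw_rows, start=1)
]
```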

5. Data exploration:
Data exploration was done in SQL. An overview of the queries follows:
a) Total records, unique houses and number of houses sold multiple times in the dataset

b) How many houses were sold more than twice?

c) Average price of house by zip code and number of bedrooms

d) Minimum, maximum and average price by zipcode

e) Features of top 5 costliest houses


f) Features of top 5 cheapest houses
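
The kind of aggregation behind items a), b), d) and e) can be sketched as follows. The original queries were written in SQL and are not reproduced in this report, so this Python version, its sample records and its variable names are assumptions for illustration only.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical records; values are illustrative, not taken from the dataset.
records = [
    {"id": 1001, "zipcode": 98001, "price": 300000.0},
    {"id": 1001, "zipcode": 98001, "price": 340000.0},
    {"id": 1001, "zipcode": 98001, "price": 350000.0},
    {"id": 1002, "zipcode": 98004, "price": 900000.0},
    {"id": 1003, "zipcode": 98004, "price": 1100000.0},
    {"id": 1003, "zipcode": 98004, "price": 1200000.0},
]

# a) total records, unique houses, houses sold more than once
sales_per_house = Counter(r["id"] for r in records)
total_records = len(records)
unique_houses = len(sales_per_house)
sold_multiple = sum(1 for n in sales_per_house.values() if n > 1)

# b) houses sold more than twice
sold_more_than_twice = sum(1 for n in sales_per_house.values() if n > 2)

# d) minimum, maximum and average price by zipcode
prices_by_zip = defaultdict(list)
for r in records:
    prices_by_zip[r["zipcode"]].append(r["price"])
zip_summary = {z: (min(p), max(p), mean(p)) for z, p in prices_by_zip.items()}

# e) the five costliest sales (fewer if the data has fewer rows)
top5_costliest = sorted(records, key=lambda r: r["price"], reverse=True)[:5]
```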

6. Creating additional columns


The following three columns were created for data analysis:
a) UniqueID: primary key column
b) ageHouse: year(date) - yr_built
c) HouseType: Small, Medium or Large, based on sqft_living area
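
A sketch of how the two derived columns could be computed. The report does not state the sqft_living cutoffs used for HouseType, so the two thresholds below are hypothetical placeholders; the function names are also assumptions, and Python is used here instead of the original SQL.

```python
from datetime import date

# Hypothetical cutoffs: the report does not give the actual thresholds.
SMALL_MAX_SQFT = 1500
MEDIUM_MAX_SQFT = 3000


def age_house(sale_date: date, yr_built: int) -> int:
    """Age of the house at the time of sale: year(date) - yr_built."""
    return sale_date.year - yr_built


def house_type(sqft_living: float) -> str:
    """Categorize a house as Small, Medium or Large by living area."""
    if sqft_living <= SMALL_MAX_SQFT:
        return "Small"
    if sqft_living <= MEDIUM_MAX_SQFT:
        return "Medium"
    return "Large"


example_age = age_house(date(2014, 10, 13), 1955)
example_type = house_type(2570)
```

All three labels fit in the varchar(6) column listed in the column description.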

7. Exploratory Data Analysis and modelling:


Tableau and R were used to perform the analysis below.

Price distribution by House Type: The histogram shows the distribution of price. Prices are
right-skewed, and large houses have higher prices. Small houses are not sold above 900K. The
maximum price across all house types is 7700K (not shown in the graph).

Price distribution by zipcode and house type: Within each zipcode, prices are high for large
houses and low for small houses. House price also depends on the zipcode: a small house in one
area can be priced higher than a large house in another area.

Distribution of prices over continuous variables:


I. sqft_living and sqft_above are highly correlated with price and have linear
relationships with it.
II. long and lat are moderately correlated with price, and the relationships are nearly linear.
III. sqft_lot and sqft_basement are only weakly correlated with price.
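
The statements above rest on correlation between a predictor and price. As a reminder of what that measures, here is a minimal Pearson correlation sketch in Python. The sample values are made up for illustration, and the report does not say which correlation measure or tool settings were actually used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical values: price rises almost linearly with living area.
sqft_living = [1000, 1500, 2000, 2500]
price = [200000, 310000, 390000, 520000]
r = pearson(sqft_living, price)
```

A value of r close to 1 corresponds to the "highly correlated, linear" case described for sqft_living.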

Distribution of prices over discrete variables:


I. Grade is highly correlated with price.
II. Bathrooms is moderately correlated with price. There is an outlier in bedrooms, with
number of bedrooms equal to 33. It looks like a data entry mistake, so I updated the value to
bedrooms = 3.
III. Condition is highly influential on price.
IV. Houses with waterfront = 1 are priced higher than houses with waterfront = 0 at the lower
end of the price range.
V. ageHouse, floors and bedrooms have a weak relationship with price.
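
The bedrooms correction mentioned in item II can be sketched as below. The original update was done on the actual data and its code is not shown in the report, so this Python version with hypothetical rows only illustrates the idea of replacing the single suspicious value.

```python
# Hypothetical rows; only the bedrooms = 33 value mirrors the outlier in the report.
rows = [
    {"UniqueID": 1, "bedrooms": 3.0},
    {"UniqueID": 2, "bedrooms": 33.0},
    {"UniqueID": 3, "bedrooms": 4.0},
]


def fix_bedroom_outlier(data, bad_value=33.0, corrected=3.0):
    """Return a copy of the rows with the suspicious bedrooms value replaced."""
    return [
        dict(row, bedrooms=corrected) if row["bedrooms"] == bad_value else dict(row)
        for row in data
    ]


fixed_rows = fix_bedroom_outlier(rows)
```

Returning a copy keeps the raw rows unchanged, so the correction stays easy to audit.
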

Box and whisker plot of prices by zipcode:
Prices vary widely by zipcode, and all zipcodes have positive outliers. Variance is unequal
across zipcodes.

Correlation among all variables and prices:
Red cells are highly positively correlated.
Blue cells are highly negatively correlated.
ageHouse is negatively correlated with yr_built (by definition).

8. Finding variables by importance in linear regression model using cross-validation methods
The plot shows the variables ranked by importance. The top 6 variables were used in the linear
model in the later analysis.
Linear regression model
model <- lm(log(price) ~ log(sqft_living)+grade+lat+long+view+waterfront, data = prices)
summary(model)
Model equation
log(price) = -84.294612 + 0.458997*log(sqft_living) + 0.159132*grade + 1.473130*long + 0.091683*view + 0.400204*waterfront
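
Because the model is fitted on log(price), its coefficients act multiplicatively on price. The Python sketch below (the model itself was fitted in R) shows one common way to read them, using only the coefficients printed above; the function names are assumptions, and the percentages are approximate interpretations, not results reported by the author.

```python
from math import exp

# Coefficients as printed in the model equation.
BETA_LOG_SQFT_LIVING = 0.458997
BETA_GRADE = 0.159132
BETA_VIEW = 0.091683
BETA_WATERFRONT = 0.400204


def pct_change_per_unit(beta: float) -> float:
    """Percent change in price for a one-unit increase in a non-logged predictor."""
    return (exp(beta) - 1.0) * 100.0


def pct_change_for_scaling(beta: float, factor: float) -> float:
    """Percent change in price when a logged predictor is multiplied by `factor`."""
    return (factor ** beta - 1.0) * 100.0


waterfront_effect = pct_change_per_unit(BETA_WATERFRONT)  # waterfront 0 -> 1
grade_effect = pct_change_per_unit(BETA_GRADE)            # one grade step
view_effect = pct_change_per_unit(BETA_VIEW)              # one view step
sqft_effect = pct_change_for_scaling(BETA_LOG_SQFT_LIVING, 1.10)  # 10% more area
```

Under this reading, holding the other predictors fixed, a waterfront house is priced roughly 49% higher, and a 10% larger living area goes with a price roughly 4.5% higher.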

Residual standard error: 0.2743 on 21606 degrees of freedom


Multiple R-squared: 0.7289, Adjusted R-squared: 0.7288
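
As a consistency check on the figures above, the degrees of freedom and the adjusted R-squared can be recomputed from the number of records (21613), the number of predictors (6) and the multiple R-squared. This is a small Python sketch using the standard adjusted R-squared formula; the variable names are assumptions.

```python
n_obs = 21613         # sales records in the dataset
n_predictors = 6      # log(sqft_living), grade, lat, long, view, waterfront
r_squared = 0.7289    # multiple R-squared from the model summary

# Residual degrees of freedom: observations minus predictors minus the intercept.
residual_df = n_obs - n_predictors - 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adjusted_r_squared = 1.0 - (1.0 - r_squared) * (n_obs - 1) / residual_df
```

With so many observations relative to predictors, the adjustment is tiny, which is why the two R-squared values differ only in the fourth decimal place.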

9. Model Diagnostics

The residuals vs fitted plot looks good: the residuals are scattered randomly.

The normal Q-Q plot shows that the residuals are almost normally distributed. Further
transformation of variables may help obtain a better fit.
Regression line plot:

10. Summary
The multiple linear regression model fits this data quite well, with an R-squared value of 0.7289.
Many factors affect prices; the most important ones are sqft_living, grade, lat, long, view and
waterfront. Surprisingly, the effect of the number of bedrooms is negligible, and the effect of
sqft_basement is also small. This model is built using multiple linear regression, but model
accuracy could be improved further by using cross-validation sampling, random forest and feature
modification.
