1. Description of dataset
The dataset contains sale price information for 21,613 sales records in King County, USA. It
includes price (the response variable) and many predictors (e.g. bathrooms, bedrooms, zipcode,
sqft_living). The data is stored in a CSV file, which was imported into SQL Server using its import
functionality.
The data was obtained from the Kaggle website, where it is available for download.
2. Description of columns
There are 21 variables in the dataset. Three additional columns were created using SQL:
UniqueID as the primary key, ageHouse to hold the age of the house at the time of sale, and
HouseType to categorize houses by sqft_living area. The description of columns is given
below.
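As a rough illustration of how the three derived columns could be computed, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for SQL Server. The table name house_sales, the sample rows and the 1500 / 3000 sqft cut-offs for HouseType are all assumptions made for the example; the report does not state its thresholds.

```python
import sqlite3

# Hypothetical miniature of the sales table; values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE house_sales ("
    " UniqueID INTEGER PRIMARY KEY AUTOINCREMENT,"
    " id INTEGER, date TEXT, yr_built INTEGER, sqft_living INTEGER)"
)
conn.executemany(
    "INSERT INTO house_sales (id, date, yr_built, sqft_living) VALUES (?, ?, ?, ?)",
    [
        (101, "2014-10-13", 1955, 1180),
        (102, "2014-12-09", 1951, 2570),
        (103, "2015-02-25", 2001, 4100),
    ],
)

# ageHouse = sale year minus build year; HouseType buckets sqft_living.
# The 1500 / 3000 cut-offs are assumptions, not the report's thresholds.
rows = conn.execute(
    """
    SELECT UniqueID,
           CAST(strftime('%Y', date) AS INTEGER) - yr_built AS ageHouse,
           CASE
               WHEN sqft_living < 1500 THEN 'Small'
               WHEN sqft_living < 3000 THEN 'Medium'
               ELSE 'Large'
           END AS HouseType
    FROM house_sales
    ORDER BY UniqueID
    """
).fetchall()
print(rows)
```

In SQL Server the same idea would be written with its own date functions; the CASE expression carries over largely unchanged.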
3. Data Normalization
The data is not normalized: all values are held in a single table, and many columns have high
cardinality. Still, the table represents a single entity, so splitting it (normalizing) would not
be very beneficial.
Procedure to normalize data:
If the data volume grows and many houses are sold repeatedly, the columns whose values never
change between sales should be moved to a separate table. The tables can then be joined on the
id column, which is the individual house id.
Columns to be separated:
waterfront, yr_built, zipcode, long, lat
The table can be normalized using E. F. Codd's normal forms (removing redundancy, removing
partial and transitive dependencies, etc.).
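The split described above might look like the following sketch, again using Python's sqlite3 module in place of SQL Server. The table names house_sales, houses and sales, and all sample rows, are hypothetical; only the column names come from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Flat table as imported (only a few columns shown; rows are made up).
    CREATE TABLE house_sales (
        UniqueID INTEGER PRIMARY KEY, id INTEGER, date TEXT, price REAL,
        waterfront INTEGER, yr_built INTEGER, zipcode TEXT, lat REAL, long REAL
    );
    INSERT INTO house_sales VALUES
        (1, 101, '2014-05-02', 300000, 0, 1955, '98178', 47.51, -122.26),
        (2, 101, '2015-03-10', 340000, 0, 1955, '98178', 47.51, -122.26),
        (3, 102, '2014-12-09', 538000, 1, 1951, '98125', 47.72, -122.32);

    -- Attributes that do not change between sales, one row per house.
    CREATE TABLE houses (
        id INTEGER PRIMARY KEY, waterfront INTEGER, yr_built INTEGER,
        zipcode TEXT, lat REAL, long REAL
    );
    INSERT INTO houses
        SELECT DISTINCT id, waterfront, yr_built, zipcode, lat, long
        FROM house_sales;

    -- Per-sale facts, linked back to the house through id.
    CREATE TABLE sales (
        UniqueID INTEGER PRIMARY KEY,
        id INTEGER REFERENCES houses(id),
        date TEXT, price REAL
    );
    INSERT INTO sales SELECT UniqueID, id, date, price FROM house_sales;
    """
)

n_houses = conn.execute("SELECT COUNT(*) FROM houses").fetchone()[0]
n_sales = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(n_houses, n_sales)
```

The house sold twice appears once in houses and twice in sales, so the unchanging attributes are stored only once.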
4. Problems/Issues with data: While importing the data from Excel into SQL Server, the date
column was imported as nvarchar(255). There was also no primary key: although the house
id identifies each house, it is not unique across rows because many houses were sold more
than once. I altered the date column from nvarchar to date and created a primary key
column named UniqueID. The code is included in the .sql file.
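The two fixes can be sketched as follows. This is not the code from the .sql file; it is an illustration in Python's sqlite3 module, and it assumes the imported date text has the layout YYYYMMDDT000000. SQLite has no dedicated date type, so the sketch stores an ISO-8601 string instead. The table names raw_sales and house_sales are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Raw import: no key column, and date arrives as text.
# The 'YYYYMMDDT000000' layout is an assumption about the file.
conn.execute("CREATE TABLE raw_sales (id INTEGER, date TEXT, price REAL)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [
        (101, "20140502T000000", 300000.0),
        (101, "20150310T000000", 340000.0),
        (102, "20141209T000000", 538000.0),
    ],
)

# Rebuild with a surrogate key (UniqueID) and an ISO-8601 date string.
conn.execute(
    "CREATE TABLE house_sales ("
    " UniqueID INTEGER PRIMARY KEY AUTOINCREMENT,"
    " id INTEGER, date TEXT, price REAL)"
)
conn.execute(
    """
    INSERT INTO house_sales (id, date, price)
    SELECT id,
           substr(date, 1, 4) || '-' || substr(date, 5, 2) || '-' || substr(date, 7, 2),
           price
    FROM raw_sales
    ORDER BY rowid
    """
)
rows = conn.execute(
    "SELECT UniqueID, id, date FROM house_sales ORDER BY UniqueID"
).fetchall()
print(rows)
```

House id 101 appears twice, which is why a separate UniqueID is needed to identify each sale.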
5. Data exploration:
Data exploration was done in SQL. An overview of the exploration is given below.
a) Selecting the total number of records, the number of unique houses, and the number of houses
sold multiple times in the dataset
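A query for step a) could look like the sketch below, run here through Python's sqlite3 module on made-up rows. The table name house_sales is an assumption; id is the house id from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE house_sales (UniqueID INTEGER PRIMARY KEY, id INTEGER)")
# Six illustrative sales covering three houses; two of the houses were resold.
conn.executemany(
    "INSERT INTO house_sales (id) VALUES (?)",
    [(101,), (101,), (102,), (103,), (103,), (103,)],
)

# Total records and number of distinct houses.
total, unique_houses = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT id) FROM house_sales"
).fetchone()

# Houses that appear in more than one sale.
resold = conn.execute(
    "SELECT COUNT(*) FROM ("
    " SELECT id FROM house_sales GROUP BY id HAVING COUNT(*) > 1"
    ")"
).fetchone()[0]

print(total, unique_houses, resold)
```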
9. Model Diagnostic
10. Summary
The multiple linear regression model fits this data quite well, with an R-squared value of 0.7289.
Many factors affect price; the most important are sqft_living, grade, lat, long, view and
waterfront. Surprisingly, the effect of the number of bedrooms is negligible, and the effect of
sqft_basement is also small. This model was built using multiple linear regression, but its
accuracy could be improved further by using cross-validation sampling, a random forest model and
feature modification.
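For readers unfamiliar with the R-squared figure quoted above, it is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it in plain Python for a handful of made-up prices and fitted values; the numbers are illustrative and are not taken from the model in this report.

```python
def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot for paired actual and predicted values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical sale prices and model predictions.
prices = [300000.0, 340000.0, 538000.0, 410000.0]
fitted = [310000.0, 350000.0, 520000.0, 408000.0]
score = r_squared(prices, fitted)
print(round(score, 4))
```

A value of 0.7289 therefore means the model accounts for roughly 73% of the variance in sale price.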