
Contents

Summary of the Project
Data Set Profile
Problem Statement
Approach used
Data Cleaning / Data Preprocessing
Data Reduction
Model Building

Summary of the Project:

This dataset was constructed from data provided by the US Census Bureau,
collected as part of the 1990 US census. The figures are mostly counts
accumulated at different survey levels; for this dataset, the State-Place
level was used. Data from all states were obtained, and most of the counts
were converted into appropriate proportions.
There are 4 different data sets obtained from this database:

House(8H)
House(8L)
House(16H)
House(16L)

These are all concerned with predicting the median price of houses in a
region based on the region's demographic composition and the state of its
housing market.
The number in each name signifies the number of attributes in the data set.
The letter that follows denotes a very rough approximation of the task's
difficulty. For Low difficulty, more correlated attributes were chosen, as
indicated by a univariate smooth fit of each input on the target. For High
difficulty, attributes were chosen to make the modelling harder, due to
higher variance or lower correlation of the inputs with the target.

Data Set Profile:

It contains 4 tasks, each concerned with predicting the median price of
houses in a small survey region. The dataset consists of data about houses
that are vacant for rent or vacant for sale. The data have been categorized
on the basis of:

State/Location
Age of the residents
Ethnicity of the residents
Marital status
Number of members in each household
Rent or price for each combination of the above attributes, etc.

Origin: US Census Bureau
Usage: assessment
Number of attributes: 139
Number of cases: 22,784
Number of proto tasks: 4
Number of methods run on this dataset: 2

Problem Statement

Identifying the variables that would be helpful in predicting house prices
Designing a model for predicting house prices
Validating the model

Approach used

Data Cleaning
Variable Reduction
Model Building

Data Cleaning / Data Preprocessing


Data cleansing, or data cleaning, is the process of detecting and correcting (or
removing) corrupt or inaccurate records from a record set, table, or database. It also
refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data
and then replacing, modifying, or deleting the dirty or coarse data.

Problems with the given datasets:

Some variables had values outside their specified domain range.
Some had values not acceptable for their format (e.g. 0 for average family
size).

Process used to clean the data:

Wrong data values were identified.
Rows with missing values of the dependent variables (house rent and house
price) were deleted entirely.
Incorrect or missing values of the independent variables were replaced by
their respective mean values.
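The cleaning steps above can be sketched in pandas; this is a minimal illustration on toy data, not the report's actual script, and the column names (price, avg_family_size) are hypothetical:

```python
import numpy as np
import pandas as pd

def clean_housing_data(df, target_cols):
    """Drop rows missing a dependent variable, then mean-impute the rest."""
    # Rows with missing dependent variables (house rent / house price) are deleted.
    df = df.dropna(subset=target_cols)
    # Incorrect/missing values of independent variables -> column mean.
    feature_cols = [c for c in df.columns if c not in target_cols]
    df[feature_cols] = df[feature_cols].fillna(df[feature_cols].mean())
    return df

# Toy example: an average family size of 0 is outside the domain,
# so it is first flagged as missing, then imputed.
df = pd.DataFrame({
    "price": [120000, np.nan, 95000],
    "avg_family_size": [2.8, 3.1, 0.0],
})
df["avg_family_size"] = df["avg_family_size"].replace(0.0, np.nan)
cleaned = clean_housing_data(df, target_cols=["price"])
```

Note the order matters: dropping rows with a missing target first keeps those rows' feature values out of the imputation means.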

Data Reduction

Data reduction is the transformation of numerical or alphabetical digital
information, derived empirically or experimentally, into a corrected,
ordered, and simplified form. The basic concept is reducing large amounts
of data down to their meaningful parts. Data reduction is undertaken in the
presence of reading or measurement errors, so some idea of the nature of
these errors is needed before the most likely value can be determined.

Problems in the data set:

A total of 134 independent variables were given in the data set. Not all of
them should be used, because:

They can lead to spurious or coincidental relations between a variable
and house price, making the model incorrect and not rigorous in all
cases.
Correlation between independent variables can be high, and there is no
need to use near-duplicate variables when building the model.

Method used for reducing the number of variables:

Variables that did not act as inputs for the model were removed (e.g. the
string variable State and the variable Pin code).
Variables with zero variance were deleted, as they are not variables but
constants.
Factor analysis was used to combine the data into components.
A total of 19 components were created, which explained 85% of the total
variation explained by all the variables.
Problems with Factor analysis:
It is a black box; no information about the bundling of variables can be
recovered.
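The reduction pipeline above (drop constants, then extract components covering 85% of variance) can be sketched with scikit-learn. The report does not state the tool used; PCA is used here as a stand-in for the factor analysis step, and the data are synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 cases, 30 correlated independent variables
# driven by 6 underlying factors plus a little noise.
base = rng.normal(size=(200, 6))
X = base @ rng.normal(size=(6, 30)) + 0.1 * rng.normal(size=(200, 30))

# Drop zero-variance columns first: constants carry no information.
X = X[:, X.std(axis=0) > 0]

# A fractional n_components keeps the smallest number of components
# whose cumulative explained variance reaches 85%.
pca = PCA(n_components=0.85)
scores = pca.fit_transform(StandardScaler().fit_transform(X))
```

One design note: unlike a fully black-box reduction, the fitted `pca.components_` matrix can be inspected to see how the original variables load on each component.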
Stepwise regression for cleaning and modelling:
Stepwise regression builds regression models by adding one variable at a
time.
Each time a variable is added, the regression model is re-run against all
the independent variables in the new set.
The additions can be stopped once there is no longer any significant
increase in R-squared.
We stopped the additions at 14 variables.
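The stepwise procedure above can be sketched as greedy forward selection on R-squared. This is a hypothetical minimal implementation, since the report does not state its exact tool or stopping threshold; the data are synthetic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_select(X, y, min_gain=0.01, max_vars=14):
    """Add one variable at a time, keeping the one with the largest
    R-squared gain; stop when the gain falls below min_gain."""
    selected, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        # Re-fit the model once per candidate variable.
        gains = []
        for j in remaining:
            cols = selected + [j]
            r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            gains.append((r2, j))
        r2, j = max(gains)
        if r2 - best_r2 < min_gain:
            break  # no significant increase in R-squared
        selected.append(j)
        remaining.remove(j)
        best_r2 = r2
    return selected, best_r2

# Toy data: the target depends on columns 0 and 1 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=300)
selected, r2 = forward_select(X, y)
```

With these toy data, the loop picks the two truly informative columns and then stops, because the remaining columns add no significant R-squared.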

MODEL BUILDING
