Introduction
Huge Losses, Rational Investors
The Model
HMDA Data and the Financial Crisis
Explaining the Errors
Conclusion
References
R Packages used and relative documentation
APPENDIX
R Code for the Models
Introduction
The aim of the following thesis is to model house prices in the state of
California, to find indications of whether the price depends on common
economic fundamentals such as GDP growth, interest rates and home supply,
and to ask whether the decline in prices after the financial crisis of
2007-2009 was due to a corresponding sudden change in fundamentals or to
other, external factors.
House prices have always been a debated topic, and housing is worth
treating as a topic in its own right because it has been shown to be a
major asset in household portfolios (Englund et al., 2002; Flavin and
Yamashita, 2002). To give a quick sense of scale, total mortgage debt
outstanding in the US as of the first quarter of 2017 is just over $14
trillion, as reported by the FED.
The California real estate market is interesting not only because it is
home to periodic boom-and-bust cycles, in some measure due to speculation
fueled by rapid GDP growth, but also because it has experienced steady
price growth, almost reaching pre-crisis levels as of the first quarter
of 2017.
Helbling and Terrones (2003) recorded, in the years from 1959 to 2002, 52
equity busts in 19 different countries, roughly one crash every 13 years
per country, counting as a bust a contraction of over 37% in equity
prices. (A large share of them were due to the breakdown of the Bretton
Woods accord and the associated currency pegs; they also find only a weak
correlation between booms and busts, with most booms simply deflating in
subsequent bear markets.) By contrast, for the period 1970 to 2002 they
found 20 housing crashes (against 20 equity ones in the same period)
across 12 countries, using a threshold of a 14% price decrease because
real estate markets are usually less volatile. They also found two
important aspects of housing busts: at least in the sample examined,
there is a higher correlation between boom and bust in real estate
markets than in equity markets, with a 40% chance of a bust following a
booming period in house prices; and the output effects associated with a
housing bust were twice as large as those expected after an equity market
crash.
The financial crisis of 2007 has once again reminded the world that a
crash in the housing market can quickly have repercussions in other
markets and on the economy as a whole. In 2008 U.S. real GDP grew only
0.4%, and it then contracted at an annual rate of over 5% in the fourth
quarter of 2008 and the first quarter of 2009. Unemployment also
skyrocketed, rising in less than two years from 4.9% in December 2007 to
9.5% in June 2009.
Jeff Holt (2009), in a short overview of the literature on the financial
crisis at the time, identifies four major causes of the crisis:
(1) Mortgage interest rates hit record lows and were kept at those levels
despite overall declining savings in the US, thanks to the influx of
capital from foreign countries such as Japan, the United Kingdom, China
and Brazil.
(2) The FED pursued an expansive monetary policy to get out of the
recession of 2001. The low rates incentivized leveraging in pursuit of
higher returns, and the use of adjustable-rate mortgages (ARMs) became
widespread. The latter allowed lower payments at the start of the
mortgage, increasing the potential demand for housing.
(3) Increased competition in the mortgage market due to the internet, and
governmental policies aimed at increasing the number of home-owners among
lower-income households, made it much easier to obtain a mortgage. The
result was an increase in securitization and in the percentage of loans
to lower-income families in the portfolios of Fannie Mae and Freddie Mac.
(4) Irrational exuberance, defined by Shiller (2005) as a heightened
state of speculative fervor. The imputed effect is that all agents
somehow believed in ever-rising house prices, increasing speculation,
securitization and the lowering of credit standards, fueling what is
sometimes called a bubble. Credit rating agencies continued to give AAA
ratings to securities backed by subprime loans on the assumption that
house prices would continue to rise (historically, subprime loans are
extended to people with a bad credit history and carry higher rates, but
also a default rate roughly ten times that of other mortgages).
Eventually, financial firms were forced to extend their loans to
increasingly dubious individuals to keep their fee revenues high, given
the limited supply of borrowers with a good credit score. When the first
losses materialized and house prices peaked, the default rate on loans
skyrocketed; whoever had purchased securitized loans suddenly faced huge
losses, magnified by high leverage, and the bubble burst.
Huge Losses, Rational Investors
Because of the indivisibility and occupant-ownership of residential real
estate, losses are moreover exacerbated by the nature of homes as a
self-consumption good, which makes the market relatively illiquid.
The owner is usually the sole equity holder and, to avoid problems of
moral hazard, is not in a position to raise external capital in the form
of equity; comparing home ownership with exchange-traded instruments, one
can already see that residential real estate has no counterpart to the
maintenance margin. The combination of these effects means that when home
owners are able to cash out easily through equity refinancing, they tend
to do so near the peak of the market, with high house prices, low
interest rates and high leverage.
Greenspan and Kennedy (2008) document how mortgage debt has
increased more than home value and attribute this effect to equity
extraction via home sales and cash-out refinancing.
As soon as the house price declines, the home equity cushion is wiped out
and the borrower defaults on the loan. Since this is equivalent to taking
out a mortgage at the peak of the market, borrower defaults become highly
correlated with each other compared with a situation where no equity
extraction is allowed. The simulation by Khandani, Lo and Merton (2013)
shows that losses of the magnitude encountered in the financial crisis
are possible in such a scenario, estimating losses of $1.7 trillion with
frictionless home refinancing against $330 billion when refinancing is
forbidden.
Since it is possible, at least in theory, for a bubble to arise even when
markets are populated by rational investors, it is important to check
what these theoretical rational investors actually look for in the real
estate market. For the reasons mentioned at the start of this text, I
believe California to be a good test field.
The Model
As a starting point I take four time series from the Federal Reserve Bank
of St. Louis regarding the state of California: CASTHPI as a proxy for
the general level of house prices (an all-transactions index that, unlike
the famous Case-Shiller index, also includes appraisal values rather than
transaction prices alone), total personal income (CAOTOT), and the number
of new houses authorized for building in the previous period (CABPPRIV).
We also include the MORTGAGE30US series (30-Year Fixed Rate Mortgage
Average in the United States) as a proxy for the long-term interest rate.
The time series run from 1988 to 2015, and for consistency we convert all
of them to quarterly frequency by dropping every observation that does
not fall at the end of a quarter. This avoids data imputation techniques
such as mean substitution or regression imputation that might sway the
coefficients one way or the other.
I also split the data, using the years from 1988 to 2005 as the training
set and the remaining data to test the model.
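The split described above can be sketched with base R's window() function (a minimal illustration on simulated data standing in for the FRED series; the variable names are assumptions):

```r
# Stand-in for the quarterly CASTHPI series, 1988Q1-2015Q4 (112 observations)
set.seed(1)
casthpi <- ts(cumsum(rnorm(112, mean = 1)) + 100,
              start = c(1988, 1), frequency = 4)

# Training sample: 1988-2005; test sample: everything from 2006 onwards
ts_train <- window(casthpi, start = c(1988, 1), end = c(2005, 4))
ts_test  <- window(casthpi, start = c(2006, 1))

length(ts_train)  # 72 quarters
length(ts_test)   # 40 quarters
```

Note that this split leaves exactly 40 quarterly observations in the test sample, which matches the sample size discussed later for the residual diagnostics.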
The reason for this is twofold. First, is our model able to predict the
sudden burst of the bubble in the crisis years 2006-2007? If the model is
robust enough and is based on those same fundamentals, then we expect it
to be able to predict a downturn from those factors. If it is not, then
has something else changed in the period that is not captured by the
model? This paper explores the possibility that the HMDA data, at
single-loan level of detail, can point to some answers to the second
question.
After testing the time series for stationarity (the raw series are
non-stationary and are differenced once; see the ADF tests in the
appendix), I fit a multivariate linear model on the data by OLS,
including only lagged terms. The inclusion of lagged terms makes it
possible to produce a forecast for the future and to test the validity of
the model against the held-out sample from 2005 onwards. The objective of
this model is to predict the index in the next period, in our case the
next quarter.
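A lagged regression of this kind can be written down in a few lines of base R; the snippet below is a sketch on simulated data (in the appendix the same alignment is handled automatically by dyn$lm; the series here are stand-ins for CASTHPI and MORTGAGE30US):

```r
# One-quarter-lag regression done by hand for transparency
set.seed(2)
ts1 <- cumsum(rnorm(72)) + 100   # stand-in for the house price index
ts2 <- rnorm(72, mean = 8)       # stand-in for the mortgage rate

d1 <- diff(ts1)  # differenced index, length 71
d2 <- diff(ts2)  # differenced rate

n <- length(d1)
# Regress today's change in the index on last quarter's changes
model <- lm(d1[2:n] ~ d1[1:(n - 1)] + d2[1:(n - 1)])
summary(model)
```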
As a starting point we include all the factors lagged by one period, the
regression results are summarized in the following table:
Regression 1
While I obtain an overall good adjusted R-squared of 0.697, most of the
factors are non-significant, casting doubt on the actual usefulness of
the regressors. I then drop every variable except CASTHPI, obtaining a
pure autoregressive model. I also checked whether adding further AR(k)
terms significantly improved the explanatory power of the model, but the
marginal benefit of doing so was almost null; the regression outputs are
summarized in the appendix.
In the end, the multivariate linear model with the best goodness of fit
formed from the four time series could be a simple AR(1): the best
predictor of the house price index in the next quarter is found to be the
price of the index today.
Regression 2
¹ It could simply mean that California's market is completely independent of the U.S. one as regards mortgages,
and that home buyers simply do not bother to check how many houses have been approved for construction in the
next period. The latter might be because of the nature of homes as a consumption good: one cannot expect the
average buyer to be able to wait until new houses are constructed to settle in.
These results refer to the once-differenced time series; for the sake of
completeness, it is actually possible to obtain an R-squared of over 99%
by not differencing and taking the time series as they are. While the
obviously high correlations between the regressors carry a very high
chance of producing a spurious regression, it is still interesting to see
how the model behaves when confronted with the test dataset (2005-2015).
This is the output of the regression with lagged variables from 1988 to
2005:
When the model was used to make predictions on the test sample, it
consistently overestimated prices. This is not surprising, considering
that the training sample included the wild run-up in prices before the
bubble burst. Just by inspection, the model seems not only unable to
predict the burst of the bubble: it completely overshoots the peak and
consistently underestimates the losses in the index.
Forecast on the test sample using the multivariate model with
lagged variables:
I next fit a seasonal ARIMA(p,d,q)(P,D,Q)[S] model, with:
p = non-seasonal AR order
d = non-seasonal differencing
q = non-seasonal MA order
P = seasonal AR order
D = seasonal differencing
Q = seasonal MA order
S = time span of the repeating seasonal pattern
Without differencing operations, the model can be written more formally
as

Φ(B^S) φ(B) (x_t − μ) = Θ(B^S) θ(B) w_t

where B is the backshift operator, φ and θ are the non-seasonal AR and MA
polynomials, Φ and Θ their seasonal counterparts, μ is the mean of the
series and w_t is white noise.
ARIMA MODEL: CASTHPI
ARIMA(0,2,3)(2,0,0)[4]
Coefficients:
          ma1      ma2     ma3     sar1    sar2
       -0.3405  -0.3073  0.4496  -0.2449  0.4136
s.e.    0.1176   0.1393  0.1153   0.1838  0.1912
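The fit is evaluated below through the mean absolute percentage error; as a point of reference, that metric amounts to the following one-liner (a generic helper, not part of the original code):

```r
# Mean absolute percentage error between actual and fitted values, in percent
mape <- function(actual, fitted) {
  100 * mean(abs(actual - fitted) / abs(actual))
}

mape(c(100, 110, 120), c(99, 111, 118))  # small errors -> small MAPE
```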
From the table above, the mean absolute percentage error when forecasting
the next period with the ARIMA model is quite low at 0.93%; however, it
would have been strange if the algorithm had not been able to fit the
model to the training data. It is much more meaningful, then, to move out
of the training data and see how the model performs on the test dataset
(2005-2015), or in a potential future.
In this case the CASTHPI time series is called ts_test and the output is
collected in the following table:
Coefficients:
          ma1      ma2     ma3     sar1    sar2
       -0.3405  -0.3073  0.4496  -0.2449  0.4136
s.e.    0.0000   0.0000  0.0000   0.0000  0.0000
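The zero standard errors follow from how a previously estimated model is re-applied to new data: the coefficients are held fixed rather than re-estimated, so there is no estimation uncertainty to report. A sketch in base R (simulated data; this mirrors what forecast::Arima(ts_test, model = fit) does internally):

```r
set.seed(3)
# Simulated I(2) series standing in for the training and test samples
ts_train <- ts(cumsum(cumsum(rnorm(72))), start = c(1988, 1), frequency = 4)
ts_test  <- ts(cumsum(cumsum(rnorm(40))), start = c(2006, 1), frequency = 4)

# Estimate ARIMA(0,2,3)(2,0,0)[4] on the training sample
fit <- arima(ts_train, order = c(0, 2, 3),
             seasonal = list(order = c(2, 0, 0), period = 4))

# Re-apply it to the test sample with every coefficient fixed -> s.e. = 0
refit <- arima(ts_test, order = c(0, 2, 3),
               seasonal = list(order = c(2, 0, 0), period = 4),
               fixed = coef(fit), transform.pars = FALSE)
```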
It can be seen that the variance of the residuals seems to increase in
the years right after the crisis, in the period 2008-2010. The
autocorrelation function is interesting because it suggests a strong
negative autocorrelation every 8 periods; whether this is an indication
of mean reversion is hard to say, because our test sample contains only
around 40 observations. For the same reason, the histogram of the
residuals is not helpful in suggesting whether the residuals are drawn
from a normal distribution or not. Running a Shapiro-Wilk normality test
on the residuals gave a p-value of 0.38, so normality cannot be rejected,
but with so few observations the test has limited power. In this case,
while it is not strong empirical evidence, a QQ-plot is a better
indicator of the normality of the residuals.
The values lie mostly on a straight line apart from a few outliers, and
considering the small test sample size, for the purposes of our model we
can assume that the residuals are more or less normally distributed.
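The diagnostics just described amount to a few lines of base R (res stands in for the ARIMA residuals on the test sample; simulated here):

```r
set.seed(4)
res <- rnorm(40)  # stand-in for residuals(test_ARIMA), ~40 observations

qqnorm(res)
qqline(res)              # points close to the line suggest normality

sw <- shapiro.test(res)  # with n ~ 40 the test has limited power
sw$p.value
```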
In the end it is also useful to look at the predicted values against the
real ones and see how they fare.² The blue line shows the predictions
from our model, while the black line shows the actual observations:
² Robert Tibshirani and Trevor Hastie, in An Introduction to Statistical Learning, suggest always looking at
plots before any analysis, and I wholeheartedly agree with them.
ARIMA MODEL FIT&FORECAST:
Predicted values in blue
Actual Values in black
Two things are noticeable to the eye: the model has a difficult time
predicting the next period's value both at the peak in 2006 and after the
first sharp decrease in the years 2008-2010, which is consistent with
what we found in the residuals.
In the financial literature, one factor usually blamed for the financial
crisis is progressively laxer lending standards and the predatory
behavior of lenders. If that is truly the case, it seems plausible that
digging into single-loan-level data should shed some light on what causes
the model to misbehave around the bubble's peak.
If lax lending standards were really one of the fundamental factors of
the crisis, I expect to find some correlation between the forecasting
errors and measures of the above-mentioned lending standards, since the
model was calibrated before the actual crash and is overall consistent
with past data.
HMDA Data and the Financial Crisis
Since 1975 the Home Mortgage Disclosure Act has required public
disclosures from most mortgage lending institutions based in metropolitan
areas of the US. The information disclosed is made available to help the
public determine whether the institutions are adequately serving the
population's home-financing needs and to help enforce the fair lending
laws, with special regard to racial discrimination. The information is
disclosed at the level of the single loan and contains multiple
parameters such as applicant race, co-applicant race, income, county,
census tract, loan amount, rejection/acceptance and reason for rejection
(for the full list of the parameters please check the APPENDIX).
The dataset I am using is the aforementioned HMDA data, downloadable
directly from the US government with little manipulation. I took only the
subset concerning the state of California, spanning 2007 to 2015 (data
before that are much harder to retrieve). Two considerations must be
made: (1) I was not able to obtain data prior to 2007 that could easily
be homogenized with the most recent years, both because of changes in
reporting standards and formats, and because those data must be
downloaded in a format other than CSV; the treatment of missing values in
that raw text file was too time-consuming, and energies could be better
directed elsewhere. (2) Even if one were able to retrieve and clean the
data without complications, one should still carefully examine the
documentation on reporting standards, since changes over time might
otherwise lead to wrong conclusions.
Explaining the Errors
I then construct the correlation matrix; the results show that the
absolute errors of the ARIMA model are negatively correlated with the
accuracy of the logistic model (corr. coef. -0.62), positively correlated
with the rejection rate of loans (corr. coef. 0.31), and more strongly so
with rejections due to collateral (corr. coef. 0.62).
                                   Abs. ARIMA   logi_        % of        % rejection
                                   errors       accuracy2    rejection   (collateral)
Absolute ARIMA errors                 1
logi_accuracy2                       -0.620        1
% of rejection                        0.310       -0.882        1
% of rejection due to collateral      0.623       -0.603        0.260        1
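The matrix above is a plain Pearson correlation matrix; once the yearly aggregates are collected in a data frame, it can be reproduced with a single call to cor() (illustrative numbers below, not the actual HMDA aggregates):

```r
# Hypothetical yearly aggregates, one row per year
df <- data.frame(
  arima_abs_error     = c(2.1, 5.4, 3.3, 1.2, 0.8),
  logi_accuracy       = c(0.80, 0.70, 0.75, 0.85, 0.88),
  rejection_rate      = c(0.30, 0.40, 0.35, 0.25, 0.22),
  rejected_collateral = c(0.10, 0.18, 0.14, 0.08, 0.07)
)

round(cor(df), 2)  # pairwise Pearson correlations
```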
Conclusion
The model with the best goodness of fit that I could find was an ARIMA
model trained on the data from 1988 to 2005, which yields a very low mean
percentage error even when forecasting the next period outside the
training sample.
Even after trying to account for fundamental variables common in the
financial literature, I found little success with them, and there are a
number of possible reasons for that:
(1) I used the wrong variables. The use of long-term interest rates
instead of short-term ones is debatable, because what the market usually
reacts to are announcements from the FED, and the latter can directly
impact only short-term rates. While total income was significant, the
number of new homes approved for construction was not; this could be
because I went back only one period, while the variable could have an
impact many years down the road, considering that the average time to
build, refurbish and sell a home is well above one quarter. I also
ignored population, which is a direct driver of demand for real estate.
(2) I have also mostly ignored or glossed over the price-rent ratio, and
no discussion of house prices can be complete without it. It is, however,
a complicated matter, because of the variation between the quality of
real estate that is commonly rented out and the quality of that used as a
first home for dwelling purposes. Sorting out the problems regarding the
price-rent ratio and making a reasonable set of assumptions seems like a
good direction for a future paper, considering also the extensive
literature surrounding the topic.
References
1. Aho, K., Derryberry, D., & Peterson, T. Model selection for ecologists: the
worldviews of AIC and BIC. Ecology (2014), 95(3), 631-636.
2. Akaike, H. A new look at the statistical model identification. IEEE Trans.
Automat. Contr. AC-19: 716-723, 1974.
3. Alan Greenspan, James Kennedy; Sources and uses of equity extracted from
homes, Oxford Review of Economic Policy 2008; 24(1): 120-144.
4. Amir E. Khandani, Andrew W. Lo, Robert C. Merton, Systemic risk and the
refinancing ratchet effect, Journal of Financial Economics 108 (2013), pp. 29-45.
5. Capozza, Dennis R.; Hendershott, Patric H.; Mack, Charlotte (2004). "An
Anatomy of Price Dynamics in Illiquid Markets: Analysis and Evidence from
Local Housing Markets." Real Estate Economics 32(1): 1-32.
6. Choi Laura, The Current Landscape of the California Housing Market,
Federal Reserve Bank of San Francisco, Working paper 2010-03 September,
2010.
7. Doblas-Madrid, Antonio, A robust model of bubbles with multidimensional
uncertainty, Econometrica, Vol. 80, September 2012.
8. Englund, P., M. Hwang and J. M. Quigley (2002), Hedging Housing Risk,
Journal of Real Estate Finance and Economics, Vol. 24, pp. 167-200.
9. Flavin, M. and T.Yamashita (2002), Owner-Occupied Housing and the
Composition of the Household Portfolio, American Economic Review, Vol. 92,
pp. 345-362
10. Helbling, T., and M. Terrones (2003), When Bubbles Burst, in World Economic
Outlook, Chapter 2, pp. 61-94 (Washington DC: International Monetary
Fund).
11. Jeff Holt. A Summary of the Primary Causes of the Housing Bubble and the
Resulting Credit Crisis: A Non-Technical Paper. The Journal of Business
Inquiry (2009).
12. John Y. Campbell and Robert J. Shiller. Stock Prices, Earnings and Expected
Dividends. Journal of Finance (1988).
13. Karl Case and Robert Shiller, Is There a Bubble in the Housing Market?
Brookings Papers on Economic Activity, 2003, vol. 34, issue 2, 299-362.
14. Shiller, R. J., 2005, Definition of Irrational Exuberance,
Irrationalexuberance.com
R Packages used and relative documentation
ade4 ( https://cran.r-project.org/web/packages/ade4/ade4.pdf )
biglm ( https://cran.r-project.org/web/packages/biglm/biglm.pdf )
caret ( https://cran.r-project.org/web/packages/caret/caret.pdf )
data.table ( https://cran.r-project.org/web/packages/data.table/data.table.pdf )
dplyr ( https://cran.r-project.org/web/packages/dplyr/dplyr.pdf )
dyn (https://cran.r-project.org/web/packages/dyn/dyn.pdf )
forcats ( https://cran.r-project.org/web/packages/forcats/forcats.pdf )
forecast (https://cran.r-project.org/web/packages/forecast/forecast.pdf )
ggplot2 ( https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf )
MASS ( https://cran.r-project.org/web/packages/MASS/MASS.pdf )
sjPlot (https://cran.r-project.org/web/packages/sjPlot/sjPlot.pdf )
tseries ( https://cran.r-project.org/web/packages/tseries/tseries.pdf )
xlsx ( https://cran.rstudio.com/web/packages/xlsx/xlsx.pdf )
Other sources:
https://stackoverflow.com/questions/5048638/automatically-expanding-an-r-factor-into-a-collection-of-1-0-indicator-variables
(used to create the dummy variables for the logistic regression)
Textbook: An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
APPENDIX
Total Personal Income calculation:
ADF TESTS
Augmented Dickey-Fuller Test for stationarity on non-transformed time-series
data: MORTGAGE30US
Dickey-Fuller = -2.8711, Lag order = 4, p-value = 0.2211
alternative hypothesis: stationary
data: CASTHPI
Dickey-Fuller = 1.6863, Lag order = 4, p-value = 0.99
alternative hypothesis: stationary
data: CABPPRIV
Dickey-Fuller = -2.63, Lag order = 4, p-value = 0.3191
alternative hypothesis: stationary
data: CAOTOT
Dickey-Fuller = -1.4435, Lag order = 4, p-value = 0.8014
alternative hypothesis: stationary
Other Polynomial models
Various polynomial fits with ts1 = CASTHPI, ts2 = 30-year mortgage rate, ts3 = new houses
approved for construction, ts5 = Total Income
Residuals:
Min 1Q Median 3Q Max
-4.6421 -0.8621 0.2357 0.8448 4.6432
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.194e+02 7.463e+02 -0.696 0.492
lag(ts1, -1) 8.720e+00 1.149e+01 0.759 0.454
lag(ts2, -1) 5.508e-01 7.992e-01 0.689 0.496
lag(ts3, -1) -1.020e-04 1.172e-04 -0.870 0.391
lag(ts5, -1) -3.835e-08 1.184e-08 -3.239 0.003 **
I(lag(ts1, -1)^2) -3.557e-02 5.809e-02 -0.612 0.545
I(lag(ts1, -1)^3) 5.308e-05 9.725e-05 0.546 0.589
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residuals:
Min 1Q Median 3Q Max
-5.0617 -1.9625 -0.3682 1.0633 10.6334
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.148e+00 8.901e+00 0.241 0.8101
lag(ts1, -1) 8.402e-01 6.805e-02 12.345 < 2e-16 ***
lag(ts3, -1) 7.343e-04 8.901e-05 8.250 2.07e-11 ***
lag(ts5, -1) 1.921e-08 3.301e-09 5.820 2.57e-07 ***
I(lag(ts1, -1)^2) 2.530e-04 1.163e-04 2.176 0.0336 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R CODE FOR THE MODELS
MULTIVARIATE AND ARIMA MODEL
#########final model
library(dyn)
library(tseries)
library(sjPlot)
library(forecast)
#########Testing time series for stationarity, all series
adf.test(ts1, alternative = "stationary")
adf.test(ts2, alternative = "stationary")
adf.test(ts3, alternative = "stationary")
adf.test(ts5, alternative = "stationary")
#########Testing the once-differenced series (the original script repeated
#########adf.test(diff(ts5)) five times, presumably a copy-paste slip)
adf.test(diff(ts1), alternative = "stationary")
adf.test(diff(ts2), alternative = "stationary")
adf.test(diff(ts3), alternative = "stationary")
adf.test(diff(ts5), alternative = "stationary")
plot(diff(log(CASTHPI)))
abline(a = 0, b = 0 )
automaticARIMA = auto.arima(CASTHPI)
summary(automaticARIMA)
####check if aic is good enough
automaticARIMA2 = auto.arima(CASTHPI, ic = "aic")
summary(automaticARIMA2)
# sjt.lm() expects an lm object, not an ARIMA fit, so the original
# sjt.lm(automaticARIMA) call is omitted here
forecast_AArima <- forecast(automaticARIMA, h = 50, bootstrap = TRUE)  # forecast.Arima() is deprecated
summary(forecast_AArima)
plot(forecast_AArima)
#Export time series data to make the calculations in Excel; impractical, but what we want is a model that learns each period
test_ARIMA <- Arima(ts_test, model = automaticARIMA)  # assumed: test_ARIMA was not defined in the extracted code
plot(test_ARIMA$x)
lines(fitted(test_ARIMA), col = "blue")
lines(fitted(ets_model_test), col = "red")
###ets model
ets_model <- ets(ts_train)  # assumed: ets_model was not defined in the extracted code
checkresiduals(ets_model)
ets_model_test <- ets(ts_test, model = ets_model)
summary(ets_model_test)
checkresiduals(ets_model_test)
checkresiduals(test_ARIMA)
### check residuals and their normality
model_lagged_diff2 <- dyn$lm(diff(log(CASTHPI)) ~ diff(lag(log(CASTHPI), -1)) +
  diff(lag(MORTGAGE30US, -1)) + diff(lag(CABPPRIV, -1)) + diff(lag(CAOTOT, -1)))
summary(model_lagged_diff2) ## good but not so much
checkresiduals(model_lagged_diff2)
ARIMA_residuals <- test_ARIMA$residuals
View(ARIMA_residuals)
qqnorm(ARIMA_residuals)
qqline(ARIMA_residuals)
shapiro.test(ARIMA_residuals) ## interpret with care: affected by the low number of observations
write.xlsx(ARIMA_residuals, "D:/residuals_coeff.xlsx", sheetName = "Sheet1")
LOGISTIC REGRESSION
library(data.table)
library(ade4)
library(dplyr)
library(forcats)
library(ggplot2)
library(caret)
library(MASS)
#library(biglm)
#####Random Sample # probably found on Stack Overflow but it does its job; I am sorry I do not remember where I took this
randomSample = function(df, n) {
  return(df[sample(nrow(df), n), ])
}
###create null dataframe to store regression coefficients
#Logistic_coefficients <- df(NULL)
logi_regr_coef <- t(as.data.frame(logistic_regression$coefficients))
logi_regr_coef = logi_regr_coef[-1,]
View(logi_regr_coef)
#### Loop for regressions
for(i in 2007:2015){
name <-paste("data",i, sep = "")
print(name)
location = paste("D://hmdadata//final//",i, ".csv", sep = "")
Object<-fread(location)
#Object<-randomSample(Object, 300000)
Object$action_taken_name <- as.factor(Object$action_taken_name)
Object$applicant_sex_name <- as.factor(Object$applicant_sex_name)
Object$loan_purpose_name <- as.factor(Object$loan_purpose_name)
Object$agency_name <- as.factor(Object$agency_name)
Object$lien_status_name <- as.factor(Object$lien_status_name)
Object$county_name <- as.factor(Object$county_name)
Object$applicant_income_000s <- as.numeric(Object$applicant_income_000s)
Object$loan_amount_000s <- as.numeric(Object$loan_amount_000s)
testsample<-randomSample(Object, 1000000)
Object<-randomSample(Object, 1000000)
###logistic model
# NOTE: the glm() call itself was lost in extraction; a plausible reconstruction
# (the exact regressors are an assumption) would be:
# logistic_regression <- glm(action_taken_name ~ applicant_income_000s + loan_amount_000s +
#   applicant_sex_name + loan_purpose_name + lien_status_name,
#   data = Object, family = binomial)
assign(paste("data", i, sep = ""), Object)
###store coefficients
transpose = t(logistic_regression$coefficients)
logi_regr_coef <- rbind(logi_regr_coef, transpose[1, ])
######store accuracy
fitted.results <- predict(logistic_regression, newdata = testsample, type = 'response')
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
assign(paste("logi_acc", i, sep = ""), fitted.results)
#misClasificError <- rbind(misClasificError, mean(fitted.results != testsample$action_taken_name))
remove(Object)
}
#######make confusion Matrix
#quantile(data2015$loan_amount_000s, c(.33, .66, 1))
#regressiondataframe = acm.disjonctif(data2015)
##############
####Accuracy test
fitted.results <- predict(logistic_regression,newdata=data2015,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
#misClasificError <- mean(fitted.results != testsample$action_taken_name)
summary(fitted.results)
summary(logi_acc2007)
summary(logi_acc2008)
summary(logi_acc2009)
summary(logi_acc2010)
summary(logi_acc2011)
summary(logi_acc2012)
summary(logi_acc2013)
summary(logi_acc2014)
summary(logi_acc2015)
accuracy_logi1 <- c(summary(logi_acc2007)[4],
summary(logi_acc2008)[4],
summary(logi_acc2009)[4],
summary(logi_acc2010)[4],
summary(logi_acc2011)[4],
summary(logi_acc2012)[4],
summary(logi_acc2013)[4],
summary(logi_acc2014)[4],
summary(logi_acc2015)[4])