
Contents

Introduction
Huge Losses, Rational Investors
The Model
HMDA Data and the Financial Crisis
Explaining the Errors
Conclusion
References
R Packages used and related documentation
APPENDIX
R Code for the Models
Introduction
The aim of this thesis is to model house prices in the state of California, to determine whether prices depend on common economic fundamentals such as GDP growth, interest rates and housing supply, and to assess whether the price decline after the financial crisis of 2007-2009 was due to a corresponding sudden change in those fundamentals or to other, external factors.
House prices have always been a debated topic, and housing is worth studying because it has been shown to be a major asset in household portfolios (Englund et al., 2002; Flavin and Yamashita, 2002). To give a quick sense of scale, total mortgage debt outstanding in the US as of the first quarter of 2017 was just over 14 trillion dollars, as reported by the FED.
The California real estate market is interesting not only because it is home to periodic boom and bust cycles, partly due to speculation fuelled by rapid GDP growth, but also because it has experienced a steady growth in prices, almost reaching pre-crisis levels as of the first quarter of 2017.

Helbling and Terrones (2003) record 52 equity busts in 19 countries between 1959 and 2002, roughly one crash every 13 years per country, where a bust is defined as a contraction of more than 37% in equity prices (a large share of these were due to the breakdown of the Bretton Woods accord and its pegged currencies; they also find only a weak correlation between booms and busts, with most booms simply deflating in subsequent bear markets). By contrast, for the period 1970 to 2002 they find 20 housing crashes (against 20 equity ones in the same period) across 12 countries, using a threshold of a 14% price decrease because real estate markets are usually less volatile. They also highlight two important aspects of housing busts: at least in the sample examined, a boom is more likely to be followed by a bust in real estate markets than in equity markets, with a 40% chance of a bust following a booming period in house prices; and the output effects associated with a housing bust were twice as large as those expected after an equity market crash.

The financial crisis of 2007 has once again reminded the world that a crash in the housing market can quickly have repercussions on other markets and on the wider economy. In 2008 US real GDP grew only 0.4%, and it then contracted at annual rates of more than 5% in the fourth quarter of 2008 and the first quarter of 2009. Unemployment also skyrocketed in less than two years, from 4.9% in December 2007 to 9.5% in June 2009. Jeff Holt (2009), in a short overview of the literature on the financial crisis at the time, identifies four major causes of the crisis:

1. Low mortgage interest rates
2. Low short-term interest rates
3. Relaxed mortgage lending standards
4. Irrational exuberance

(1) Mortgage interest rates hit record lows and were kept at those levels, despite overall declining savings in the US, thanks to the influx of capital from foreign countries such as Japan, the United Kingdom, China and Brazil.
(2) The FED pursued an expansive monetary policy to exit the recession of 2001. The low rates incentivized leveraging in pursuit of higher returns, and the use of adjustable rate mortgages (ARMs) became widespread. The latter allowed lower payments at the start of the mortgage, increasing the potential demand for housing.
(3) Increased competition in the mortgage market, driven by the internet and by governmental policies aimed at increasing home ownership among lower-income households, made it much easier to obtain a mortgage. The result was an increase in securitization and in the share of loans to lower-income families in the portfolios of Fannie Mae and Freddie Mac.
(4) Irrational exuberance, defined by Shiller (2005) as a heightened state of speculative fervor. The imputed effect is that all agents somehow believed in ever-rising house prices, which increased speculation and securitization and lowered credit standards, fueling what is sometimes called a bubble. Credit rating agencies continued to give AAA ratings to securities backed by subprime loans on the assumption that house prices would continue to rise (historically, subprime loans are given to people with a bad credit history and carry higher rates, but also a default rate roughly ten times that of other mortgages). Eventually, because of the limited supply of borrowers with a good credit score, financial firms were forced to extend loans to increasingly dubious individuals to keep their fee revenues high. When the first losses materialized and house prices peaked, the default rate on loans skyrocketed, whoever had purchased securitized loans suddenly faced huge losses amplified by high leverage, and the bubble burst.

Choi (2010) provides an overview of the Californian housing market in the years surrounding the crisis; the market suffered a loss of more than 30% from its peak. California was also hit by a record-high foreclosure rate in the third quarter of 2009, coming from one of the lowest foreclosure rates in 2006, with spillover effects on crime and on prices.

Huge Losses, Rational Investors


Andrew W. Lo et al. (2013) build a theoretical model in which such an effect, the refinancing ratchet effect, can arise in the home equity market even when all participants are rational: the conditions described above, rising home prices, declining interest rates and near-frictionless refinancing opportunities, create an artificial level of leverage in which home owners are locked into their positions and the possibility of very large losses arises. This is due to the peculiar characteristics of homes as a financial asset: the indivisibility and occupant-ownership of residential real estate. Losses are moreover exacerbated by the nature of homes as a self-consumption good, which makes the market relatively illiquid.
The owner is usually the sole equity holder and, to avoid problems of moral hazard, is not in a position to raise external capital in the form of equity; comparing home ownership with exchange-traded instruments, one can already see that residential real estate has no counterpart to the maintenance margin. The combination of these effects means that when home owners are able to cash out easily through equity refinancing, they do so near the peak, with high house prices, low interest rates and high leverage.
Greenspan and Kennedy (2008) document how mortgage debt has increased more than home value and attribute this effect to equity extraction via home sales and cash-out refinancing.
As soon as the house price declines, the home equity cushion is wiped out and the borrower defaults on his position. Since this is equivalent to taking out a mortgage at the peak of the market, borrower defaults become highly correlated with each other compared to a situation in which no equity extraction is allowed. The simulation by Andrew W. Lo et al. shows that losses of the magnitude of those encountered in the financial crisis are possible in such a scenario, with estimated losses of 1.7 trillion dollars under frictionless home refinancing against 330 billion when refinancing is forbidden.

Doblas-Madrid (2012) builds a model of bubbles in asset markets on the work of Abreu and Brunnermeier (2003), in which agents are rational and prices are dictated by simple supply and demand at all times. Agents make probabilistic assumptions about the existence of a bubble over time and sometimes receive a private signal that confirms their belief that the market is in a bubble; even in this case, if the signal is perceived early enough, the rational choice is to ride the bubble for a profit. It is important to note that agents receive a periodic endowment that they can invest in the risky asset, which is strikingly similar to what happens in residential housing markets with equity refinancing.
The improvement of Doblas-Madrid (2012) over Abreu and Brunnermeier (2003) is that his model does not need to rest on behavioral agents that act irrationally and are always caught out when the bubble bursts. The key finding is that in this model bubbly and non-bubbly equilibria can coexist even without noise; adding noise merely facilitates the emergence of bubbly equilibria.

Since it is possible, at least in theory, for a bubble to arise even when markets are populated by rational investors, it is important to check what these theoretical rational investors look for in the real estate market in practice. For the reasons mentioned at the start of this text, I believe California to be a good testing ground.

Fundamental Drivers of House Prices

Capozza, Hendershott and Mack (2004), in their summary of the literature on the fundamental drivers of house prices, find wide consensus that as employment and population grow, rents and prices should also increase. The latter two should also increase with income and move in the opposite direction to some measure of the cost of capital.
Case and Shiller (2003) find that a similar set of variables seems to affect housing prices, although they focus much more on the role of expectations as a driving force of demand. They argue that a speculative bubble in housing markets may take place if expectations of large future price increases are enough to sustain demand. They find evidence that the run-up in home prices in the period 1985-2002 can actually be traced back to changes in income, at least in a majority of states, while in New England and in the states of New York, New Jersey, California and Hawaii income does not seem to be the major driver and house prices are consequently more volatile. Table 1 in the APPENDIX summarizes their findings and is taken directly from their work.
In the following model I also find income to be a statistically significant variable in explaining house prices, but while significant, when forecasting prices it adds complexity for a very small gain in explanatory power.

The Model
As a starting point I take four time series from the Federal Reserve Bank of St. Louis concerning the state of California: CASTHPI as a proxy for the general level of house prices (an all-transactions index that, unlike the well-known Case-Shiller index, also includes appraisal values rather than transaction prices only), total personal income (CAOTOT) and the number of new private housing units authorized by building permits in the previous period (CABPPRIV). I also include the MORTGAGE30US series (30-Year Fixed Rate Mortgage Average in the United States) as a proxy for the long-term interest rate.

The time series run from 1988 to 2015; to keep consistency I convert all of them to quarterly frequency by dropping the observations that do not fall at the end of a quarter. This avoids data imputation techniques such as mean substitution or regression imputation, which might sway the coefficients one way or the other. I also split the data, using the years 1988 to 2005 as the training set and the remaining data to test the model.
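A minimal sketch of this preparation step, in the spirit of the full script in the appendix (the CSV file name and its column layout are assumptions):

library(tseries)

# Hypothetical input file; the FRED series is assumed to sit in a column named CASTHPI
houseprice <- read.csv("CASTHPI_quarterly.csv")

# Quarterly training series, 1988-2005 (frequency = 4), as in the appendix code
CASTHPI <- ts(houseprice$CASTHPI, start = 1988, end = 2005, frequency = 4)

# Full series and test window from 2005 onwards (the appendix calls this ts_test)
ts1     <- ts(houseprice$CASTHPI, start = 1988, frequency = 4)
ts_test <- window(ts1, start = 2005)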
The reason for this split is twofold: is the model able to predict the sudden burst of the bubble in the crisis years 2006-2007? If the model is robust enough and is based on those same fundamentals, then we expect it to be able to predict a downturn based on those factors. If it is not, then has something else changed in that period that is not captured by the model? This paper explores the possibility that the HMDA data, at single-loan level of detail, can point to some answers to the second question.

As a first step, the time series need to be tested for non-stationarity. I use the Augmented Dickey-Fuller test, whose null hypothesis is that the series has a unit root (for more information see the documentation of the R package tseries). The results can be found in the tables in the appendix; from the test it is clear that the series are non-stationary.
First-order differencing resolves most of the stationarity issues for the four variables except for CASTHPI; to resolve the issue I first transform that series by taking its natural logarithm and then take the first-order difference for the analysis. I concede that linearizing an index that serves as a proxy for housing prices means making the perhaps too stringent assumption that prices follow a pattern of exponential growth; however, it does make the series behave similarly to a stationary one for the purposes of this paper, with an ADF-test p-value of 0.08.
The graphs below show the effect of the transformations on the data:

After confirming stationarity of the transformed series, I fit a multivariate linear model to the data by OLS, including only lagged terms. The inclusion of lagged terms makes it possible to produce a forecast for the future and to test the validity of the model against the unused sample data from 2005 onwards. The objective of this model is to predict the index in the next period, in our case the next quarter.

As a starting point I include all the factors lagged by one period; the regression results are summarized in the following table:
Regression 1

While I obtain an overall good adjusted R-squared of 0.697, most of the factors are not significant, which casts doubt on the actual usefulness of the regressors. I then drop every variable except the lagged CASTHPI, obtaining a pure autoregressive model. I also checked whether there were significant improvements in the explanatory power of the model from adding further AR(k) terms, but the marginal benefit of doing so was almost null; the regression outputs are summarized in the appendix.
In the end, the multivariate linear model with the best goodness of fit among those built from the four time series could be a simple AR(1): the best predictor of the house price index in the next quarter is the price of the index today.
Regression 2
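For reference, the two specifications as they are fitted in the appendix with the dyn package; Regression 1 corresponds to model_lagged_diff2 and Regression 2 to the AR(1) model_lagged_diff4:

library(dyn)

# Regression 1: all factors lagged one quarter, on the differenced (log) series
model_lagged_diff2 <- dyn$lm(diff(log(CASTHPI)) ~ diff(lag(log(CASTHPI), -1)) +
  diff(lag(MORTGAGE30US, -1)) + diff(lag(CABPPRIV, -1)) + diff(lag(CAOTOT, -1)))

# Regression 2: pure AR(1) on the differenced log index
model_lagged_diff4 <- dyn$lm(diff(log(CASTHPI)) ~ diff(lag(log(CASTHPI), -1)))

summary(model_lagged_diff2)
summary(model_lagged_diff4)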

These results indicate that while fundamental variables might have an effect on house prices, those considered in Regression 1 above increase the adjusted R-squared by only about 2 percentage points with respect to Regression 2, and all of them except Total Income are statistically insignificant. This seems to suggest that it may be better to drop fundamental variables such as the U.S. average mortgage interest rate when trying to forecast house prices in California.[1]
While the simple AR(1) model explains the majority of the change in the once-differenced time series, for the sake of completeness it is actually possible to obtain an R-squared of over 99% by not differencing and taking the time series as they are.

[1] It could simply mean that California's mortgage market is largely independent of the U.S.-wide one, and that home buyers simply do not bother to check how many houses have been approved for construction in the next period. The latter might be due to the nature of homes as a consumption good: one cannot expect the average buyer to be able to wait until new houses are constructed before settling in.
While the obviously high correlations between the regressors make a spurious regression very likely, it is still interesting to see how this model behaves when confronted with the test dataset (2005-2015).
This is the output of the regression with lagged variables from 1988 to
2005:

When using the model to make predictions on the test sample, the result was a consistent overestimation of the price. This is not surprising, considering that the training sample included the wild run-up in prices before the bubble burst. Just by inspection, the model seems not only unable to predict the burst of the bubble, but completely overshoots the peak and consistently underestimates the losses in the index.

Forecast on the test sample using the multivariate model with
lagged variables:

Since in the end we resorted to an autoregressive model of lag one, it seemed reasonable to explore the whole family of ARIMA models that could work with the data, including seasonal ones:

ARIMA(p, d, q)(P, D, Q)[S]

with:
p = non-seasonal AR order
d = non-seasonal differencing
q = non-seasonal MA order
P = seasonal AR order
D = seasonal differencing
Q = seasonal MA order
S = time span of the repeating seasonal pattern

Without differencing operations, the model can be written more formally as

Φ(B^S) φ(B) (x_t - μ) = Θ(B^S) θ(B) w_t

The non-seasonal components are:

AR: φ(B) = 1 - φ_1 B - ... - φ_p B^p
MA: θ(B) = 1 + θ_1 B + ... + θ_q B^q

The seasonal components are:

Seasonal AR: Φ(B^S) = 1 - Φ_1 B^S - ... - Φ_P B^(PS)
Seasonal MA: Θ(B^S) = 1 + Θ_1 B^S + ... + Θ_Q B^(QS)

I use the auto.arima function of the R package forecast to derive the best model according to the Akaike Information Criterion (AIC), with a maximum differencing order of 4; the space of possible models is searched using a stepwise selection (for further information, links to the various R packages are in the APPENDIX).
It could be argued that using the BIC would give a more reliable model because it penalizes additional parameters more heavily. Aho et al. (2014) explain why it is preferable to choose AIC when the goal of the model is to forecast a complex system; in this case over-fitting is a lesser concern than forecasting accuracy.
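A sketch of the selection call, mirroring the appendix (the stepwise search is auto.arima's default; max.d = 4 reflects the maximum differencing mentioned above):

library(forecast)

# Search ARIMA/SARIMA models on the training series, ranking them by AIC
automaticARIMA <- auto.arima(CASTHPI, ic = "aic", max.d = 4)
summary(automaticARIMA)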
On our training dataset the following ARIMA model is returned according
to the criteria above:

ARIMA MODEL
CASTHPI
ARIMA(0,2,3)(2,0,0)[4]

Coefficients:
ma1 ma2 ma3 sar1 sar2
-0.3405 -0.3073 0.4496 -0.2449 0.4136
s.e. 0.1176 0.1393 0.1153 0.1838 0.1912

sigma^2 estimated as 19: log likelihood=-192.44


AIC=396.88 AICc=398.28 BIC=410.11

Training set error measures:


ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.4169961 4.131388 2.728491 0.1181862 0.9343395 0.1153471 -0.03150692

From the table above, the mean absolute percentage error when forecasting the next period with the ARIMA model is quite low, at 0.93%; however, it would have been strange if the algorithm had not been able to fit the model to the training data. It is much more meaningful to see what happens when we move outside the training data and check how the model performs on the test dataset 2005-2015, or in a potential future.
In this case the time series of CASTHPI is called ts_test and the output is collected in the following table:

ARIMA OUTPUT ON TEST DATASET


Series: ts_test
ARIMA(0,2,3)(2,0,0)[4]

Coefficients:
ma1 ma2 ma3 sar1 sar2
-0.3405 -0.3073 0.4496 -0.2449 0.4136
s.e. 0.0000 0.0000 0.0000 0.0000 0.0000

sigma^2 estimated as 19: log likelihood=-161.51


AIC=325.01 AICc=325.11 BIC=326.78

Training set error measures:


ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.58424 9.807598 7.827307 0.02112604 1.651646 0.1478197 0.42462
As expected, all the measures of forecasting accuracy get worse; however, even on the test data the model still manages to retain a MAPE of 1.65%, which is already good enough for real-world application.
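A sketch of this out-of-sample check, following the appendix code: the coefficients estimated on the training window are frozen and the model is simply re-applied to the 2005-2015 window.

# Apply the fitted model, with frozen coefficients, to the test window
test_ARIMA <- Arima(ts_test, model = automaticARIMA)
summary(test_ARIMA)
accuracy(test_ARIMA)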
I construct the plot of the residuals, their autocorrelation function and their distribution to examine whether there is anything that would make the residuals appear as if not drawn from a normal distribution:

It can be seen that the variance of the residuals seems to increase in the years right after the crisis, in the period 2008-2010. The autocorrelation function is interesting because it suggests a strong negative autocorrelation every 8 periods; whether this is an indication of mean reversion is hard to say, because the test sample contains only around 40 observations. For the same reason, the histogram of the residuals is not very helpful in suggesting whether the residuals are drawn from a normal distribution. Running a Shapiro-Wilk normality test on the residuals resulted in a p-value of 0.38, but this is probably also due to the low number of observations. In this case, while it is not strong empirical evidence, a QQ-plot is a better indicator of the normality of the residuals.

The values lie mostly on a straight line apart from a few outliers; considering the small test sample size, for the purposes of this model we can assume that the residuals are more or less normally distributed.
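These diagnostics can be reproduced with a few lines, as in the appendix:

# Residual plots, ACF and histogram for the test-period fit
checkresiduals(test_ARIMA)

# Normality checks on the residuals
ARIMA_residuals <- residuals(test_ARIMA)
qqnorm(ARIMA_residuals)
qqline(ARIMA_residuals)
shapiro.test(ARIMA_residuals)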
Finally, it is also useful to plot the predicted values against the actual ones and see how they fare.[2] The blue points are the predictions from our model, while the black line shows the actual observations:

[2] Robert Tibshirani and Trevor Hastie in An Introduction to Statistical Learning suggest always looking at plots before any analysis, and I wholeheartedly agree with them.

ARIMA MODEL FIT&FORECAST:
Predicted values in blue
Actual Values in black

Two things are noticeable to the eye: the model has a difficult time predicting the next period's value both at the peak in 2006 and after the first sharp decrease in the years 2008-2010, which is consistent with what we found in the residuals.
In the financial literature, one factor that is usually blamed for the financial crisis is progressively laxer lending standards and the predatory behavior of lenders. If that is truly the case, it seems plausible that digging into single-loan-level data should shed some light on what is causing the model to misbehave around the bubble peak.
If lax lending standards really were one of the fundamental factors of the crisis, I expect to find some correlation between the forecasting errors and measures of the above-mentioned lending standards, since the model was calibrated before the actual crash and is overall consistent with past data.

HMDA Data and the Financial Crisis
Since 1975 the Home Mortgage Disclosure Act has required public disclosures from most mortgage lending institutions based in metropolitan areas of the US. The information disclosed is made available to help the public determine whether the institutions are adequately serving the population's home-financing needs and to help enforce fair lending laws, with special regard to racial discrimination. The information is disclosed at the level of the single loan and contains multiple parameters such as: applicant race, co-applicant race, income, county, census tract, loan amount, rejection/acceptance and reason for rejection (for the full list of parameters please check the APPENDIX).
The dataset I use is the aforementioned HMDA data, downloadable directly from the US government with little manipulation. I took only the subset concerning the state of California, which spans 2007 to 2015 (data before that are much harder to retrieve). Two considerations must be made: (1) I was not able to obtain data prior to 2007 that could easily be homogenized with the most recent years, both because of changes in reporting standards and formats and because those data must be downloaded in a format other than CSV, and the treatment of missing values in that raw text file was too time consuming, so energies could better be directed elsewhere; (2) even if one were able to retrieve and clean the data without complications, one should still carefully examine the documentation on reporting standards, since changes over time might otherwise lead to wrong conclusions.
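A sketch of the loading step for one year, along the lines of the appendix loop (the local file name is an assumption):

library(data.table)

# Each year of California HMDA filings is assumed to be stored as a separate CSV file
hmda_2015 <- fread("hmda_ca_2015.csv")

# Keep only the fields used in the logistic regression below
hmda_2015 <- subset(hmda_2015,
                    select = c(action_taken_name, applicant_income_000s,
                               loan_amount_000s, loan_purpose_name,
                               agency_name, county_name))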

Explaining the Errors


I run a logistic regression on the accepted/rejected status of every loan, on annual data from 2007 to 2015, using as factors: the county, as an indication of geographical location; the income of the applicant, binned into classes from low to high; the amount of the loan; the purpose of the loan; and the agency name. Each year the model extracts 1 million random loans from the HMDA data (the total number of filings each year is around 2 million) and the regression is run on them to estimate the coefficients. The accuracy of the model is then tested on another random sample of 1 million observations, assigning a class to the dependent variable depending on whether the predicted probability is above 0.5 or not. The procedure is then repeated for every year.
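A condensed sketch of one year's estimation and accuracy check; the column names follow the appendix code, while the explicit 0/1 recoding of the outcome and the train_sample/test_sample data frames are simplifications of my own:

# Recode the outcome as 1 = loan originated, 0 = otherwise (label assumed from the HMDA files)
train_sample$originated <- as.integer(train_sample$action_taken_name == "Loan originated")
test_sample$originated  <- as.integer(test_sample$action_taken_name == "Loan originated")

logistic_regression <- glm(originated ~ applicant_income_000s + loan_amount_000s +
                             loan_purpose_name + agency_name + county_name,
                           family = binomial(link = "logit"), data = train_sample)

# Classify the held-out random sample with a 0.5 threshold and measure accuracy
p_hat     <- predict(logistic_regression, newdata = test_sample, type = "response")
predicted <- ifelse(p_hat > 0.5, 1, 0)
accuracy  <- mean(predicted == test_sample$originated)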
The usefulness of the logistic regression is that we can now see whether lenders are somehow changing the way they screen applicants by looking at the changes in the accuracy of the model from year to year.
What we find is that in 2007 the model has very low predictive power, which gradually increases to about 80 percent by 2015. In the following table I summarize the accuracy of the logistic regression, the percentage of loan applications rejected in the HMDA data and the percentage of loan applications denied due to collateral (e.g. insufficient collateral or changes in appraisal value). The regression errors from the ARIMA are the sum of the absolute values of the errors in the relevant year. The goal is to understand why the model loses accuracy in the years 2008-2010; absolute values are a good way to capture the extent of the divergence from the true values, since we are less concerned with the direction of the divergence.

Table with logistic regression accuracy and % rejections per year

Year   Absolute ARIMA Errors   logi_accuracy2   % of rejection   % of rejection due to collateral
2007   39.4984605              0.1584           16.33%           17.82%
2008   31.2466815              0.2009           15.59%           26.30%
2009   59.85644084             0.3485           12.11%           29.39%
2010   34.03332795             0.5405           12.23%           23.43%
2011   43.08906036             0.4712           11.07%           23.52%
2012   11.23246536             0.7225           10.65%           20.63%
2013   25.68428571             0.7347           11.53%           17.59%
2014   29.49440117             0.7636           11.24%           15.30%
2015   20.96592701             0.8188           10.00%           14.99%

I then construct the correlation matrix; the result shows that the absolute errors of the ARIMA are negatively correlated with the accuracy of the logistic model (corr. coef. -0.62), while they are positively correlated with the rejection rate of the loans (corr. coef. 0.31) and more so with the rejection rate due to collateral (corr. coef. 0.62).

                                   Absolute ARIMA Errors   logi_accuracy2   % of rejection   % of rejection due to collateral
Absolute ARIMA Errors              1
logi_accuracy2                     -0.620                  1
% of rejection                     0.310                   -0.882           1
% of rejection due to collateral   0.623                   -0.603           0.260            1
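For completeness, a sketch of how the matrix above can be computed (the data frame simply re-enters the values from the yearly table, with the ARIMA errors rounded):

yearly_summary <- data.frame(
  abs_arima_errors   = c(39.50, 31.25, 59.86, 34.03, 43.09, 11.23, 25.68, 29.49, 20.97),
  logi_accuracy2     = c(0.1584, 0.2009, 0.3485, 0.5405, 0.4712, 0.7225, 0.7347, 0.7636, 0.8188),
  pct_rejection      = c(16.33, 15.59, 12.11, 12.23, 11.07, 10.65, 11.53, 11.24, 10.00),
  pct_rej_collateral = c(17.82, 26.30, 29.39, 23.43, 23.52, 20.63, 17.59, 15.30, 14.99))

# Pairwise Pearson correlations between the yearly measures
round(cor(yearly_summary), 3)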

Conclusion
The model with the best goodness of fit that I could find was an ARIMA model trained on the data from 1988 to 2005, which yields a very low mean absolute percentage error also when forecasting the next period outside the training sample.
Even after trying to account for fundamental variables common in the financial literature, I found little success with them, and there are a number of possible reasons for that:
(1) I used the wrong variables. The use of long-term interest rates instead of short-term ones is debatable, because what the market usually reacts to are announcements from the FED, and the latter can directly impact only short-term rates. While total income was significant, the number of new homes approved for construction was not, but this could be because I went back only one period, while the variable could have an impact many years down the road, considering that the average time to build a home, refurbish it and sell it is well above one quarter. I also ignored population, which is a direct driver of demand for real estate.
(2) I have also mostly ignored or glossed over the price-rent ratio, and no discussion of house prices can be complete without it. It is, however, a complicated matter, because of the variation in quality between the real estate that is commonly rented out and the real estate used as a first home for dwelling purposes. Sorting out all the problems regarding the price-rent ratio and making a reasonable set of assumptions seems like a good direction for a future paper, also considering the extensive literature surrounding the topic.

In the end I found evidence that lending standards do in fact change through time and should not be assumed constant; the variation from year to year in the coefficients of the logistic regression and the changes in its overall accuracy are evidence of this.
Moreover, ARIMA models seem to do a good job of predicting overall house price index changes, but appear to fail to account for factors that might depend more on the psyche of investors, such as lending standards.
One direction I would like to explore in the future is to construct longer time series spanning different countries in order to obtain stronger empirical evidence. While correlation was found between changes in the logistic model, changes in the rejection rate and the errors of the ARIMA, I was able to perform my analysis on only 9 years; as such, while it might be a hint about the overall process that governs house prices in California, it is in no way statistical proof. Furthermore, if one were able to construct a better predictive model of mortgage acceptance/rejection to replace the logistic regression, the study of the changes in the model's coefficients could yield better insight into how lending standards are evolving and which factors are the ones to keep an eye on.

References
1. Aho, K., Derryberry, D., & Peterson, T. (2014). Model selection for ecologists: the worldviews of AIC and BIC. Ecology, 95(3), 631-636.
2. Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716-723.
3. Greenspan, A., & Kennedy, J. (2008). Sources and uses of equity extracted from homes. Oxford Review of Economic Policy, 24(1), 120-144.
4. Lo, A. W., Khandani, A. E., & Merton, R. C. (2013). Systemic risk and the refinancing ratchet effect. Journal of Financial Economics, 108(1), 29-45.
5. Capozza, D. R., Hendershott, P. H., & Mack, C. (2004). An Anatomy of Price Dynamics in Illiquid Markets: Analysis and Evidence from Local Housing Markets. Real Estate Economics, 32(1), 1-32.
6. Choi, L. (2010). The Current Landscape of the California Housing Market. Federal Reserve Bank of San Francisco, Working Paper 2010-03, September 2010.
7. Doblas-Madrid, A. (2012). A robust model of bubbles with multidimensional uncertainty. Econometrica, Vol. 80, September 2012.
8. Englund, P., Hwang, M., & Quigley, J. M. (2002). Hedging Housing Risk. Journal of Real Estate Finance and Economics, 24, 167-200.
9. Flavin, M., & Yamashita, T. (2002). Owner-Occupied Housing and the Composition of the Household Portfolio. American Economic Review, 92, 345-362.
10. Helbling, T., & Terrones, M. (2003). When Bubbles Burst. In World Economic Outlook, Chapter 2, pp. 61-94. Washington DC: International Monetary Fund.
11. Holt, J. (2009). A Summary of the Primary Causes of the Housing Bubble and the Resulting Credit Crisis: A Non-Technical Paper. The Journal of Business Inquiry.
12. Campbell, J. Y., & Shiller, R. J. (1988). Stock Prices, Earnings and Expected Dividends. Journal of Finance.
13. Case, K., & Shiller, R. (2003). Is There a Bubble in the Housing Market? Brookings Papers on Economic Activity, 34(2), 299-362.
14. Shiller, R. J. (2005). Definition of Irrational Exuberance. Irrationalexuberance.com.

R Packages used and related documentation:

ade4 ( https://cran.r-project.org/web/packages/ade4/ade4.pdf )
biglm ( https://cran.r-project.org/web/packages/biglm/biglm.pdf )

caret ( https://cran.r-project.org/web/packages/caret/caret.pdf )
data.table ( https://cran.r-project.org/web/packages/data.table/data.table.pdf )

dplyr ( https://cran.r-project.org/web/packages/dplyr/dplyr.pdf )

dyn (https://cran.r-project.org/web/packages/dyn/dyn.pdf )
forcats ( https://cran.r-project.org/web/packages/forcats/forcats.pdf )

forecast (https://cran.r-project.org/web/packages/forecast/forecast.pdf )
ggplot2 ( https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf )

MASS ( https://cran.r-project.org/web/packages/MASS/MASS.pdf )

sjPlot (https://cran.r-project.org/web/packages/sjPlot/sjPlot.pdf )
tseries ( https://cran.r-project.org/web/packages/tseries/tseries.pdf )

xlsx ( https://cran.rstudio.com/web/packages/xlsx/xlsx.pdf )

Other sources:
https://stackoverflow.com/questions/5048638/automatically-expanding-an-r-factor-into-a-collection-of-1-0-indicator-variables (used to create the dummy variables in the logistic regression)
Textbook: An Introduction to Statistical Learning by Robert Tibshirani and Trevor Hastie

APPENDIX
Total Personal Income calculation:

Earnings by Place of Work


- Personal Contributions for Social Insurance
+ Residence Adjustment
= Net Earnings by Place of Residence
+ Dividends, Interest and Rent
+ State Unemployment Benefits
+ Transfers less State Unemployment Benefits
= Personal Income
Table 1

ADF TESTS
Augmented Dickey-Fuller Test for stationarity on non-transformed time-series

data: MORTGAGE30US
Dickey-Fuller = -2.8711, Lag order = 4, p-value = 0.2211
alternative hypothesis: stationary

data: CASTHPI
Dickey-Fuller = 1.6863, Lag order = 4, p-value = 0.99
alternative hypothesis: stationary

data: CABPPRIV
Dickey-Fuller = -2.63, Lag order = 4, p-value = 0.3191
alternative hypothesis: stationary

data: CAOTOT
Dickey-Fuller = -1.4435, Lag order = 4, p-value = 0.8014
alternative hypothesis: stationary


Other Polynomial models

Various polynomial fits with ts1 = CASTHPI, ts2 = 30-year mortgage rate, ts3 = new houses approved for construction, ts5 = Total Income

lm(formula = dyn(ts1 ~ lag(ts1, -1) + lag(ts2, -1) + lag(ts3, -1) + lag(ts5, -1) +
    I(lag(ts1, -1)^2) + I(lag(ts1, -1)^3)))

Residuals:
Min 1Q Median 3Q Max
-4.6421 -0.8621 0.2357 0.8448 4.6432

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.194e+02 7.463e+02 -0.696 0.492
lag(ts1, -1) 8.720e+00 1.149e+01 0.759 0.454
lag(ts2, -1) 5.508e-01 7.992e-01 0.689 0.496
lag(ts3, -1) -1.020e-04 1.172e-04 -0.870 0.391
lag(ts5, -1) -3.835e-08 1.184e-08 -3.239 0.003 **
I(lag(ts1, -1)^2) -3.557e-02 5.809e-02 -0.612 0.545
I(lag(ts1, -1)^3) 5.308e-05 9.725e-05 0.546 0.589
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.223 on 29 degrees of freedom


(147 observations deleted due to missingness)
Multiple R-squared: 0.9823, Adjusted R-squared: 0.9787
F-statistic: 268.7 on 6 and 29 DF, p-value: < 2.2e-16
lm(formula = dyn(ts1 ~ lag(ts1, -1) + lag(ts3, -1) + lag(ts5, -1) + I(lag(ts1, -1)^2)))

Residuals:
Min 1Q Median 3Q Max
-5.0617 -1.9625 -0.3682 1.0633 10.6334

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.148e+00 8.901e+00 0.241 0.8101
lag(ts1, -1) 8.402e-01 6.805e-02 12.345 < 2e-16 ***
lag(ts3, -1) 7.343e-04 8.901e-05 8.250 2.07e-11 ***
lag(ts5, -1) 1.921e-08 3.301e-09 5.820 2.57e-07 ***
I(lag(ts1, -1)^2) 2.530e-04 1.163e-04 2.176 0.0336 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.831 on 59 degrees of freedom


(2 observations deleted due to missingness)
Multiple R-squared: 0.9979, Adjusted R-squared: 0.9977
F-statistic: 6955 on 4 and 59 DF, p-value: < 2.2e-16

R CODE FOR THE MODELS
MULTIVARIATE AND ARIMA MODEL

#########final model
library(dyn)
library(tseries)
library(sjPlot)
library(forecast)

#######adjusting time series for modeling

CASTHPI = ts(houseprice$CASTHPI, start = 1988, end = 2005, frequency = 4)
MORTGAGE30US = ts(interest$MORTGAGE30US, start = 1988, end = 2005, frequency = 4)
CABPPRIV = ts(newhouses$CABPPRIV, start = 1988, end = 2005, frequency = 4) # with frq = 12 we kept all the observations
CAOTOT = ts(totalincome$CAOTOT, start = 1988, end = 2005, frequency = 4)

install.packages("sjPlot", dependencies = TRUE)

model_lagged_diff2 <- dyn$lm(diff(log(CASTHPI)) ~ diff(lag(log(CASTHPI), -1)) +
  diff(lag(MORTGAGE30US, -1)) + diff(lag(CABPPRIV, -1)) + diff(lag(CAOTOT, -1)))
summary(model_lagged_diff2) ## good but not so much

model_lagged_diff3 <- dyn$lm(diff(log(CASTHPI)) ~ diff(lag(log(CASTHPI), -1)) +
  diff(lag(CAOTOT, -1)) + diff(lag(log(CASTHPI), -2)))
summary(model_lagged_diff3)

model_lagged_diff4 <- dyn$lm(diff(log(CASTHPI)) ~ diff(lag(log(CASTHPI), -1))) #+ diff(lag(log(CASTHPI), -2))
summary(model_lagged_diff4)

model_lagged_diff <- dyn$lm(CASTHPI ~ diff(lag(log(CASTHPI), -1)) + diff(lag(CABPPRIV, -1)) + diff(lag(CABPPRIV, -2)))
summary(model_lagged_diff)

model_lagged <- dyn$lm(CASTHPI ~ lag(CASTHPI, -1) + lag(MORTGAGE30US, -1) + lag(CABPPRIV, -1) + lag(CAOTOT, -1))
summary(model_lagged)

##########print cool summary table, diff4 the best one

sjt.lm(model_lagged_diff4)
sjt.lm(model_lagged_diff2)
sjt.lm(model_lagged)

#########Squared model

model_lagged_squared <- dyn$lm(diff(log(CASTHPI)) ~ diff(lag(log(CASTHPI), -1)) + diff(lag(log(CASTHPI), -2)) +
  I(diff(lag(log(CASTHPI), -1))^2) + I(diff(lag(log(CASTHPI), -2))^2))
summary(model_lagged_squared)

#########Testing time series for stationarity all series
adf.test(ts1, alternative = "stationary")
adf.test(ts2, alternative = "stationary")
adf.test(ts3, alternative = "stationary")
adf.test(ts5, alternative = "stationary")

#########Testing the differenced series
adf.test(diff(ts1), alternative = "stationary")
adf.test(diff(ts2), alternative = "stationary")
adf.test(diff(ts3), alternative = "stationary")
adf.test(diff(ts5), alternative = "stationary")

#########Testing training time series

adf.test(CASTHPI, alternative = "stationary")


adf.test(MORTGAGE30US, alternative = "stationary")
adf.test(CABPPRIV, alternative = "stationary")
adf.test(CAOTOT, alternative = "stationary")

adf.test(diff(log(CASTHPI)), alternative = "stationary")


adf.test(diff(MORTGAGE30US), alternative = "stationary")
adf.test(diff(CABPPRIV), alternative = "stationary")
adf.test(diff(CAOTOT), alternative = "stationary")
#######
adf.test(diff(CASTHPI), alternative = "stationary")

####printing graphs of the stuff above


plot(CASTHPI)
plot(log(CASTHPI))

plot(diff(log(CASTHPI)))
abline(a = 0, b = 0 )

#####Forecast, last model is lagged diff 4

CASTHPI_dataframe <- as.data.frame(ts1)
#newdat = predict(model_lagged_diff4, newdata = ts1, se.fit = TRUE) # CASTHPI_dataframe, se
forecasted_hpi = forecast.lm(model_lagged_diff4, newdata = ts1)
plot(diff(log(ts1)), ylim = c(0, 0.15))
points(forecasted_hpi$mean, col = "blue")

forecasted = ts(forecasted_hpi$mean, start = 2005)
originalts = ts(ts1, start = 2005)
forecast_e = forecasted - originalts
sumofdeviation = sum(forecast_e)

########After the failure in forecasting we rely on auto.arima to estimate the best model. (We are still using almost stationary series)
#playing with arima models.

automaticARIMA = auto.arima(CASTHPI)
summary(automaticARIMA)

####check if aic is good enough
automaticARIMA2 = auto.arima(CASTHPI, ic = "aic")
summary(automaticARIMA2)

#sjt.lm(automaticARIMA) # sjt.lm does not handle Arima objects
forecast_AArima <- forecast.Arima(automaticARIMA, h = 50, bootstrap = TRUE)
summary(forecast_AArima)
plot.forecast(forecast_AArima)

accuracy(forecast_AArima, ts1, test = NULL, d = NULL, D = NULL)

automaticARIMA_log <- auto.arima(log(CASTHPI))
summary(automaticARIMA_log)
forecast_AArima_log <- forecast.Arima(automaticARIMA_log, h = 50, bootstrap = TRUE)
plot.forecast(forecast_AArima_log)

#Export time series data to make the calculations on excel, talk about impractical, but what we want is a model that learns each period

###forecasting ARIMA model with the future data

prediction_ARIMA_auto <- Arima(ts1, model = automaticARIMA)


plot.Arima(prediction_ARIMA_auto)
summary(prediction_ARIMA_auto)

ts_test = window(ts1, 2004, 2015)


test_ARIMA = Arima(ts_test, model = automaticARIMA)
summary(test_ARIMA)
### fit the model to the whole dataset
plot(prediction_ARIMA_auto$x)
lines(fitted(prediction_ARIMA_auto), col="blue")
######only on test data

plot(test_ARIMA$x)
lines(fitted(test_ARIMA), col="blue")
lines(fitted(ets_model_test), col = "red")
###ets model

ets_model <- ets(CASTHPI)


summary(ets_model)

checkresiduals(ets_model)
ets_model_test <- ets(ts_test, model = ets_model)
summary(ets_model_test)
checkresiduals(ets_model_test)

checkresiduals(test_ARIMA)
### check residuals and normality of them
model_lagged_diff2 <- dyn$lm(diff(log(CASTHPI)) ~ diff(lag(log(CASTHPI), -1)) +
  diff(lag(MORTGAGE30US, -1)) + diff(lag(CABPPRIV, -1)) + diff(lag(CAOTOT, -1)))
summary(model_lagged_diff2) ## good but not so much
checkresiduals(model_lagged_diff2)

ARIMA_residuals <- test_ARIMA$residuals
View(ARIMA_residuals)
qqnorm(ARIMA_residuals)
qqline(ARIMA_residuals)
shapiro.test(ARIMA_residuals) ## ignore, p-value related to the low number of observations
write.xlsx(ARIMA_residuals, "D:/residuals_coeff.xlsx", sheetName = "Sheet1")

LOGISTIC REGRESSION

library(data.table)
library(ade4)
library(dplyr)
library(forcats)
library(ggplot2)
library(caret)
library(MASS)
#library(biglm)

#####Random Sample # probably found on stackoverflow but it does its job, I am sorry I do not remember where I took this.
randomSample = function(df, n) {
  return(df[sample(nrow(df), n), ])
}

###create null dataframe to store regression coefficients
#Logistic_coefficients <- df(NULL)
logi_regr_coef <- t(as.data.frame(logistic_regression$coefficients))
logi_regr_coef = logi_regr_coef[-1, ]
View(logi_regr_coef)
#### Loop for regressions

for(i in 2007:2015){
name <-paste("data",i, sep = "")
print(name)
location = paste("D://hmdadata//final//",i, ".csv", sep = "")
Object<-fread(location)

Object<- subset(Object, select = c(action_taken_name, applicant_sex_name, appli


cant_income_000s, loan_amount_000s, loan_purpose_name, agency_name, lien_status_na
me, lien_status_name, county_name, county_name))

#Object<-randomSample(Object, 300000)
Object$action_taken_name <- as.factor(Object$action_taken_name)
Object$applicant_sex_name <- as.factor(Object$applicant_sex_name)
Object$loan_purpose_name <- as.factor(Object$loan_purpose_name)
Object$agency_name <- as.factor(Object$agency_name)
29
Object$lien_status_name <- as.factor(Object$lien_status_name)
Object$county_name <- as.factor(Object$county_name)
Object$applicant_income_000s <- as.numeric(Object$applicant_income_000s)
Object$loan_amount_000s <- as.numeric(Object$loan_amount_000s)

  ###factoring applicant income and loan amount

  Object$action_taken_name = Object$action_taken_name %>% fct_collapse(
    Not_originated = c("Application approved but not accepted",
                       "Application denied by financial institution",
                       "Application withdrawn by applicant",
                       "File closed for incompleteness",
                       "Loan purchased by the institution",
                       "Preapproval request approved but not accepted",
                       "Preapproval request denied by financial institution"))

  Object$applicant_income_000s <- cut(Object$applicant_income_000s,
                                      breaks = c(-Inf, 25, 50, 135, Inf),
                                      labels = c("low", "low-medium", "medium-high", "high"),
                                      right = FALSE)

  Object$loan_amount_000s <- cut(Object$loan_amount_000s,
                                 breaks = c(-Inf, 250, 400, Inf),
                                 labels = c("low_loan", "medium_loan", "high_loan"),
                                 right = FALSE)

testsample<-randomSample(Object, 1000000)
Object<-randomSample(Object, 1000000)
###logistic model
assign(paste("data", i, sep = ""), Object)

#calculate model and test on


  logistic_regression = glm(formula = action_taken_name ~ applicant_income_000s +
      loan_amount_000s + loan_purpose_name + agency_name + county_name,
    family = binomial(link = 'logit'), data = Object)

  ###store coefficients
  transpose = t(logistic_regression$coefficients)
  logi_regr_coef <- rbind(logi_regr_coef, transpose[1, ])

  ######store accuracy
  fitted.results <- predict(logistic_regression, newdata = testsample, type = 'response')
  fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
  assign(paste("logi_acc", i, sep = ""), fitted.results)
  #misClasificError <- rbind(misClasificError, mean(fitted.results != testsample$action_taken_name))
  remove(Object)
}
#######make confusion Matrix

###can write this later

#plot(data2015$loan_amount_000s)

#quantile(data2015$loan_amount_000s, c(.33, .66, 1))
#regressiondataframe = acm.disjonctif(data2015)

logistic_regression = glm(formula = action_taken_name ~ applicant_income_000s +
    loan_amount_000s + loan_purpose_name + agency_name + county_name,
  family = binomial(link = 'logit'), data = data2015)
summary(logistic_regression)

data2015$action_taken_name = data2015$action_taken_name %>% fct_collapse(
  Not_originated = c("Application approved but not accepted",
                     "Application denied by financial institution",
                     "Application withdrawn by applicant",
                     "File closed for incompleteness",
                     "Loan purchased by the institution",
                     "Preapproval request approved but not accepted",
                     "Preapproval request denied by financial institution"))

##############
####Accuracy test
fitted.results <- predict(logistic_regression,newdata=data2015,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
#misClasificError <- mean(fitted.results != testsample$action_taken_name)

summary(fitted.results)

summary(logi_acc2007)
summary(logi_acc2008)
summary(logi_acc2009)
summary(logi_acc2010)
summary(logi_acc2011)
summary(logi_acc2012)
summary(logi_acc2013)
summary(logi_acc2014)
summary(logi_acc2015)
accuracy_logi1 <- c(summary(logi_acc2007)[4],
summary(logi_acc2008)[4],
summary(logi_acc2009)[4],
summary(logi_acc2010)[4],
summary(logi_acc2011)[4],
summary(logi_acc2012)[4],
summary(logi_acc2013)[4],
summary(logi_acc2014)[4],
summary(logi_acc2015)[4])

accuracy_logi2 <- c(summary(logi_acc2007)[4],


summary(logi_acc2008)[4],
summary(logi_acc2009)[4],
summary(logi_acc2010)[4],
summary(logi_acc2011)[4],
summary(logi_acc2012)[4],
summary(logi_acc2013)[4],
summary(logi_acc2014)[4],
summary(logi_acc2015)[4])
accuracy_logi1
accuracy_logi2
library(xlsx)

write.xlsx(logi_regr_coef, "D:/logregression_coeff.xlsx", sheetName="Sheet1")
