
2012

Predicting debt crises

Konstantinos Stavrou (70134), Abo Akademi, 12/2/2012

Problem Definition
In summary, the Head of Division expects me to identify which countries are vulnerable to a financial crisis in 2013 (by predicting a crisis factor for 2012). I have been given historical data from 1989 until 2011. My aim is to use the financial data of 2012 to predict whether a country will undergo a financial crisis in 2013. To achieve this I first have to become familiar with the data; then I will be able to apply the proper transformations in order to obtain better-quality inputs. Most of the data preparation has already been done by my assistant, but my input will be critical for the success of this process. Then comes the most important part: I have to create three different models (logistic regression, decision tree and support vector machine), train them using part of the data, assess their usefulness for policy makers, choose the most suitable one, and use its threshold when running the models on the test data. The model with the highest usefulness in that last test cycle will be the one used to predict the C12 crisis factor for 2012.

Data preparation
Below is a snapshot of the data as loaded into the R environment.
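The snapshot itself is not reproduced here; as a minimal sketch, the data could be loaded and inspected as follows (the file name is a placeholder, not the actual one used):

# Sketch of loading and inspecting the data; "crisis_data.csv" is hypothetical.
input_data <- read.csv("crisis_data.csv")
str(input_data)    # expected columns: country, year, gdp_gr, inf, mch, ted_res, cac_res, C12
head(input_data)   # first rows, as shown in the original snapshot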

Data Exploration

The data clearly do not follow a normal distribution; even the GDP growth rate, which comes closest, shows positive skewness. Most of the variables also contain many outliers that deviate hugely from the mean. Some outliers are expected, but here they are both numerous and extremely far from the mean value. The data will therefore have to be transformed, and some outliers will most probably be removed, although this must be done with extreme caution so that we do not discard important observations. The distributions also appear leptokurtic, meaning that a large portion of the values is concentrated near the mean rather than spread evenly.
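For reference, a minimal sketch of the two transformations used later in the Modelling section, assuming the indicators sit in columns 3:7 as in the Code section (the exact percentile definition used here is an assumption; a rank-based one is shown):

# Percentile transformation (rank-based definition, assumed)
to_percentile <- function(x) rank(x, ties.method = "average") / length(x)
CC.pct <- input_data
CC.pct[, 3:7] <- lapply(CC.pct[, 3:7], to_percentile)
# Scaling transformation: mean 0, standard deviation 1
CC.std <- input_data
CC.std[, 3:7] <- scale(CC.std[, 3:7])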

The numerical summaries confirm what we observed in the plots. The standard deviations are extremely large, and although most values cluster near the mean (which is why the quartiles are small), the minima and maxima lie far away. The clustering of values near the mean is also supported by the large kurtosis values.
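A sketch of how these summaries could be produced, assuming CC.dat is a raw (not yet transformed) copy of input_data and using the skewness and kurtosis helpers from the e1071 package, which is loaded anyway for the SVMs:

library("e1071")
# mean, standard deviation, skewness and kurtosis for each indicator
round(sapply(CC.dat[, 3:7], function(x)
  c(mean = mean(x), sd = sd(x),
    skewness = skewness(x), kurtosis = kurtosis(x))), 3)
summary(CC.dat[, 3:7])   # quartiles, minima and maxima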

At first glance it is clearly difficult to discriminate between the two classes; they overlap considerably. The worst discriminating indicators are money growth and the inflation rate, where red and black can barely be told apart. The discrimination in the other bivariate plots is much better, with external debt versus current account, GDP versus current account, and external debt versus GDP being the best choices.

The correlation matrix below confirms these observations: the inflation rate and money growth are dependent on each other and cannot help us in modelling or in discriminating between the two outcomes, while external debt, current account and GDP show a slight negative correlation that is much more helpful in discriminating between the outcomes.

Looking at the scatter plot and the box plot below, we see that there is no strong discriminatory power. We could say, though, that a minor discrimination can be achieved with money growth and external debt, which is clearer in the box plot.
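A sketch of the plots and the correlation matrix referred to above, assuming CC.dat$C12 has already been converted to a 0/1 factor as in the Code section:

# Pairwise scatter plots coloured by class: 1 = black (tranquil), 2 = red (crisis)
pairs(CC.dat[, 3:7], col = as.integer(CC.dat$C12))
cor(CC.dat[, 3:7])                                   # correlation matrix
boxplot(mch ~ C12, data = CC.dat, main = "Money growth by crisis class")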

> t.test(CC.dat[,3]~CC.dat[,8])$p.value
[1] 1.467075e-10
> t.test(CC.dat[,4]~CC.dat[,8])$p.value
[1] 0.01372336
> t.test(CC.dat[,5]~CC.dat[,8])$p.value
[1] 0.01069683
> t.test(CC.dat[,6]~CC.dat[,8])$p.value
[1] 3.95334e-31
> t.test(CC.dat[,7]~CC.dat[,8])$p.value
[1] 2.894274e-08

The p-values for all five indicators are below 0,05, meaning that the class means differ significantly: the outcome of the crisis factor is not left to chance but actually depends on these factors.

Modelling
Logistic Regression
Transformation to percentiles:

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept) -1.626143    0.365680   -4.447  8.71e-06 ***
gdp_gr      -1.097438    0.312746   -3.509   0.00045 ***
inf          1.149994    0.396555    2.900   0.00373 **
mch         -0.005987    0.380347   -0.016   0.98744
ted_res      3.788389    0.335819   11.281   < 2e-16 ***
cac_res     -0.593096    0.310832   -1.908   0.05638 .

Transformation to mean 0 and standard deviation 1:

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  0.17612     0.08886    1.982    0.0475 *
gdp_gr      -0.39488     0.08385   -4.709  2.49e-06 ***
inf         -0.30018     0.24625   -1.219    0.2228
mch          0.36396     0.29429    1.237    0.2162
ted_res      1.86458     0.20005    9.321   < 2e-16 ***
cac_res     -0.14948     0.13882   -1.077    0.2816

From both logistic regressions we can see that the GDP growth rate and external debt are the most significant factors affecting the C12 factor. This agrees with the observations made during data exploration, where we concluded that GDP and external debt are important in differentiating between crisis and tranquil periods.
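For reference, a minimal sketch of the fit behind these coefficient tables, shown for the scaled training frame from the Code section (the percentile variant is identical, run on the percentile-transformed data):

# Logistic regression on the training set; summary() prints the
# Estimate / Std. Error / z value / Pr(>|z|) table shown above.
lr.fit <- glm(C12 ~ gdp_gr + inf + mch + ted_res + cac_res,
              family = binomial, data = CC.dat.train)
summary(lr.fit)
C12.lr.prob <- predict(lr.fit, CC.dat.test, type = "response")  # crisis probabilities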

Decision Tree
Transformation to percentiles:

It seems that the cross-validation pruned the tree, removing the branches below the inflation split at 0.486. In other words it reduced the complexity of the tree, at the cost of some training accuracy; it is important not to overfit our data.

Transformation to mean 0 and standard deviation 1:

Once again the cross-validation pruned the tree, but it left more branches than before. This time it pruned the branches reached after the split on ted_res at -0.1779.
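Both prunings rely on the 10-fold cross-validation that rpart performs internally; the selection step, exactly as in the Code section below, is:

# choose the complexity parameter with the lowest cross-validated error
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pfit <- prune(fit, cp = best.cp)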

Support Vector Machine (SVM)


Tested values: gamma = c(0.2, 0.3, 0.5, 0.7, 1.2, 1.5), cost = c(0.5, 0.8, 1, 10^(1:2))

Transformation to percentiles:

Parameters:
  SVM-Type:    C-classification
  SVM-Kernel:  radial
  cost:        1
  gamma:       0.2

Number of Support Vectors: 540 ( 266 274 )
Number of Classes: 2
Levels: 0 1

10-fold cross-validation on training data:
Total Accuracy: 70.38835
Single Accuracies: 62.19512 75.60976 65.06024 75.60976 67.46988 75.60976 69.5122 67.46988 74.39024 71.08434

Transformation to mean 0 and standard deviation 1:


Parameters:
  SVM-Type:    C-classification
  SVM-Kernel:  radial
  cost:        1
  gamma:       1.5

Number of Support Vectors: 592 ( 266 326 )
Number of Classes: 2
Levels: 0 1

10-fold cross-validation on training data:
Total Accuracy: 71.23786
Single Accuracies: 69.5122 71.95122 73.49398 73.17073 62.6506 75.60976 75.60976 68.6747 73.17073 68.6747

In general we do not get better results with the SVM, even after testing different cost and gamma values. A strange result is that Panama shows up with an 80% probability of being in a pre-crisis period, which is odd considering that its other indicators do not suggest anything like that, nor did it show up in our best model.
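A sketch of the grid search behind the tested gamma and cost values listed above, assuming e1071's tune.svm and the scaled training set from the Code section:

# grid search over the tested gamma/cost values
svm.tune <- tune.svm(C12 ~ gdp_gr + inf + mch + ted_res + cac_res,
                     data = CC.dat.train,
                     gamma = c(0.2, 0.3, 0.5, 0.7, 1.2, 1.5),
                     cost  = c(0.5, 0.8, 1, 10^(1:2)))
# refit with the best parameters; cross = 10 yields the single accuracies above
svm.fit <- svm(C12 ~ gdp_gr + inf + mch + ted_res + cac_res,
               data = CC.dat.train, probability = TRUE,
               gamma = svm.tune$best.parameters$gamma,
               cost  = svm.tune$best.parameters$cost, cross = 10)
summary(svm.fit)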

Evaluation
For each data transformation we evaluate every model and choose the one that provides the maximum usefulness for policy makers. We primarily consider the maximum usefulness for a policy maker who is equally concerned about both outcomes (mu=0,50). If two threshold values tie on this criterion, we choose the threshold that maximizes usefulness for the policy maker more concerned about false negatives, because falsely considering a country economically stable has more dreadful implications than considering a stable country unstable. In general we pay extra attention to mu=0,60 in order to choose a model that is better at predicting crises.
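The usefulness measure itself is computed in eval.R, which is not reproduced in this report. As a hypothetical sketch, an Alessi-Detken-style measure of the following form, halved, appears to reproduce the mu columns in the tables below (the halving is an observed scaling, not confirmed by the source):

usefulness <- function(TP, TN, FP, FN, mu = 0.5) {
  # expected loss of a policy maker who weights missed crises (FN) by mu
  # and false alarms (FP) by 1 - mu
  loss <- mu * FN / (FN + TP) + (1 - mu) * FP / (FP + TN)
  # gain over the best uninformed guess min(mu, 1 - mu); positive values
  # mean the model beats that guess. The /2 matches the tables' scaling.
  (min(mu, 1 - mu) - loss) / 2
}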

Logistic Regression
Transforming data to percentiles:

Train set:
Threshold  TN   TP   FN   FP  RP     RN     PP     PN     ACC    mu=0,40  mu=0,50  mu=0,60  AUC
0,57       329  272  140  83  0,660  0,799  0,766  0,701  0,729  0,072    0,115    0,058    0,799
0,58       333  268  144  79  0,650  0,808  0,772  0,698  0,729  0,073    0,115    0,057    0,799

Test set:
Threshold  TN   TP  FN  FP  RP     RN     PP     PN     ACC    mu=0,40  mu=0,50  mu=0,60  AUC
0,57       105  10  17  6   0,370  0,946  0,625  0,861  0,833  0,058    0,079    0,000    0,747
0,58       105  10  17  6   0,370  0,946  0,625  0,861  0,833  0,058    0,079    0,000    0,747

Scaling data to mean 0 and standard deviation 1:

Train set:
Threshold  TN   TP   FN  FP   RP     RN     PP     PN     ACC    mu=0,40  mu=0,50  mu=0,60  AUC
0,45       275  317  95  137  0,769  0,667  0,698  0,743  0,718  0,054    0,109    0,064    0,793

Test set:
Threshold  TN  TP  FN  FP  RP     RN     PP     PN     ACC    mu=0,40  mu=0,50  mu=0,60  AUC
0,45       93  14  13  18  0,519  0,838  0,438  0,877  0,775  0,055    0,089    0,023    0,729

Logistic regression has fairly good results overall. It is not affected by the transformation of the input variables as much as the other methods, so if we want a model that does not depend much on the choice of transformation, logistic regression would be the one.
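As an illustration, the scaled-data test row above could be roughly reproduced from the predicted probabilities, using C12.lr.prob and usefulness() from the earlier sketches:

pred   <- as.integer(C12.lr.prob > 0.45)              # threshold from the table above
actual <- as.integer(as.character(CC.dat.test$C12))   # factor levels "0"/"1" -> 0/1
TP <- sum(pred == 1 & actual == 1); TN <- sum(pred == 0 & actual == 0)
FP <- sum(pred == 1 & actual == 0); FN <- sum(pred == 0 & actual == 1)
c(ACC = (TP + TN) / length(actual),                   # should be close to 0,775
  U = usefulness(TP, TN, FP, FN, mu = 0.5))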

Decision Tree
Transforming data to percentiles:

Train set:
Threshold  TN   TP   FN   FP  RP     RN     PP     PN     ACC    mu=0,40  mu=0,50  mu=0,60  AUC
0,57       335  279  133  77  0,677  0,813  0,784  0,716  0,745  0,079    0,123    0,066    0,793

Test set:
Threshold  TN   TP  FN  FP  RP     RN     PP     PN     ACC    mu=0,40  mu=0,50  mu=0,60  AUC
0,57       103  11  16  8   0,407  0,928  0,579  0,866  0,826  0,060    0,084    0,008    0,791

Scaling data to mean 0 and standard deviation 1:

Train set:
Threshold  TN   TP   FN  FP  RP     RN     PP     PN     ACC    mu=0,40  mu=0,50  mu=0,60  AUC
0,50       318  321  91  94  0,779  0,772  0,773  0,778  0,775  0,087    0,138    0,088    0,806

Test set:
Threshold  TN  TP  FN  FP  RP     RN     PP     PN     ACC    mu=0,40  mu=0,50  mu=0,60  AUC
0,50       91  17  10  20  0,630  0,820  0,459  0,901  0,783  0,072    0,112    0,053    0,744

The decision tree with the second data transformation provides the best usefulness on the test set. Its usefulness is always positive, meaning that in every case (a policy maker equally concerned about both outcomes, one more concerned about crises, and one more concerned about tranquil periods) the model beats the policy maker's best uninformed guess. The usefulness for each of these three policy makers is also maximized by this model, so it passes our evaluation with flying colors.

Support Vector Machine


Transforming data to percentiles:

Train set:
Threshold  TN   TP   FN  FP   RP     RN     PP     PN     ACC    mu=0,40  mu=0,50  mu=0,60  AUC
0,48       305  322  90  107  0,782  0,740  0,751  0,772  0,761  0,078    0,130    0,083    0,843

Test set:
Threshold  TN  TP  FN  FP  RP     RN     PP     PN     ACC    mu=0,40  mu=0,50  mu=0,60  AUC
0,48       95  16  11  16  0,593  0,856  0,500  0,896  0,804  0,075    0,112    0,049    0,811

Scaling data to mean 0 and standard deviation 1:

Train set:
Threshold  TN   TP   FN   FP  RP     RN     PP     PN     ACC    mu=0,40  mu=0,50  mu=0,60  AUC
0,57       345  291  121  67  0,706  0,837  0,813  0,740  0,772  0,092    0,136    0,079    0,836
0,59       355  281  131  57  0,682  0,862  0,831  0,730  0,772  0,095    0,136    0,077    0,836

Test set:
Threshold  TN   TP  FN  FP  RP     RN     PP     PN     ACC    mu=0,40  mu=0,50  mu=0,60  AUC
0,57       98   8   19  13  0,296  0,883  0,381  0,838  0,768  0,024    0,045    -0,035   0,719
0,59       102  6   21  9   0,222  0,919  0,400  0,829  0,783  0,020    0,035    -0,050   0,719

In general the SVMs have the best performance on the train sets but fail badly on the test sets. In particular, the SVM with the scaling transformation has a negative usefulness for a policy maker more concerned about crises, meaning its prediction is worse than that policy maker's best uninformed guess.

Regarding the data transformations, scaling to mean 0 and standard deviation 1 provides better results, except for the support vector machines. SVMs are good at capturing the attributes and behavior of the data, as they perform better on the train set, but their performance drops considerably when predicting the class of the test data. Logistic regression gives more stable results. The SVM results surprised me: since SVMs can capture the underlying distribution, they should not show such deviations. The data exploration partly justifies it, though, since the data appear very noisy.

Deployment
We concluded that the best model on this occasion is the decision tree with data scaled to mean 0 and standard deviation 1. Below is the code, followed by the results of our prediction.

Code
############### INITIALIZATION ###############
library("e1071")
library("rpart")
library("caret")
setwd()   # working directory path omitted in the original
source("eval.R")
## End of initialization ##
##############################################

# Scale data to mean 0 and standard deviation 1, and divide into train and test sets
# (input_data is assumed to have been loaded beforehand)
CC.dat <- input_data
CC.dat$C12 <- as.factor(CC.dat$C12)
CC.dat[,3:7] <- scale(CC.dat[,3:7])
CC.dat.train <- CC.dat[1:824,]
CC.dat.test <- CC.dat[825:962,]
CC.dat.predict <- CC.dat[963:1008,]

# grow decision tree and plot it
windows()
fit <- rpart(C12 ~ gdp_gr + inf + mch + ted_res + cac_res,
             method="class", data=CC.dat.train)
plot(fit, uniform=TRUE, main="Classification Tree for debt crises")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

# prune the decision tree not to overfit (uses 10-fold cross validation)
windows()
pfit <- prune(fit, cp=fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])
plot(pfit, uniform=TRUE, main="Pruned Classification Tree for debt crises")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)

# Probabilities on 2012 data; cannot be evaluated since no knowledge on future crises is available
C12.fit.predict <- predict(pfit, CC.dat.predict)[,2]

# deploy
CC.dat.predict[,9] <- C12.fit.predict
CC.dat.predict
write.csv(CC.dat.predict, file="predict.csv", row.names=F)

Results
Country             Year  gdp_gr    inf       mch       ted_res   cac_res   C12  Prob
Argentina           2012   1,14886  -0,12859  -0,03188   0,58162   0,94781  0    0,87755
Dominican Republic  2012  -0,93375  -0,10201   0,00396   2,07632   4,00402  0    0,87755
Ecuador             2012  -0,07267  -0,13903  -0,14860   0,91126  -0,09373  0    0,87755
Brazil              2012  -0,51322  -0,12593  -0,16718  -0,25898   0,42708  0    0,66875
Jamaica             2012  -0,21285  -0,12954  -0,16130  -0,27134  -0,36547  0    0,66875
Turkey              2012   0,56813  -0,10599  -0,05682  -0,34084   0,09202  0    0,66875
Uruguay             2012  -0,11272  -0,11720  -0,08157  -0,15495   0,36478  0    0,66875
Bolivia             2012  -0,11272  -0,14777  -0,12411  -0,20999   0,37301  0    0,41250
Estonia             2012   0,40793  -0,15156  -0,14198  -0,22412  -0,62336  0    0,41250
Hungary             2012  -0,01260  -0,14511  -0,14939  -0,40299  -0,29821  0    0,41250
Indonesia           2012   0,36788  -0,14112  -0,12965  -0,38803   0,56639  0    0,41250
Kazakhstan          2012   1,24898  -0,14188  -0,10674  -0,27996   0,32858  0    0,41250
Latvia              2012   0,88853  -0,14853  -0,12861  -0,14375  -0,32082  0    0,41250
Panama              2012   0,24773  -0,15213  -0,15013   0,21163  -0,14467  0    0,41250
Philippines         2012   0,32783  -0,14834  -0,15417  -0,38751   0,55583  0    0,41250
Tunisia             2012   0,50806  -0,14872  -0,16334  -0,21916   0,07199  0    0,41250

The rest of the table contains countries with a probability of 20% or below of undergoing a crisis next year. According to the C12 factor from 2011, the countries in a pre-crisis period are Argentina, Bolivia, Brazil, Indonesia, Jordan, Pakistan, Turkey and Uruguay. According to our results, Argentina will most probably remain in a pre-crisis period, meaning its economy is still unstable and indicating the highest probability of an economic crisis. Next in order I would place Turkey, Brazil and Uruguay, since they were already in a pre-crisis period and have relatively high indicators of becoming even more unstable. Although the Dominican Republic shows crisis behavior, I would not worry too much, since it has quite a high current account; it is most probably classified that way because of its GDP growth and inflation rate, and even its money growth is positive. I would not worry much about the other countries either, since they have a low probability of a crisis and their economic attributes look fine: their GDP is growing at a good rate and their external debt is declining.
