
PROJECT - 9

Finance & Risk Analysis


1. Project Objective
The objective of the project is to build an India credit risk (default) model using the given training
dataset and to validate it on the holdout dataset. A logistic regression framework is used to
develop the credit default model.

The data provided in raw-data comprises financial data.

The following process is followed:

1. Exploratory Data Analysis (EDA)
   a. Outlier treatment
   b. Missing value treatment
   c. Creation of new variables for profitability, leverage and liquidity
   d. Univariate and bivariate analysis
2. Modelling
   a. A logistic regression model is built on the important variables
   b. The coefficients of the important variables are analyzed
3. Model Performance Measures
   a. The accuracy of the model is measured on both the training and holdout datasets
   b. The data is sorted in descending order of predicted probability of default and divided into 10 deciles

2. Directory and dataset creation

2.1.1. Install necessary Packages and Invoke Libraries


The necessary packages were installed and the associated libraries were invoked. Keeping all the
library calls in one place improves code readability.
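
A minimal sketch of the library calls, assuming the packages that the functions used later in this report come from (readxl for read_excel, DataExplorer for the plot_* EDA functions, scales for squish and percent, data.table for the decile tables, SDMTools for confusion.matrix and pROC for roc):

# install.packages(c("readxl", "DataExplorer", "scales", "data.table", "SDMTools", "pROC"))
library(readxl)        # read_excel()
library(DataExplorer)  # plot_intro(), plot_histogram(), plot_correlation(), ...
library(scales)        # squish(), percent()
library(data.table)    # data.table() for the decile/rank tables
library(SDMTools)      # confusion.matrix()
library(pROC)          # roc()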

2.1.2. Set up working Directory


Setting the working directory at the start of an R session makes importing and
exporting data files and code files easier. The working directory is simply the
folder on the PC that holds the data, code and other files related to the
project.
2.1.3. Import and Read the Dataset
3. Exploratory Data Analysis

3.1. Importing Dataset

Setwd(("C:\Users\Bhumika\Documents\Analytics\Project – 9’’)
Raw_datatrain <- read_excel("C:\Users\Bhumika\Documents\Analytics\Project – 9\raw-data.xlsx")
Validation_datatest <- read_excel("C:\Users\Bhumika\Documents\Analytics\Project - 9/validation_data.xlsx")
dim(raw_datatrain)

[1] 3541 52
The dataset contains 3541 observations and 52 variables.

> names(raw_datatrain)
[1] "Num" "Networth Next Year"
[3] "Total assets" "Net worth"
[5] "Total income" "Change in stock"
[7] "Total expenses" "Profit after tax"
[9] "PBDITA" "PBT"
[11] "Cash profit" "PBDITA as % of total income"
[13] "PBT as % of total income" "PAT as % of total income"
[15] "Cash profit as % of total income" "PAT as % of net worth"
[17] "Sales" "Income from financial services"
[19] "Other income" "Total capital"
[21] "Reserves and funds" "Deposits (accepted by commercial
banks)"
[23] "Borrowings" "Current liabilities & provisions"
[25] "Deferred tax liability" "Shareholders funds"
[27] "Cumulative retained profits" "Capital employed"
[29] "TOL/TNW" "Total term liabilities / tangible net worth"
[31] "Contingent liabilities / Net worth (%)" "Contingent liabilities"
[33] "Net fixed assets" "Investments"
[35] "Current assets" "Net working capital"
[37] "Quick ratio (times)" "Current ratio (times)"
[39] "Debt to equity ratio (times)" "Cash to current liabilities (times)"
[41] "Cash to average cost of sales per day" "Creditors turnover"
[43] "Debtors turnover" "Finished goods turnover"
[45] "WIP turnover" "Raw material turnover"
[47] "Shares outstanding" "Equity face value"
[49] "EPS" "Adjusted EPS"
[51] "Total liabilities" "PE on BSE"

> dim(validation_datatest)
[1] 715 52
> names(validation_datatest)
[1] "Num" "Default - 1"
[3] "Total assets" "Net worth"
[5] "Total income" "Change in stock"
[7] "Total expenses" "Profit after tax"
[9] "PBDITA" "PBT"
[11] "Cash profit" "PBDITA as % of total income"
[13] "PBT as % of total income" "PAT as % of total income"
[15] "Cash profit as % of total income" "PAT as % of net worth"
[17] "Sales" "Income from financial services"
[19] "Other income" "Total capital"
[21] "Reserves and funds" "Deposits (accepted by commercial
banks)"
[23] "Borrowings" "Current liabilities & provisions"
[25] "Deferred tax liability" "Shareholders funds"
[27] "Cumulative retained profits" "Capital employed"
[29] "TOL/TNW" "Total term liabilities / tangible net worth"
[31] "Contingent liabilities / Net worth (%)" "Contingent liabilities"
[33] "Net fixed assets" "Investments"
[35] "Current assets" "Net working capital"
[37] "Quick ratio (times)" "Current ratio (times)"
[39] "Debt to equity ratio (times)" "Cash to current liabilities (times)"
[41] "Cash to average cost of sales per day" "Creditors turnover"
[43] "Debtors turnover" "Finished goods turnover"
[45] "WIP turnover" "Raw material turnover"
[47] "Shares outstanding" "Equity face value"
[49] "EPS" "Adjusted EPS"
[51] "Total liabilities" "PE on BSE"

> newtrain <- raw_datatrain
> newtest <- validation_datatest

The training dataset does not have a default variable, so one is created from the observations of
the 'Networth Next Year' variable. Firms that will have a negative net worth next year are
expected to be likely to default, so negative observations of 'Networth Next Year' are coded as 1
in the Default variable and non-negative observations as 0.
>newtrain$Default <- ifelse(newtrain$'Networth Next Year' < 0 ,1,0)

Companies with total assets of 3 or less are removed from further analysis.
>newtrain <- newtrain[newtrain$`Total assets` > 3, ]
Missing value treatment

> newtrain<-as.data.frame(newtrain)
> for(i in 1:length(newtrain)){
+ print(paste(colnames(newtrain[i]),class(newtrain[,i])))}

[2] "Num numeric"


[1] "Networth Next Year numeric"
[1] "Total assets numeric"
[1] "Net worth numeric"
[1] "Total income numeric"
[1] "Change in stock numeric"
[1] "Total expenses numeric"
[1] "Profit after tax numeric"
[1] "PBDITA numeric"
[1] "PBT numeric"
[1] "Cash profit numeric"
[1] "PBDITA as % of total income numeric"
[1] "PBT as % of total income numeric"
[1] "PAT as % of total income numeric"
[1] "Cash profit as % of total income numeric"
[1] "PAT as % of net worth numeric"
[1] "Sales numeric"
[1] "Income from financial services numeric"
[1] "Other income numeric"
[1] "Total capital numeric"
[1] "Reserves and funds numeric"
[1] "Deposits (accepted by commercial banks) logical"
[1] "Borrowings numeric"
[1] "Current liabilities & provisions numeric"
[1] "Deferred tax liability numeric"
[1] "Shareholders funds numeric"
[1] "Cumulative retained profits numeric"
[1] "Capital employed numeric"
[1] "TOL/TNW numeric"
[1] "Total term liabilities / tangible net worth numeric"
[1] "Contingent liabilities / Net worth (%) numeric"
[1] "Contingent liabilities numeric"
[1] "Net fixed assets numeric"
[1] "Investments numeric"
[1] "Current assets numeric"
[1] "Net working capital numeric"
[1] "Quick ratio (times) numeric"
[1] "Current ratio (times) numeric"
[1] "Debt to equity ratio (times) numeric"
[1] "Cash to current liabilities (times) numeric"
[1] "Cash to average cost of sales per day numeric"
[1] "Creditors turnover character"
[1] "Debtors turnover character"
[1] "Finished goods turnover character"
[1] "WIP turnover character"
[1] "Raw material turnover character"
[1] "Shares outstanding character"
[1] "Equity face value character"
[1] "EPS numeric"
[1] "Adjusted EPS numeric"
[1] "Total liabilities numeric"
[1] "PE on BSE character"
[1] "Default numeric"

This plot shows that the training dataset has 6.8% missing observations and
1.9% missing columns.
>plot_intro(newtrain)

The variables of type character are converted to numeric, and the missing observations in each
column are replaced with the median of that column, for the whole training dataset.

>for(i in 1:ncol(newtrain)){
newtrain[,i] <- as.numeric(newtrain[,i])
newtrain[is.na(newtrain[,i]), i] <- median(newtrain[,i], na.rm = TRUE)
}
The training dataset has missing observations as well as a fully missing column. The missing
observations are replaced with the median of their column, and the fully missing column
('Deposits (accepted by commercial banks)', column 22) is removed from the dataset.
>newtrain <- newtrain[,-22]

Running the plot again shows that the training dataset no longer has any missing
observations or columns.
>plot_intro(newtrain)

Similarly, the testing dataset also has variables of type character.
> newtest<-as.data.frame(newtest)
> for(i in 1:length(newtest)){
+ print(paste(colnames(newtest[i]),class(newtest[,i])))}
[1] "Num numeric"
[1] "Default - 1 numeric"
[1] "Total assets numeric"
[1] "Net worth numeric"
[1] "Total income numeric"
[1] "Change in stock numeric"
[1] "Total expenses numeric"
[1] "Profit after tax numeric"
[1] "PBDITA numeric"
[1] "PBT numeric"
[1] "Cash profit numeric"
[1] "PBDITA as % of total income numeric"
[1] "PBT as % of total income numeric"
[1] "PAT as % of total income numeric"
[1] "Cash profit as % of total income numeric"
[1] "PAT as % of net worth numeric"
[1] "Sales numeric"
[1] "Income from financial services numeric"
[1] "Other income numeric"
[1] "Total capital numeric"
[1] "Reserves and funds numeric"
[1] "Deposits (accepted by commercial banks) logical"
[1] "Borrowings numeric"
[1] "Current liabilities & provisions numeric"
[1] "Deferred tax liability numeric"
[1] "Shareholders funds numeric"
[1] "Cumulative retained profits numeric"
[1] "Capital employed numeric"
[1] "TOL/TNW numeric"
[1] "Total term liabilities / tangible net worth numeric"
[1] "Contingent liabilities / Net worth (%) numeric"
[1] "Contingent liabilities numeric"
[1] "Net fixed assets numeric"
[1] "Investments numeric"
[1] "Current assets numeric"
[1] "Net working capital numeric"
[1] "Quick ratio (times) numeric"
[1] "Current ratio (times) numeric"
[1] "Debt to equity ratio (times) numeric"
[1] "Cash to current liabilities (times) numeric"
[1] "Cash to average cost of sales per day numeric"
[1] "Creditors turnover character"
[1] "Debtors turnover character"
[1] "Finished goods turnover character"
[1] "WIP turnover character"
[1] "Raw material turnover character"
[1] "Shares outstanding character"
[1] "Equity face value character"
[1] "EPS numeric"
[1] "Adjusted EPS numeric"
[1] "Total liabilities numeric"
[1] "PE on BSE character"

This plot shows that the testing dataset has 7% missing observations and 1.9% missing
columns.
> plot_intro(newtest)

> for(i in 1:ncol(newtest)){
+   newtest[,i] <- as.numeric(newtest[,i])
+   newtest[is.na(newtest[,i]), i] <- median(newtest[,i], na.rm = TRUE)
+ }
The fully missing column (column 22) is removed from the dataset.
>newtest <- newtest[,-22]
This plot shows that the dataset does not contain any missing observations or missing
columns.

>plot_intro(newtest)
Outlier treatment
The outliers in the dataset are treated by replacing observations below the 1st
percentile with the value of the 1st percentile and observations above the 99th
percentile with the value of the 99th percentile. This treatment is applied to every
column in the dataset.

> library(scales)   # squish() used below comes from the scales package
> for(i in 2:ncol(newtrain)){
    q <- quantile(newtrain[,i], c(0.01, 0.99))
    newtrain[,i] <- squish(newtrain[,i], q)
  }

Redundant variables (the serial number Num and, for the training set, 'Networth Next Year') are removed from the training and testing datasets.
>newtrain <- newtrain[,-c(1,2)]
>newtest <- newtest[,-1]
Univariate and Multivariate Analysis
The variables can be explored and analyzed further using univariate and
multivariate analysis.

>plot_str(newtrain)

> plot_intro(newtrain)
>plot_missing(newtrain)

> plot_histogram(newtrain)
>plot_qq(newtrain)
>plot_bar(newtrain)

>plot_correlation(newtrain)
# for new variables
>newtrain$Profitability <- newtrain$`Profit after tax`/newtrain$Sales
>newtrain$PriceperShare <- newtrain$EPS*newtrain$`PE on BSE`

# Liquidity
newtrain$NWC2TA <- newtrain$`Net working capital`/newtrain$`Total assets`

# Leverage
newtrain$TotalEquity <- newtrain$`Total liabilities`/newtrain$`Debt to equity ratio (times)`
newtrain[is.infinite(newtrain[,54]), 54] <- 0
newtrain$EquityMultiplier <- newtrain$`Total assets`/newtrain$TotalEquity
newtrain[is.infinite(newtrain[,55]), 55] <- 0
newtrain$Networth2Totalassets <- newtrain$`Net worth`/newtrain$`Total assets`
newtrain$Totalincome2Totalassets<- newtrain$`Total income`/newtrain$`Total assets`
newtrain$Totalexpenses2Totalassets <-newtrain$`Total expenses`/newtrain$`Total assets`
newtrain$Profitaftertax2Totalassets <-newtrain$`Profit after tax`/newtrain$`Total assets`
newtrain$PBT2Totalassets <-newtrain$PBT/newtrain$`Total assets`
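
The second model formula below also uses Sales2Totalassets, Capitalemployed2Totalassets and Investments2Totalassets, which are not created above, and the holdout set needs the same derived variables before predictions can be made on it. A minimal sketch, assuming these ratios follow the same value-to-total-assets pattern as the other new variables:

# Assumed construction of the remaining ratio variables referenced in the model formula
newtrain$Sales2Totalassets <- newtrain$Sales/newtrain$`Total assets`
newtrain$Capitalemployed2Totalassets <- newtrain$`Capital employed`/newtrain$`Total assets`
newtrain$Investments2Totalassets <- newtrain$Investments/newtrain$`Total assets`

# The holdout set needs the same derived variables before predict() is called on it
newtest$PriceperShare <- newtest$EPS*newtest$`PE on BSE`
newtest$NWC2TA <- newtest$`Net working capital`/newtest$`Total assets`
newtest$Networth2Totalassets <- newtest$`Net worth`/newtest$`Total assets`
newtest$Sales2Totalassets <- newtest$Sales/newtest$`Total assets`
newtest$Capitalemployed2Totalassets <- newtest$`Capital employed`/newtest$`Total assets`
newtest$Investments2Totalassets <- newtest$Investments/newtest$`Total assets`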

# for multicollinearity (vif() from the usdm package is used here, since it accepts a data frame of predictors)
library(usdm)
vif(newtrain[, setdiff(names(newtrain), "Default")])
for(i in 1:length(newtrain)){
print(paste(colnames(newtrain[i]),class(newtrain[,i])))}
[1] "Total assets tbl_df" "Total assets tbl" "Total assets data.frame"
[1] "Net worth tbl_df" "Net worth tbl" "Net worth data.frame"
[1] "Total income tbl_df" "Total income tbl" "Total income data.frame"
[1] "Change in stock tbl_df" "Change in stock tbl" "Change in stock data.frame"
[1] "Total expenses tbl_df" "Total expenses tbl" "Total expenses data.frame"
[1] "Profit after tax tbl_df" "Profit after tax tbl" "Profit after tax data.frame"
[1] "PBDITA tbl_df" "PBDITA tbl" "PBDITA data.frame"
[1] "PBT tbl_df" "PBT tbl" "PBT data.frame"
[1] "Cash profit tbl_df" "Cash profit tbl" "Cash profit data.frame"
[1] "PBDITA as % of total income tbl_df" "PBDITA as % of total income tbl"
[3] "PBDITA as % of total income data.frame"
[1] "PBT as % of total income tbl_df" "PBT as % of total income tbl"
[3] "PBT as % of total income data.frame"
[1] "PAT as % of total income tbl_df" "PAT as % of total income tbl"
[3] "PAT as % of total income data.frame"
[1] "Cash profit as % of total income tbl_df" "Cash profit as % of total income tbl"
[3] "Cash profit as % of total income data.frame"
[1] "PAT as % of net worth tbl_df" "PAT as % of net worth tbl" "PAT as % of net worth data.frame"
[1] "Sales tbl_df" "Sales tbl" "Sales data.frame"
[1] "Income from financial services tbl_df" "Income from financial services tbl"
[3] "Income from financial services data.frame"
[1] "Other income tbl_df" "Other income tbl" "Other income data.frame"
[1] "Total capital tbl_df" "Total capital tbl" "Total capital data.frame"
[1] "Reserves and funds tbl_df" "Reserves and funds tbl" "Reserves and funds data.frame"
[1] "Borrowings tbl_df" "Borrowings tbl" "Borrowings data.frame"
[1] "Current liabilities & provisions tbl_df" "Current liabilities & provisions tbl"
[3] "Current liabilities & provisions data.frame"
[1] "Deferred tax liability tbl_df" "Deferred tax liability tbl" "Deferred tax liability data.frame"
[1] "Shareholders funds tbl_df" "Shareholders funds tbl" "Shareholders funds data.frame"
[1] "Cumulative retained profits tbl_df" "Cumulative retained profits tbl"
[3] "Cumulative retained profits data.frame"
[1] "Capital employed tbl_df" "Capital employed tbl" "Capital employed data.frame"
[1] "TOL/TNW tbl_df" "TOL/TNW tbl" "TOL/TNW data.frame"
[1] "Total term liabilities / tangible net worth tbl_df"
[2] "Total term liabilities / tangible net worth tbl"
[3] "Total term liabilities / tangible net worth data.frame"
[1] "Contingent liabilities / Net worth (%) tbl_df" "Contingent liabilities / Net worth (%) tbl"
[3] "Contingent liabilities / Net worth (%) data.frame"
[1] "Contingent liabilities tbl_df" "Contingent liabilities tbl" "Contingent liabilities data.frame"
[1] "Net fixed assets tbl_df" "Net fixed assets tbl" "Net fixed assets data.frame"
[1] "Investments tbl_df" "Investments tbl" "Investments data.frame"
[1] "Current assets tbl_df" "Current assets tbl" "Current assets data.frame"
[1] "Net working capital tbl_df" "Net working capital tbl" "Net working capital data.frame"
[1] "Quick ratio (times) tbl_df" "Quick ratio (times) tbl" "Quick ratio (times) data.frame"
[1] "Current ratio (times) tbl_df" "Current ratio (times) tbl" "Current ratio (times) data.frame"
[1] "Debt to equity ratio (times) tbl_df" "Debt to equity ratio (times) tbl"
[3] "Debt to equity ratio (times) data.frame"
[1] "Cash to current liabilities (times) tbl_df" "Cash to current liabilities (times) tbl"
[3] "Cash to current liabilities (times) data.frame"
[1] "Cash to average cost of sales per day tbl_df" "Cash to average cost of sales per day tbl"
[3] "Cash to average cost of sales per day data.frame"
[1] "Creditors turnover tbl_df" "Creditors turnover tbl" "Creditors turnover data.frame"
[1] "Debtors turnover tbl_df" "Debtors turnover tbl" "Debtors turnover data.frame"
[1] "Finished goods turnover tbl_df" "Finished goods turnover tbl"
[3] "Finished goods turnover data.frame"
[1] "WIP turnover tbl_df" "WIP turnover tbl" "WIP turnover data.frame"
[1] "Raw material turnover tbl_df" "Raw material turnover tbl" "Raw material turnover data.frame"
[1] "Shares outstanding tbl_df" "Shares outstanding tbl" "Shares outstanding data.frame"
[1] "Equity face value tbl_df" "Equity face value tbl" "Equity face value data.frame"
[1] "EPS tbl_df" "EPS tbl" "EPS data.frame"
[1] "Adjusted EPS tbl_df" "Adjusted EPS tbl" "Adjusted EPS data.frame"
[1] "Total liabilities tbl_df" "Total liabilities tbl" "Total liabilities data.frame"
[1] "PE on BSE tbl_df" "PE on BSE tbl" "PE on BSE data.frame"
[1] "Default tbl_df" "Default tbl" "Default data.frame"
[1] "Profitability tbl_df" "Profitability tbl" "Profitability data.frame"
[1] "NWC2TA tbl_df" "NWC2TA tbl" "NWC2TA data.frame"
[1] "TotalEquity tbl_df" "TotalEquity tbl" "TotalEquity data.frame"
[1] "EquityMultiplier tbl_df" "EquityMultiplier tbl" "EquityMultiplier data.frame"
[1] "Networth2Totalassets tbl_df" "Networth2Totalassets tbl" "Networth2Totalassets data.frame"
[1] "Totalincome2Totalassets tbl_df" "Totalincome2Totalassets tbl"
[3] "Totalincome2Totalassets data.frame"
[1] "Totalexpenses2Totalassets tbl_df" "Totalexpenses2Totalassets tbl"
[3] "Totalexpenses2Totalassets data.frame"
[1] "Profitaftertax2Totalassets tbl_df" "Profitaftertax2Totalassets tbl"
[3] "Profitaftertax2Totalassets data.frame"
[1] "PBT2Totalassets tbl_df" "PBT2Totalassets tbl" "PBT2Totalassets data.frame"

# Logistic Regression
trainLOGIT <- glm(Default ~ . - Profitability, data = newtrain, family = binomial)
summary(trainLOGIT)

trainLOGIT <- glm(Default ~ `Total assets` + `Total income` + `Change in stock` + `Total expenses` +
                    `Profit after tax` + PBDITA + `Cash profit` + `PBDITA as % of total income` +
                    `PBT as % of total income` + `PAT as % of total income` +
                    `Cash profit as % of total income` + `PAT as % of net worth` + `Total capital` +
                    `Reserves and funds` + Borrowings + `Current liabilities & provisions` +
                    `Capital employed` + `Total term liabilities / tangible net worth` +
                    `Contingent liabilities` + `Current ratio (times)` + Investments +
                    `Finished goods turnover` + `TOL/TNW` + `PE on BSE` + `Net fixed assets` +
                    `Debt to equity ratio (times)` + `Cash to average cost of sales per day` +
                    PriceperShare + NWC2TA + Networth2Totalassets + Sales2Totalassets +
                    Capitalemployed2Totalassets + Investments2Totalassets,
                  data = newtrain, family = binomial)
summary(trainLOGIT)

# model validation
library(pROC)
library(SDMTools)   # confusion.matrix() below comes from the SDMTools package
PredLOGIT <- predict.glm(trainLOGIT, newdata=newtrain, type="response")
tab.logit<-confusion.matrix(newtrain$Default,PredLOGIT,threshold = 0.5)
tab.logit
accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)
accuracy.logit
[1] 0.9591719
roc.logit<-roc(newtrain$Default,PredLOGIT )
roc.logit

plot(roc.logit)
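
The printed roc object reports the area under the curve; it can also be read off directly with pROC's auc() function:

# Area under the ROC curve for the training-set predictions
auc(roc.logit)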

PredLOGIT <- predict.glm(trainLOGIT, newdata=newtest, type="response")


tab.logit<-confusion.matrix(newtest$`Default - 1`,PredLOGIT,threshold = 0.5)
tab.logit
accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)
accuracy.logit
roc.logit<-roc(newtest$`Default - 1`,PredLOGIT )
roc.logit
plot(roc.logit)

# train data deciling
newtrain$pred <- predict(trainLOGIT, newtrain, type = "response")

decile <- function(x)
{
  deciles <- vector(length = 10)
  for (i in seq(0.1, 1, 0.1))
  {
    deciles[i*10] <- quantile(x, i, na.rm = TRUE)
  }
  return(
    ifelse(x < deciles[1], 1,
    ifelse(x < deciles[2], 2,
    ifelse(x < deciles[3], 3,
    ifelse(x < deciles[4], 4,
    ifelse(x < deciles[5], 5,
    ifelse(x < deciles[6], 6,
    ifelse(x < deciles[7], 7,
    ifelse(x < deciles[8], 8,
    ifelse(x < deciles[9], 9, 10))))))))))
}
newtrain$deciles <- decile(newtrain$pred)

library(data.table)   # data.table() and the by-group aggregation below come from data.table
tmp_DT <- data.table(newtrain)

rank <- tmp_DT[, list(cnt = length(Default),
                      cnt_resp = sum(Default == 1),
                      cnt_non_resp = sum(Default == 0)
                      ), by = deciles][order(-deciles)]

rank$rrate <- round(rank$cnt_resp / rank$cnt, 4)
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp), 4)
rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp), 4)
rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100
rank$rrate <- percent(rank$rrate)                 # percent() is from the scales package
rank$cum_rel_resp <- percent(rank$cum_rel_resp)
rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
newtrainRank <- rank
View(rank)

The decile ranks are shown above, sorted in descending order. The 10th decile has the maximum
number of defaults, shown as cnt_resp.
The holdout dataset is then also divided into 10 deciles based on the probability of default.

# test data deciling
newtest$pred <- predict(trainLOGIT, newtest, type = "response")
decile <- function(x)
{
  deciles <- vector(length = 10)
  for (i in seq(0.1, 1, 0.1))
  {
    deciles[i*10] <- quantile(x, i, na.rm = TRUE)
  }
  return(
    ifelse(x < deciles[1], 1,
    ifelse(x < deciles[2], 2,
    ifelse(x < deciles[3], 3,
    ifelse(x < deciles[4], 4,
    ifelse(x < deciles[5], 5,
    ifelse(x < deciles[6], 6,
    ifelse(x < deciles[7], 7,
    ifelse(x < deciles[8], 8,
    ifelse(x < deciles[9], 9, 10))))))))))
}
newtest$deciles <- decile(newtest$pred)
tmp_DT <- data.table(newtest)

rank <- tmp_DT[, list(cnt = length(`Default - 1`),
                      cnt_resp = sum(`Default - 1` == 1),
                      cnt_non_resp = sum(`Default - 1` == 0)
                      ), by = deciles][order(-deciles)]
Analysis
The model classifies defaults in both the training and the holdout datasets with roughly 95% accuracy
(0.9592 on the training data).
Among the most important variables, those with positive coefficient estimates are Total assets,
Current ratio and Sales/Total assets, while those with negative estimates are Cash profit,
PAT as % of net worth, Current liabilities & provisions and Capital employed.
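
These signs can be checked by extracting and sorting the coefficient estimates from the fitted model; a minimal sketch:

# Sort the estimated coefficients so the strongest negative and positive drivers of
# the default probability are easy to read off
coef_tab <- summary(trainLOGIT)$coefficients
sort(coef_tab[, "Estimate"])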
