DMBD Project Proposal - Team 5

Project Proposal
Data Mining for Business Decisions
Team No 5
Kapil Paniker
Shashi B Pandey
Ragil Ravindran
Georgy K Joseph
Supriyo Chakrabarti
Harshad Suryawanshi
IIM Trichy
1301014, 1301041, 1301056, 1301068, 1301102, 1301103
Business Context
Banks play a vital role in economies worldwide. Every economy thrives on
investments and savings. For markets and society to function, individuals,
corporates and governments need access to credit. Banks are the critical
financial intermediaries between savers and investors: they lend the savings
of households as loans to corporates and governments for investment. An
important task for banks is to decide whether a customer can get finance and,
if so, on what terms. These decisions are vital to a bank's business because
credit risk is the most important risk in the banking sector and dictates
financial stability.
Probability of default models are used by banks to decide whether or not to
grant a loan to a customer. Credit scoring is a mechanism that associates a
probability of default with each borrower based on historical data. Using
credit scoring, banks can identify borrowers with a higher probability of
default and take business decisions accordingly.
Business Objective
The objective of this project is to develop a model that predicts whether a
person in the dataset will experience financial distress in the next two
years. Based on this, a credit score will be assigned to each borrower. The
credit score indicates the probability of default: the higher the credit
score, the lower the probability of default, and vice versa.
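One common way to turn a probability of default into a score with this inverse relationship is log-odds scaling. The sketch below is only an illustration: the function name and the anchor values (a score of 600 at odds of 50:1, and 20 points to double the odds) are assumptions, not values chosen in this proposal.

```python
import math

def credit_score(pd_default, base_score=600.0, base_odds=50.0, pdo=20.0):
    """Map a probability of default to a score; a higher score means a lower PD.

    base_score, base_odds and pdo (points to double the odds) are
    hypothetical scaling constants chosen only for illustration.
    """
    if not 0.0 < pd_default < 1.0:
        raise ValueError("pd_default must be strictly between 0 and 1")
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    odds_good = (1.0 - pd_default) / pd_default  # odds of NOT defaulting
    return offset + factor * math.log(odds_good)

# A lower probability of default yields a higher score.
for pd_default in (0.01, 0.05, 0.20):
    print(pd_default, round(credit_score(pd_default), 1))
```

With this scaling, every doubling of the odds of repayment adds a fixed number of points, which keeps the score monotone in the probability of default.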
The benefit of these credit scores is twofold. First, banks will know which
potential customers have the highest probability of default and can avoid
giving them loans altogether. Second, customers with the lowest probability
of default can be charged the lowest interest rate, while those with a
moderate probability of default can be charged a higher interest rate because
they pose greater credit risk (default risk) to the bank.
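The two uses of the score described above can be sketched as a simple tiering rule. The cut-offs (5% and 20%) and the interest rates below are hypothetical placeholders, not figures proposed by the team.

```python
def lending_decision(pd_default, reject_above=0.20, low_risk_below=0.05,
                     base_rate=0.10, risk_premium=0.04):
    """Return (decision, annual_rate) for a given probability of default.

    All thresholds and rates are illustrative assumptions.
    """
    if pd_default >= reject_above:
        return ("reject", None)                      # highest PD: no loan
    if pd_default < low_risk_below:
        return ("approve", base_rate)                # lowest PD: lowest rate
    return ("approve", base_rate + risk_premium)     # moderate PD: higher rate

for pd_default in (0.01, 0.10, 0.50):
    print(pd_default, lending_decision(pd_default))
```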
Mining objectives
Since we have a large dataset, we will try to identify clusters of records
that are homogeneous in certain aspects. These clusters could contain
potential customers who have similar credit risks and hence would have
similar credit ratings. We will also try to identify attributes that are
correlated with each other and attributes that could help us calculate the
probability of default, and we will look at the possibility of deriving new
attributes from existing ones. The end result of this mining exercise will
help us assign a credit score to each customer in the dataset. Based on this
credit score, banks could decide which customers to lend to and which to
avoid. Banks could also charge different interest rates to different
customers based on their credit scores: the higher the credit score, the
lower the interest rate.
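As an illustration of the clustering idea, here is a minimal k-means sketch in plain Python. The proposal does not name a clustering algorithm, so k-means is an assumption, and the six borrowers (utilization, debt ratio) are made-up toy values rather than records from the dataset.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length numeric tuples."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids

# Toy (utilization, debt ratio) pairs, invented for this example.
borrowers = [(0.10, 0.20), (0.90, 0.80), (0.15, 0.25),
             (0.85, 0.90), (0.20, 0.10), (0.95, 0.85)]
labels, centroids = kmeans(borrowers, k=2)
print(labels)
```

On the real data the attributes would need to be put on comparable scales first, because distance-based clustering is dominated by whichever attribute has the largest range.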
Specific questions we seek to answer using Data Mining
The project seeks answers to the following questions:
What is the probability of default for each borrower?
Does the probability of default correlate with customer attributes? In
other words, are all the attributes significant, and does each contribute
to determining the probability of default?
Can customers be grouped into specific clusters based on their credit
and default patterns?
Overview of the data set

Source: https://www.kaggle.com/c/GiveMeSomeCredit/data
No. of attributes: 10
No. of records: 1,50,000
Attributes to be employed in data mining
No. Variable Name (Type): Description
1. RevolvingUtilizationOfUnsecuredLines (percentage): Total balance on credit
   cards and personal lines of credit, except real estate and no installment
   debt like car loans, divided by the sum of credit limits
2. age (integer): Age of borrower in years
3. NumberOfTime3059DaysPastDueNotWorse (integer): Number of times borrower
   has been 30-59 days past due but no worse in the last 2 years
4. DebtRatio (percentage): Monthly debt payments, alimony and living costs
   divided by monthly gross income
5. MonthlyIncome (real): Monthly income
6. NumberOfOpenCreditLinesAndLoans (integer): Number of open loans
   (installment like car loan or mortgage) and lines of credit (e.g. credit
   cards)
7. NumberOfTimes90DaysLate (integer): Number of times borrower has been 90
   days or more past due
8. NumberRealEstateLoansOrLines (integer): Number of mortgage and real estate
   loans including home equity lines of credit
9. NumberOfTime6089DaysPastDueNotWorse (integer): Number of times borrower
   has been 60-89 days past due but no worse in the last 2 years
10. NumberOfDependents (integer): Number of dependents in family excluding
    themselves (spouse, children etc.)
Distribution of various attributes
At this point, no two variables look to be highly correlated. If, during the
process, we find high correlation between any pair of variables, we will look
at options such as deriving a new variable from the two or eliminating the
one with less predictive power.
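A check of this kind can be sketched as a pairwise Pearson correlation screen. The 0.8 cut-off for "high" correlation and the five-row sample below are assumptions made for illustration; the values are invented and are not taken from the dataset.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def highly_correlated(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

# Invented sample values, keyed by attribute names from the table above.
sample = {
    "NumberOfTimes90DaysLate": [0, 1, 2, 3, 4],
    "NumberOfTime6089DaysPastDueNotWorse": [0, 1, 2, 3, 5],
    "age": [45, 30, 52, 28, 61],
}
flagged = highly_correlated(sample)
for name_a, name_b, r in flagged:
    print(name_a, name_b, round(r, 3))
```

Any pair flagged this way would be a candidate for the options above: combining the two into a derived variable, or dropping the weaker one.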
Whether to use Prediction Analysis or Not?
Currently we have no plan to do prediction analysis. However, if further
analysis shows scope for a prediction technique, we will normalize the
dataset before applying it.
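The proposal does not say which normalization would be used. As one possibility, a z-score transform (subtract the mean, divide by the standard deviation) could look like the sketch below; the income figures are invented for the example.

```python
import statistics

def z_score(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / sd for v in values]

# Made-up MonthlyIncome values.
monthly_income = [2000, 4000, 6000]
scaled = z_score(monthly_income)
print([round(v, 3) for v in scaled])
```

Scaling this way puts attributes such as MonthlyIncome and DebtRatio on a comparable footing, which matters for any technique that relies on distances or on the relative size of coefficients.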
Data Mining Methodology: Clustering and affinity analysis will be used to
analyse the data set. Affinity analysis is a data mining technique that
discovers co-occurrence relationships among activities performed by specific
individuals or groups. In the case of loan default, most individuals are
likely to share the same kinds of financial distress, such as no or low
income or no further business prospects. This technique will help group
records that show affinity and identify influential attributes. These
attributes can then be analysed further for their applicability to the credit
appraisal process.
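The co-occurrence relationships that affinity analysis looks for are usually summarised by support, confidence and lift. The sketch below computes these for a single rule; the flags such as high_utilization and the five borrower records are hypothetical, and in practice the numeric attributes would first have to be binned into flags like these.

```python
def rule_metrics(records, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.

    records is a list of sets of flags; antecedent and consequent are sets.
    """
    n = len(records)
    n_a = sum(1 for r in records if antecedent <= r)
    n_c = sum(1 for r in records if consequent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / n
    confidence = n_both / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

# Invented borrower records, each a set of binary flags.
records = [
    {"high_utilization", "late_30_59", "default"},
    {"high_utilization", "default"},
    {"low_income"},
    {"high_utilization", "late_30_59"},
    {"low_income", "default"},
]
support, confidence, lift = rule_metrics(
    records, {"high_utilization"}, {"default"})
print(round(support, 3), round(confidence, 3), round(lift, 3))
```

A lift above 1 suggests the flag co-occurs with default more often than chance would predict, which is the kind of signal that would mark an attribute as influential.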
Recommendation from Analysis results: The results of the data mining project
should identify the critical attributes that are important for the credit
appraisal of individuals and institutions. Using these attributes, a credit
appraisal model can be built that will improve the quality of a bank's
existing credit appraisal process and estimate each customer's probability of
default over the next two years, thereby minimising credit risk.
