
Journal of Management Information Systems / Winter 2008–9, Vol. 25, No. 3, pp. 315–336.
© 2009 M.E. Sharpe, Inc.
0742–1222 / 2009 $9.50 + 0.00.
DOI 10.2753/MIS0742-1222250309
Tuning Data Mining Methods for
Cost-Sensitive Regression: A Study
in Loan Charge-Off Forecasting
GAURAV BANSAL, ATISH P. SINHA, AND HUIMIN ZHAO
Gaurav Bansal is an Assistant Professor of MIS/Statistics at the University of Wisconsin–Green Bay. He earned his Ph.D. in MIS from the University of Wisconsin–Milwaukee, his M.B.A. from Kent State University, and his B.E. in Mechanical Engineering from the Madan Mohan Malaviya Engineering College, University of Gorakhpur, India. His current research interests are in the areas of information privacy, e-commerce, and data mining. He is a member of the AIS.
Atish P. Sinha is a Professor of MIS and Roger L. Fitzsimonds Distinguished Scholar at the Sheldon B. Lubar School of Business, University of Wisconsin–Milwaukee. He earned his Ph.D. in business, with a concentration in artificial intelligence, from the University of Pittsburgh. His current research interests are in the areas of data mining, data warehousing, and component-based software engineering. His research has been published in several journals, including Communications of the ACM, Decision Support Systems, IEEE Transactions on Engineering Management, IEEE Transactions on Software Engineering, IEEE Transactions on Systems, Man, and Cybernetics, Information Systems Research, International Journal of Human–Computer Studies, and Journal of Management Information Systems. Professor Sinha is a member of ACM, AIS, and INFORMS. He served as the cochair of the 16th Workshop on Information Technologies and Systems (WITS) in 2006.
Huimin Zhao is an Associate Professor of MIS at the Sheldon B. Lubar School of Business, University of Wisconsin–Milwaukee. He earned his Ph.D. in MIS from the University of Arizona. His current research interests are in the areas of data mining, data integration, and medical informatics. His research has been published in several journals, including Journal of Management Information Systems, Communications of the ACM, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Systems, Man, and Cybernetics, Information Systems, Data and Knowledge Engineering, Decision Support Systems, and Journal of Database Management. He is a member of IEEE, AIS, IRMA, and INFORMS.
ABSTRACT: Real-world predictive data mining (classification or regression) problems are often cost sensitive, meaning that different types of prediction errors are not equally costly. While cost-sensitive learning methods for classification problems have been extensively studied recently, cost-sensitive regression has not yet been adequately addressed in the data mining literature. In this paper, we first advocate the use of average misprediction cost as a measure for assessing the performance of a cost-sensitive regression model. We then propose an efficient algorithm for tuning a regression model to further reduce its average misprediction cost. In contrast with previous statistical methods, which are tailored to particular cost functions, this algorithm can deal with any convex cost function without modifying the underlying regression methods. We have evaluated the algorithm in bank loan charge-off forecasting, where underforecasting is considered much more costly than overforecasting. Our results show that the proposed algorithm significantly reduces the average misprediction costs of models learned with various base regression methods, such as linear regression, model tree, and neural network. The amount of cost reduction increases as the difference between the unit costs of the two types of errors (overprediction and underprediction) increases.
KEYWORDS AND PHRASES: asymmetric costs, cost-sensitive regression, data mining, forecasting, loan charge-off, model tuning.
ORGANIZATIONS ARE INCREASINGLY EMPLOYING data mining techniques to uncover useful and actionable information from corporate data. The most common type of problem addressed in data mining is prediction, where a dependent variable is predicted based on a set of independent variables. Various data mining techniques have been developed to automatically induce prediction models based on training examples with known outcomes. The trained models can then be applied to predict the outcomes of new problem instances in the future.
There are two types of predictive data mining: classification and regression. In a classification problem, the dependent variable that needs to be predicted is categorical, whereas in a regression problem, the dependent variable is numerical and continuous. The focus of prior research has been on the binary classification problem (where the dependent variable can belong to one of two categories). Studies addressing the binary classification problem in business cover a variety of applications, including workplace Web usage profiling [1], deception detection [28], credit evaluation [18], bankruptcy prediction [13, 20, 21], bank failure prevention [17, 19], and new venture success prediction [11]. Data mining and Bayesian techniques have also been used for the regression problem in applications such as reliability estimation [22], demand forecasting and inventory management [4], and real estate assessment [24].
Real-world prediction problems are often cost sensitive, meaning that different types of prediction errors are not equally costly [8, 14, 22, 26]. In a binary classification problem, the cost of a false positive could be very different from that of a false negative. For example, misclassifying a bankrupt corporation as nonbankrupt is a much more serious mistake than doing the reverse [20, 21]. Similarly, a regression problem may also be characterized by asymmetric costs with respect to overprediction and underprediction. For example, when banks forecast loan losses, underforecasting is considered to be much more costly than overforecasting. In such cost-sensitive forecasting or regression problems, the usual cost-neutral performance measures, such as correlation coefficient or R², relative absolute error, and root relative square error, are not appropriate for assessing the true performance of trained models.
The last decade has witnessed a growing body of work in cost-sensitive classification. Several methods have been developed to convert a regular classification method into a cost-sensitive one. Costs can be directly incorporated into a predictive model during training [6]. For example, Ting [23] used an instance-weighting method for making decision trees cost sensitive, and Tam and Kiang [21] modified the backpropagation algorithm for neural nets by including prior probabilities and misclassification costs. Elkan [8], on the other hand, recommended that we let classifiers learn from the given training data, and then determine optimal decision thresholds empirically. Sinha and May [18], for example, presented an approach for tuning predictive data mining models post hoc to make them cost sensitive. In their study, the models were trained without factoring in costs. The models were tuned, in an effort to minimize misclassification cost, only after they had been generated.
Finally, there are some base classification methods (e.g., naive Bayes) that are inherently cost sensitive [25] and for which both types of cost-sensitive learning result in similar models. Others, such as decision tree methods, learn very different models when cost information is incorporated during training [23] rather than post hoc [27].
There have been a number of studies relating to cost-sensitive classification, but relatively few studies in the data mining literature have addressed the issue of cost-sensitive regression. Classical statistical estimation and prediction methods have been extended to deal with particular asymmetric loss functions, such as linlin, linex, and squarex, which are amenable to closed-form solutions [22, 24, 26]. Varian [24] introduced asymmetric linex cost functions, which rise linearly for underestimation and exponentially for overestimation. The cost function was demonstrated for the appraisal of single-family homes, where the county bases its tax on the appraised value. Underestimation is costly because it leads to lower taxes, but overestimation is even more costly because it leads to complaints and court appeals. Hence, while underestimation was represented by a linear cost function, overestimation was represented by an exponential cost function. The normal regression techniques, which are based on quadratic cost functions, assume the over- and undercosts to be equal, and hence are inappropriate. Zellner [26] also extensively investigated linex across several statistical estimation and prediction problems. Later, Thompson and Basu [22], generalizing the linex cost function [24], introduced the asymmetric squarex cost function, where the cost follows an exponential function on one side of the error curve and a square function on the other side. Recently, Crone et al. [4] proposed a method for training a multilayer perceptron under asymmetric loss. However, generic methods that can convert any base regression method into a cost-sensitive learning method and can deal with any cost function are yet to be developed.
In this paper, we first advocate the use of average misprediction cost as a measure for assessing the performance of a cost-sensitive regression model. We then propose an efficient algorithm for tuning the regression model post hoc to further reduce its average misprediction cost. The algorithm is generic and can deal with any convex cost function without modifying the underlying regression methods. We have evaluated the algorithm on a problem in loan-loss forecasting using real bank data. Our results show that the proposed algorithm significantly reduces the average misprediction costs of models learned with the following base regression methods: linear regression, model tree, and neural network. The degree of cost reduction increases as the difference between the unit costs of the two types of errors (overprediction and underprediction) increases.
Our study makes several important contributions to the literature. First, we propose a new measure, average misprediction cost, for the finance and banking domain. Traditionally, cost-insensitive measures, such as R² and relative absolute error, have been used for regression problems. We show that the use of such measures is inappropriate for situations where costs are asymmetric. Second, to the best of our knowledge, this is the first study to present an approach for tuning data mining models, so as to minimize misprediction costs, for regression problems in a post hoc manner, that is, after the models have already been learned. As discussed later in more detail, this is a big advantage for organizations that operate in environments where the shape and parameters of the cost function could change over time. And, finally, we did not find a single study in the literature comparing the performance of multiple data mining methods for a cost-sensitive regression problem. In this study, we evaluate the performance of three popular data mining methods on a real-world forecasting problem in loan charge-offs.
Bank Loan-Loss Forecasting
REGRESSION STUDIES LARGELY DWELL UPON R² as the measure of accuracy, assigning equal weights to the deviations from the mean in both directions. This is acceptable when the concern is only with accurate predictions, and deviations in both directions are equally undesirable. This approach fails when deviations are weighted unequally, as in the case of loan charge-off predictions for banks. Underpredicting the loan charge-off amount is much riskier for a bank than overpredicting the same amount because it presents a rosier picture of an otherwise bad scenario. Accurately predicting the actual loan loss (where actual loan loss = loan charge-off – portion of the loan recovered) is important not only for banks but also for regulators and investors. Banks are required to have adequate provisions for loan losses. It is important for banks to have systems in place for forecasting loan losses. Investors and regulators are always interested in knowing whether banks are adequately prepared for their loan losses. Regulators, in particular, want to anticipate a bank's loan losses and then determine whether the bank is sufficiently prepared to face those losses or not. If the bank does not have sufficient loan-loss reserves, the consequences could be dire. Hence, it is necessary to penalize underpredictions more heavily than overpredictions, thereby discouraging banks from having less than adequate amounts as reserves.
If a bank overpredicts its loan charge-off, it has to maintain extra funds in the loan-loss reserves; hence, a possible problem is that it will experience reduced earnings (because the reserves are directly deducted from earnings), along with possibly a lower credit score from its financial analysts. But in the case of underprediction, it will not only face the wrath of regulators, accountants, and the Securities and Exchange Commission (SEC) but will also likely witness an even greater downturn in its credit ratings.
If banks predict less than what their loan losses finally turn out to be, they are not going to keep aside enough reserves. If they predict more, they are going to end up with higher reserves. Not having enough reserves causes regulatory problems. On the other hand, having higher reserves reduces income. Of the two, underprediction is more costly because the banking authorities are also concerned about classifying a failing bank as a nonproblem bank [19]. Moreover, inadequate provisioning makes the fluctuations in bank earnings magnify true oscillations in bank profitability [3].
Henderson [10] showed that poor economic performance of minority-owned banks could be the result of underprovisioning for loan losses, reflecting an inadequate assessment of risk. As he points out, "by examining the provision for loan loss, one can assess how a bank manages the choice among risk (loan default) and return" [10, p. 373].
It is important to understand that charge-off is not a forgiveness of the debt in any way. It is only an accounting entry by the one who is owed. A charge-off (or write-off) is the accounting process whereby a business acknowledges that a receivable (an asset or loan) is uncollectible. It considers the lost receivable as a charge against its earnings.
In addition to bank management, outside bodies such as the SEC, accountants, and regulators have a direct voice in the loan-loss reserve creation process. The SEC plays a role in the reserve formation by warning bank holding companies to frequently examine their potential loan losses and to adjust the reserves accordingly. The reserve adequacy is also considered by financial analysts, especially those who are responsible for the banks' credit ratings. Cares [2] identifies that the key test of adequacy of the loan-loss reserves is the multiple by which the reserve exceeds the normalized net charge-offs.
Cost-Sensitive Regression
JUST AS ERROR RATE IS NOT APPROPRIATE FOR ASSESSING model performance for cost-sensitive classification problems [8, 14], cost-neutral performance measures usually adopted in the literature (e.g., correlation coefficient, relative error) are also not appropriate performance measures for cost-sensitive regression problems. Similar to the average misclassification cost measure used for cost-sensitive classification problems (see, e.g., [8, 18]), we propose average misprediction cost as a performance measure for assessing cost-sensitive regression models. We then propose an algorithm for tuning a model learned by regular regression methods to minimize this cost.
Cost-Sensitive Performance Measure
Consider a regression problem where a continuous dependent variable y needs to be predicted based on a vector of independent variables x. A regression method learns a prediction model, f: x → y, from a training data set consisting of problem instances with known dependent variable values, S = {(x_i, y_i) | i = 1, 2, ..., N}. Assuming that a prediction error e incurs a cost characterized by a cost function C(e), we define the average misprediction cost of model f, as estimated on data set S, as

θ = (1/N) Σ_{i=1}^{N} C(f(x_i) − y_i).  (1)
Note that a performance measure estimated on the training data set is not a reliable
estimate for the true performance of a learned model; the measure should be estimated
on an independent test data set. In our experiments (reported later), we tuned the models
using the training data sets, but evaluated the performance of the tuned models using
independent test data sets.
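Equation (1) can be sketched numerically. The following is a minimal illustration, not the authors' Weka implementation; the function and variable names are ours.

```python
import numpy as np

def average_misprediction_cost(f, X, y, C):
    """Equation (1): theta = (1/N) * sum_i C(f(x_i) - y_i)."""
    errors = np.array([f(x) for x in X]) - np.array(y)
    return float(np.mean([C(e) for e in errors]))
```

As the text notes, the measure should be estimated on an independent test set rather than on the training data used to induce the model.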
Some required properties of the cost function C are

• C(e) ≥ 0;
• C(0) = 0; and
• |e_1| < |e_2| ∧ e_1 e_2 > 0 → C(e_1) < C(e_2).

In other words, the cost function is nonnegative, equals zero when there is no error, and is monotonic for each type of error.
The cost function C is necessarily problem dependent. For the bank loan-loss forecasting problem, we assume a simple linlin (linear on both sides) cost function with different slopes for underforecasting and overforecasting (illustrated in Figure 1), after the misprediction error has been normalized by the total loan amount of the bank. More complex nonlinear cost functions are possible in other problems. Conventional performance measures, such as correlation coefficient and relative error, essentially assume that C(e) = C(–e) for any misprediction error e. These measures are thus not suitable in cost-sensitive problems, where C(e) and C(–e) are in general different.
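A linlin cost function with unequal slopes, as in Figure 1, satisfies all three required properties. A minimal sketch (the slope values are illustrative, not taken from the paper):

```python
def linlin_cost(e, c_over=1.0, c_under=100.0):
    """Linear on both sides of zero, with a steeper slope for
    underprediction (e < 0) than for overprediction (e > 0)."""
    return c_over * e if e > 0 else -c_under * e
```

This function is nonnegative, equals zero at e = 0, and is monotonic on each side of zero, and it is clearly asymmetric: C(e) ≠ C(–e) whenever the two slopes differ.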
Performance Tuning Algorithm
Regular regression methods, such as linear regression, neural network, and model tree, seek to optimize cost-neutral performance measures during model induction based on the training data set. The models are therefore not optimized on the average misprediction cost. We propose a method for tuning the performance of a model trained by a regular regression method after the model induction process. Suppose we adjust the prediction of a learned regression model f by an amount δ and denote the adjusted model f′ = f + δ. The average misprediction cost of the adjusted model f′ is the following function of δ (Figure 2 shows an example):

θ(δ) = (1/N) Σ_{i=1}^{N} C(f′(x_i) − y_i) = (1/N) Σ_{i=1}^{N} C(f(x_i) + δ − y_i).  (2)
A brute force algorithm can be used to evaluate every possible δ (with a given precision) until all adjusted predictions become over- (or under-) predictions and return the δ that results in the lowest average misprediction cost, that is, δ* = argmin θ(δ). However, because θ(δ) is convex when the cost function C is convex (see Proposition 1 in the Appendix), which we believe is usually the case, a more efficient hill-climbing algorithm can be designed to locate δ*. Figure 3 lists such an algorithm.
The algorithm takes a base regression method, a training data set, a cost function, and a given precision of adjustment as inputs and returns an adjusted regression model. First, a regression model is trained using the base regression method, without considering the cost function (line 1). Then, the direction of adjustment, which leads to lower average misprediction cost, is determined (lines 2 to 4). Starting from zero adjustment (line 5), several iterations of hill climbing are then performed to approach the optimal adjustment until the number of climbing steps during an iteration falls below two and no further climbing is promising (lines 6 to 7). During each iteration of hill climbing, several climbing steps are attempted until the performance starts to decrease. To speed up the climbing, the climbing stride starts from the given precision of adjustment and is doubled after every step. Finally, an adjusted regression model with the best found adjustment is returned (line 8).

Figure 1. A Linlin Cost Function

Figure 2. Average Misprediction Cost as a Function of the Amount of Adjustment
Notes: The cost curve is based on the March 2003 data set for the bank loan-loss forecasting problem. The base regression method used is the M5 model tree. The ratio between the unit cost of underforecasting and that of overforecasting is 100:1.
The algorithm is generic and can be used to tune a model trained with any regression method for any convex cost function. The algorithm is also efficient. It can be shown (see Proposition 2 in the Appendix) that the worst-case time complexity of this algorithm is O([log n]²), assuming there are n possible δ values that need to be evaluated by a brute force algorithm.
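The hill-climbing procedure of Figure 3 can be sketched as follows, operating on a vector of cached training-set predictions rather than wrapping a Weka Classifier as the authors did. This is our reading of the pseudocode, under the assumption that the cost function C is convex; all names are illustrative.

```python
import numpy as np

def tune_offset(predictions, y, C, p=0.01):
    """Hill-climbing search for delta* = argmin theta(delta), following
    Figure 3. `predictions` are the outputs of an already trained
    regression model on the training set; C should be convex."""
    errors = np.array(predictions, dtype=float) - np.array(y, dtype=float)

    def theta(delta):  # equation (2), with model predictions cached
        return float(np.mean([C(e + delta) for e in errors]))

    # Lines 2-4: pick the direction of adjustment, or give up if neither
    # a small positive nor a small negative shift helps.
    if theta(p) < theta(0.0):
        d = 1.0
    elif theta(-p) < theta(0.0):
        d = -1.0
    else:
        return 0.0

    delta_prev = 0.0  # line 5
    while True:  # outer loop (lines 6-7)
        s = 1  # hill-climbing stride, doubled after every step
        delta = delta_prev
        while True:  # inner climbing loop (line 6.3)
            delta_prev = delta
            delta = delta + s * d * p
            s *= 2
            delta_next = delta + s * d * p
            if theta(delta_next) > theta(delta):
                break
        if s <= 2:  # only one step taken: no further climbing is promising
            break
    return delta
```

The adjusted model is then f′ = f + δ; in this sketch the caller simply adds the returned offset to any future prediction of the base model.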
Procedure and Empirical Evaluation
WE HAVE IMPLEMENTED THE PROPOSED performance tuning algorithm by extending the Weka machine learning toolkit [25] and evaluated the algorithm using real data of bank loan losses. We report on some empirical results in this section.
Cost_Sensitive_Regression (G, S, C, p)
G: a base regression method, e.g., M5.
S: a training data set, {(x_i, y_i) | i = 1, 2, ..., N}.
C: a cost function.
p: a given precision of adjustment. p > 0.
1. Train a regression model f using G based on S.
2. If θ(p) < θ(0), /* θ is defined in equation (2). */
   2.1 Set the direction of adjustment, d := 1.
3. Else if θ(–p) < θ(0),
   3.1 d := –1.
4. Else
   4.1 Return f.
5. δ_prev := 0.
6. Loop
   6.1 Set the hill-climbing stride, s := 1.
   6.2 δ := δ_prev.
   6.3 Loop
       6.3.1 δ_prev := δ.
       6.3.2 δ := δ + s * d * p.
       6.3.3 s := s * 2.
       6.3.4 δ_next := δ + s * d * p.
   6.4 Until θ(δ_next) > θ(δ).
7. Until s ≤ 2.
8. Return an adjusted regression model f′ = f + δ.

Figure 3. A Performance Tuning Algorithm for Cost-Sensitive Regression Problems
Implementation
We implemented the proposed performance tuning algorithm in Java as a subclass of the Classifier class in Weka (www.cs.waikato.ac.nz/ml/weka/). The algorithm takes four inputs: base regression method, training data set, cost function, and given precision of adjustment. We used the base regression methods and the attribute-relation file format (ARFF) of Weka. We implemented the linlin cost function. We used a 0.01 precision of adjustment, which should provide sufficient precision. Because the base regression method and the cost function are inputs to the tuning algorithm and do not need to be modified by the tuning algorithm, our program can be easily extended to work with wrapped plug-in modules of other cost functions and regression methods from other software packages such as SPSS and SAS. This methodology can therefore be easily applied in practical contexts, which may require special cost functions and particular software packages.
Base Regression Methods
We used three base regression methods available in Weka: linear regression (LR), the M5 model tree, and backpropagation neural network (NN). LR implements the standard least-squares linear regression method. M5 [15] follows a "divide and conquer," recursive partitioning heuristic search strategy and induces a tree with linear regression functions at the leaves. Backpropagation [16] is one of the most widely used neural network learning techniques for classification and regression. We kept Weka's default settings for all of the parameters.
Weka reports several cost-insensitive performance measures for these regression methods, including correlation coefficient, relative absolute error, and root relative square error. Correlation coefficient is the Pearson linear correlation between the predicted and actual values of the dependent variable. Relative absolute error is the ratio of the mean absolute error of the learned model to the mean absolute error obtained by simply predicting the mean of the training data. Similarly, root relative square error is the ratio of the mean squared error of the learned model to the mean squared error obtained by simply predicting the mean of the training data.
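Following the textual definitions above, the three measures might be computed as sketched below. These formulas are ours, not extracted from Weka; note that, per its name, the last measure takes the square root of the error ratio.

```python
import numpy as np

def correlation_coefficient(pred, actual):
    """Pearson linear correlation between predicted and actual values."""
    return float(np.corrcoef(pred, actual)[0, 1])

def relative_absolute_error(pred, actual, train_mean):
    """MAE of the model over the MAE of predicting the training mean."""
    pred, actual = np.asarray(pred), np.asarray(actual)
    return float(np.abs(pred - actual).mean()
                 / np.abs(train_mean - actual).mean())

def root_relative_squared_error(pred, actual, train_mean):
    """Square root of: MSE of the model over the MSE of predicting
    the training mean."""
    pred, actual = np.asarray(pred), np.asarray(actual)
    ratio = (((pred - actual) ** 2).mean()
             / ((train_mean - actual) ** 2).mean())
    return float(np.sqrt(ratio))
```

All three are cost neutral: each treats an overprediction and an underprediction of the same magnitude identically, which is exactly why they are unsuitable for the asymmetric-cost setting studied here.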
Data Set
We used the "bank regulatory" data set from Wharton Research Data Services (WRDS).¹ WRDS contains five databases for regulated depository financial institutions. These databases provide accounting data for bank holding companies, commercial banks, savings banks, and savings and loans institutions. The data were acquired from the required regulatory forms that were filed for supervising purposes. We used the commercial banks data set within the Bank Regulatory database for this study. The Commercial Bank database, originating from the Federal Reserve Bank of Chicago (FRB Chicago), contains data of all banks filing the Report of Condition and Income (known as the "Call Report") regulated by the Federal Reserve System, Federal Deposit Insurance Corporation (FDIC), and the Comptroller of the Currency. These reports include balance sheet, income statements, risk-based capital measures, and off-balance sheet data. The database covers commercial banks and savings banks. The data set used in this study has approximately 8,500 observations for each quarter.
The sample data set we used covers 17 quarters from March 2001² to March 2005. We used one quarter as a training set to predict the next quarter (test set). We also used aggregated four quarters as a training set to predict up to four quarters ahead, one quarter at a time. For example, the aggregated data set from March 2003 to December 2003 is used to predict the March 2004, June 2004, September 2004, and December 2004 quarters separately.
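The rolling evaluation scheme just described can be illustrated as follows; the quarter labels and helper names are our own notation.

```python
# Eight consecutive quarters, labeled for illustration.
quarters = ["2003Q1", "2003Q2", "2003Q3", "2003Q4",
            "2004Q1", "2004Q2", "2004Q3", "2004Q4"]

def single_quarter_splits(qs):
    """Each quarter trains a model that is tested on the next quarter."""
    return [([qs[i]], qs[i + 1]) for i in range(len(qs) - 1)]

def four_quarter_splits(qs):
    """An aggregated four-quarter training set predicts each of the
    following four quarters, one at a time."""
    return [(qs[:4], q) for q in qs[4:8]]
```

For example, `four_quarter_splits(quarters)` pairs the aggregated 2003 quarters with each 2004 quarter as a separate test set, mirroring the March 2003 to December 2003 example in the text.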
Variables
As noted before, the dependent variable in this study is loan charge-off (RIAD 4635).³ Based on the relevant literature, we chose a set of independent variables related to size [9], risk, and economy [12]. In addition to these three categories of variables, we also included two other categories: intangible asset variables and loan-specific variables. There are three variables in the size category: total assets, net income, and net interest income. The risk category has two variables: weighted average total assets and subordinated debt. The economy category has two variables: expense on fed funds and equity capital. The intangible asset category has two variables: intangible assets and goodwill. Loan-specific variables are total loans and lease (gross), total loans not accruing, loans 90+ days late, and interest and fee income from loans. The WRDS data set provides the definitions for these variables (summarized in Table 1).
Cost Information
As discussed earlier, underprediction hides the woes of the bank and presents a rosier picture of an otherwise bad scenario. Overprediction is less costly because the only cost associated with it is the extra provision the bank has to make in the loan-loss reserves, which may lead to lowered earnings in that particular quarter. But, compared to the case of underprediction, the financial distress is much less.
We normalized the misprediction error (difference between the predicted and actual loan charge-offs) by the total loan amount of a bank (RCFD 1400) and then applied a linlin cost function to the normalized misprediction error. The normalized misprediction error is

e = (Predicted gross charge-off − Actual gross charge-off) / Aggregate gross book value of total loans.

The cost function is

C(e) = c₊e, if e > 0; −c₋e, otherwise,

where c₊ and c₋ are the slopes for overprediction and underprediction, respectively.
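Putting the normalization and the linlin cost together, the charge-off cost computation can be sketched end to end. The 100:1 underprediction-to-overprediction slope ratio mirrors the example in the Figure 2 notes; the function and parameter names are ours.

```python
def normalized_error(predicted_chargeoff, actual_chargeoff, total_loans):
    """(Predicted - actual) gross charge-off, scaled by the aggregate
    gross book value of total loans (RCFD 1400)."""
    return (predicted_chargeoff - actual_chargeoff) / total_loans

def chargeoff_cost(e, c_over=1.0, c_under=100.0):
    """Linlin cost: slope c_over for overprediction (e > 0) and the much
    steeper slope c_under for underprediction (e < 0)."""
    return c_over * e if e > 0 else -c_under * e
```

With these slopes, an underforecast incurs one hundred times the cost of an overforecast of the same normalized magnitude, capturing the asymmetry argued for above.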
Table 1. Variables Used in the Study

RCFD 1400. Total loans and leases, gross.
Definition: The aggregate gross book value of total loans (before deduction of valuation reserves).
Explanation: Belongs to the loan category. Book value: the value at which an asset is carried on a balance sheet. For example, a piece of manufacturing equipment is put on the book at its cost when purchased [7].

RCFD 1403. Total loans and lease finance receivables: nonaccrual.
Definition: Includes the outstanding balances of loans and lease financing receivables that the bank has placed in nonaccrual status. Also includes all restructured loans and lease financing receivables that are in nonaccrual status.
Explanation: Belongs to the loan category. Receivables: accounts receivable owned by borrowers that are pledged as collateral for a loan made with the bank [5].

RCFD 1407. Total loans and lease financing receivables: past due 90 days or more and still accruing.
Definition: Includes loans and lease financing receivables on which payment is due and unpaid for 90 days or more. Also includes all restructured loans and leases.
Explanation: Belongs to the loan category.

RCFD 2143. Intangible assets.
Definition: Includes the amount of unamortized intangible assets.
Explanation: Belongs to the intangible assets category. Amortization: an accounting procedure that gradually reduces the cost value of a limited-life or intangible asset through periodic charges to income. For fixed assets the term used is depreciation, and for wasting assets (natural resources) it is depletion, both terms meaning essentially the same thing as amortization [7]. Intangible asset: a right or nonphysical resource that is presumed to represent an advantage to the firm's position in the marketplace [7].

RCFD 2170. Total assets.
Definition: The sum of all asset items. It equals "total liabilities, limited-life preferred stock, and equity capital."
Explanation: Belongs to the size category. Equity: the difference between the amount a property could be sold for and the claims held against it [7].

RCFD 3163. Goodwill.
Definition: Includes the amount (book value) of unamortized goodwill. Represents the excess of the cost of a company over the sum of the fair values of the tangible assets and identifiable intangible assets acquired less the fair value of liabilities.
Explanation: Belongs to the intangible assets category. Goodwill: an intangible asset representing going concern value in excess of the value paid by a company for another company in a purchase acquisition [7].

RCFD 3200. Subordinated notes and debentures.
Definition: Includes the amount of outstanding subordinated notes and debentures (including mandatory convertible debt).
Explanation: Belongs to the risk category. Subordinated: junior in claim on assets to other debts, that is, repayable only after other debts with a higher claim have been satisfied [7].

RCFD 3210. Total equity capital.
Definition: The sum of "perpetual preferred stock and related surplus," "common stock," "surplus," "undivided profits and capital reserves," and "cumulative foreign currency translation adjustments" less "net unrealized loss on marketable equity securities."
Explanation: Belongs to the economy category.

RCFD 4010. Total interest and fee income on loans.
Definition: Includes the total of interest and fee income and similar charges levied against all assets classified as loans in condition reports, including fees on overdrafts. Includes investigation and service charges, renewal and past due charges, commitment fees (regardless of whether the loan has been made), and fees charged for the execution of mortgages or agreements securing the bank's loans.
Explanation: Belongs to the loan category.

RIAD 4079. Total noninterest income.
Definition: Includes the sum of "income from fiduciary activities," "service charges on deposit accounts in domestic offices," "trading gains (losses) and fees from foreign exchange transactions," "other foreign transaction gains (losses)," "gains (losses) and fees from assets held in trading accounts," and "other noninterest income."
Explanation: Belongs to the size category. Fiduciary: a person, company, or association holding an asset in trust for a beneficiary. The fiduciary is charged with the responsibility of investing the money wisely for the beneficiary's benefit [7].

RCFD 4180. Expense of federal funds purchased and securities sold under agreements to repurchase.
Definition: Includes the gross expense of all liabilities included in "federal funds purchased and securities sold under agreements to repurchase."
Explanation: Belongs to the economy category. Federal funds: funds deposited by commercial banks at Federal Reserve banks, including funds in excess of bank reserve requirements. Banks may lend federal funds to each other on an overnight basis at the federal funds rate. Member banks may also transfer funds among themselves or on behalf of customers on a same-day basis by debiting and crediting balances in the various reserve banks [7].

RIAD 4340. Net income (loss).
Definition: Includes the net income (loss) for the period.
Explanation: Belongs to the size category.

RCFD A223. Risk-weighted assets (net of allowances and other deductions).
Explanation: Belongs to the risk category. Risk-based capital ratio: a FIRREA (Financial Institution Reform and Recovery Act)-imposed requirement that banks maintain a minimum ratio of estimated total capital to estimated risk-weighted assets [7].
Definition: The amount of the bank's risk-weighted assets net of all deductions. The amount reported in this item is the denominator of the bank's total risk-based capital ratio. When determining the amount of risk-weighted assets, on-balance sheet assets are assigned an appropriate risk weight (0 percent, 20 percent, 50 percent, or 100 percent) and off-balance sheet items are first converted to a credit equivalent amount and then assigned to one of the four risk weight categories. The on-balance sheet assets and the credit equivalent amounts of off-balance sheet items are then multiplied by the appropriate risk weight percentages and the sum of these risk-weighted
amounts, less certain deductions, is the bank’s gross
risk-weighted assets.
Numerator of the dependent variable. • on loans and charge-offs The amount of gross
Charge-offs on RIAD
Charge-off (bad debt)—open account balance or loan • leases during the calendar year-
to-date. allowance for loan 4635
receivable that has proven uncollectible and is written and lease losses
off. Traditionally, companies and financial institutions
for uncollectible accounts, reserve have maintained a
charging the reserve for actual bad debts and making
annual, tax-deductible charges to income to replenish or
increase the reserve [7]. Reserves—a portion of the
bank’s funds that has been set aside for the purpose of
assuring its ability to meet its liabilities to depositors in
cash [5].
Notes:
1
DS data set). r eport of Income (W r D variable—from the aI r D: aI r2
DS data set). r eport of Condition (W r CFD variable—from the r CFD: r328
BaNSal, SINha, aND ZhaO
Consistent with the above logic, we used steeper cost slopes for underprediction as compared to overprediction; that is, c− > c+. We fixed c+ at 1 and manipulated c−. In particular, we examined the following cost ratios (c− to c+): 100:1, 50:1, 20:1, and 10:1. For the sake of completeness, we also examined the cost-insensitive case, where the cost ratio is 1:1.
Results

Table 2 presents the results of the LR, NN, and M5 models built using four-quarter data. The cost figures shown are the mean costs for each base regression method (LR, NN, or M5) and tuning (without or with) combination; the means were computed by averaging the costs across the five cost ratios (1:1, 10:1, 20:1, 50:1, and 100:1). For each method, the costs go down when the models are tuned. M5 is the best performer, followed by NN and LR.
A 3 × 2 factorial analysis of variance (ANOVA) procedure with method (3 values) and tuning (2 values) as the factors and average misprediction cost as the dependent variable was conducted to test for the significance of the effects. The cost ratio variable was used as a covariate to control for its effects on cost. Both the main effects were significant at the 0.001 level. The interaction effect between method and tuning was not significant (p = 0.723). We can therefore conclude unambiguously that both method (F = 81.857) and tuning (F = 31.823) have a significant influence on misprediction costs. Pairwise comparisons between the methods indicated that M5 was significantly better than LR and NN (p < 0.001), and NN was significantly better than LR (p < 0.001). The significance levels were adjusted for multiple comparisons using the Bonferroni method. Also, we found that the tuned models performed significantly better than the untuned ones (p < 0.001).
Table 3 presents results analogous to Table 2 but for models using one-quarter data. The results are similar to those of Table 2, with M5 performing the best, followed by NN and LR. The ANOVA produced similar results, with both method (F = 77.209) and tuning (F = 25.503) turning out to be significant factors influencing cost (p < 0.001); the interaction effect was not significant (p = 0.884). Pairwise comparisons between the methods yielded similar results, with M5 significantly better than LR and NN (p < 0.001), and NN significantly better than LR (p < 0.001). As before, the tuned models performed significantly better than the untuned ones (p < 0.001).
Tables 2 and 3 were generated by aggregating over all the cost ratios. Tables 4 and 5 show the costs separately for each cost ratio using a specific method and a tuning condition. For the 1:1 cost ratio, tuning is not needed, so only one column is shown for that ratio. For all other cost ratios, we found that costs always go down with tuning, and M5 yields the lowest cost, followed by NN and LR. The results hold for models based on four-quarter data as well as one-quarter data.

Next, we conducted separate statistical tests for each cost ratio (other than 1:1) to examine the effects of method and tuning. In particular, we conducted paired t-tests on the cost results for the same bank across the two tuning scenarios. That is, each pair represents without-tuning and with-tuning costs for a specific bank. All the tests yielded significant results, both for one-quarter and four-quarter data, indicating that tuning significantly improves cost performance.
Table 6 summarizes the cost-insensitive performance measures reported by Weka for the base regression methods, including correlation coefficient, relative absolute error, and root relative square error. An important thing to note is that none of these coefficients reflects the asymmetric cost function inherent in the problem at hand. Moreover, there is no unanimity on which method is the best. For example, if four-quarter data is used, LR is the best with respect to correlation coefficient, M5 is the best with respect to relative absolute error, and NN is the best with respect to root relative square error.
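For reference, the three measures reported in Table 6 can be computed directly. The sketch below uses our own function name and illustrative numbers; the relative absolute error (RAE) and root relative square error (RRSE) are defined, as in Weka, relative to a predictor that always outputs the mean actual value.

```python
import math

def weka_style_measures(preds, actuals):
    """Cost-insensitive measures as reported by Weka: the correlation
    coefficient between predictions and actuals, the relative absolute
    error (RAE), and the root relative square error (RRSE). RAE and
    RRSE scale the model's errors by those of a mean-only predictor."""
    n = len(actuals)
    mean_a = sum(actuals) / n
    mean_p = sum(preds) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(preds, actuals))
    corr = cov / math.sqrt(sum((p - mean_p) ** 2 for p in preds) *
                           sum((a - mean_a) ** 2 for a in actuals))
    rae = (sum(abs(p - a) for p, a in zip(preds, actuals)) /
           sum(abs(a - mean_a) for a in actuals))
    rrse = math.sqrt(sum((p - a) ** 2 for p, a in zip(preds, actuals)) /
                     sum((a - mean_a) ** 2 for a in actuals))
    return corr, rae, rrse

# Illustrative numbers: an underprediction and an overprediction of
# the same magnitude move each of these measures by exactly the same
# amount, which is the asymmetry problem noted in the text.
corr, rae, rrse = weka_style_measures([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```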
Discussion of Results

In this study, we first argued for a measure to assess the performance of cost-sensitive regression models. Given that R² (which is based on squared errors) and other traditional measures are not appropriate when errors on the two sides have unequal consequences, we proposed a measure, average misprediction cost, which weights the two types of errors differently. We developed an algorithm for tuning a trained model based on this misprediction cost. We ran a series of experiments using three types of regression models (LR, NN, and M5) on loan charge-off data from U.S. banks and found that tuning significantly reduced the misprediction cost of these models.
Table 2. Results Based on Four Quarters Across All Cost Ratios (N = 150)

             Cost without tuning        Cost with tuning
Method       Mean      Std. dev.        Mean      Std. dev.
LR           5.582     4.833            4.552     4.222
NN           3.890     3.274            2.685     2.268
M5           2.669     2.002            1.823     1.015

Note: Cost values are on a scale of 10^-2.
Table 3. Results Based on One Quarter Across All Cost Ratios (N = 80)

             Cost without tuning        Cost with tuning
Method       Mean      Std. dev.        Mean      Std. dev.
LR           5.756     3.974            4.620     3.121
NN           3.295     2.711            2.376     1.419
M5           2.859     2.357            1.921     1.440

Note: Cost values are on a scale of 10^-2.
Table 4. Effects of Tuning on Costs Using Four-Quarter Data

                                   Cost ratio
         1:1     10:1             20:1             50:1             100:1
Method           Untuned  Tuned   Untuned  Tuned   Untuned  Tuned   Untuned  Tuned
LR       2.937   3.613    3.589   4.364    4.152   6.618    5.364   10.375   6.717
NN       1.429   2.058    1.958   2.757    2.395   4.855    3.295    8.351   4.347
M5       0.859   1.322    1.268   1.836    1.590   3.379    2.285    5.949   3.113

Note: Cost values are on a scale of 10^-2.
Table 5. Effects of Tuning on Costs Using One-Quarter Data

                                   Cost ratio
         1:1     10:1             20:1             50:1             100:1
Method           Untuned  Tuned   Untuned  Tuned   Untuned  Tuned   Untuned  Tuned
LR       3.311   3.936    3.898   4.630    4.352   6.714    5.267   10.187   6.272
NN       1.221   1.751    1.717   2.340    2.113   4.108    2.953    7.054   3.880
M5       0.952   1.440    1.388   1.982    1.719   3.607    2.396    6.316   3.151

Note: Cost values are on a scale of 10^-2.
Table 6. Cost-Insensitive Regression Performance Measures

                                Training data
                         One quarter    Four quarters
Correlation coefficient
  LR                     0.81           0.85
  M5                     0.88           0.81
  NN                     0.86           0.82
Relative absolute error
  LR                     0.70           0.64
  M5                     0.56           0.45
  NN                     1.00           0.55
Root relative square error
  LR                     0.71           0.59
  M5                     0.68           0.59
  NN                     0.92           0.55
One major finding is the relative consistency of the tuning algorithm across the different cost ratios, and also across the different methods. The results of the study clearly demonstrate the effectiveness of the approach we used for tuning the models. The models themselves were generated without explicitly incorporating differential costs during training. That is, the trained models are independent of costs. Only after the models were generated were their predictions adjusted to account for costs. The advantage of this approach is that the models remain invariant even if the cost ratio or the asymmetric cost function changes. For example, if a linex cost function is used instead of linlin, the trained models would remain the same; the tuning algorithm would then find a different adjustment for minimizing the average misprediction cost. There is therefore no need for banks to develop new models for forecasting loan charge-offs when the cost function or cost ratio changes. A regression model has to be generated only once; when the cost function or ratio changes, the model is tuned by adjusting its prediction.
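This invariance can be sketched as follows. Everything below is illustrative (our own toy predictions and parameters), and a simple grid search stands in for the paper's hill-climbing tuner. The trained predictions are computed once; switching from linlin to a Varian/Zellner-style linex loss [24, 26] only changes the adjustment that minimizes average misprediction cost:

```python
import math

def avg_cost(preds, actuals, delta, cost):
    """Average misprediction cost of predictions shifted by delta."""
    return sum(cost((p + delta) - a) for p, a in zip(preds, actuals)) / len(preds)

def best_delta(preds, actuals, cost, deltas):
    """Grid-search stand-in for the paper's hill-climbing tuner."""
    return min(deltas, key=lambda d: avg_cost(preds, actuals, d, cost))

# The trained model's predictions are computed once and reused as-is.
preds   = [0.010, 0.025, 0.040]
actuals = [0.020, 0.020, 0.050]

# linlin with a 20:1 slope ratio (underprediction penalized harder)
linlin = lambda e: 20.0 * -e if e < 0 else e
# linex-style loss b(exp(a*e) - a*e - 1) with a = -200, b = 1:
# exponential on the underprediction side, roughly linear otherwise
linex = lambda e: math.exp(-200.0 * e) + 200.0 * e - 1.0

grid = [i / 1000 for i in range(-50, 51)]
d_linlin = best_delta(preds, actuals, linlin, grid)  # 0.010
d_linex  = best_delta(preds, actuals, linex, grid)   # 0.008
# The model is untouched; only the adjustment changes with the cost.
```

Both slopes push the adjustment upward (toward overprediction), but by different amounts; the underlying predictions never need to be regenerated.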
The performance of the tuning algorithm, along with the resultant costs, remained consistent across the two data sets. Models trained on one-quarter data sets performed nearly as well as those trained on four-quarter data sets. Including more quarters in the training data set increases the sample size and enhances training. On the other hand, older data tends to be less predictive than more recent data. Using just the previous quarter to predict the current quarter is thus adequate and computationally more economical.
In general, we found that M5 provides the least "costly" predictions, as compared to NN and LR. In this case, LR had the worst performance. Moreover, M5 was found to consistently perform better across the different cost ratios as well as across the different data sets (one quarter versus four quarters). This is useful to know, especially because M5 is very efficient for training purposes; the training time for M5 is much smaller than that for other models such as NN.
Conclusion and Future Research

In this study, we proposed an approach for improving the efficacy of data mining methods for cost-sensitive regression problems. More specifically, we considered the case of predicting loan charge-offs for U.S. banks. The results from a detailed empirical evaluation validate the effectiveness of the proposed approach.
Past research in information systems may have ignored misprediction costs because it is difficult to incorporate those costs into regression models during learning. However, we did not explicitly incorporate costs into any of the regression models that were generated using the training data. Rather, we proposed an algorithm that tunes the output of a trained model post hoc. A major contribution of our research is in making a trained regression model cost sensitive. In addition to the fact that regression models tuned post hoc do not have to be rebuilt every time the cost function or cost ratio changes, they are also much simpler and easier to implement than those that explicitly incorporate costs. Such a post hoc tuning method can be easily adopted and used by decision makers for addressing real-world forecasting problems. An advantage of incorporating costs directly into the models during training, however, is that there is no need to go through the two different steps required for post hoc tuning: model building and model tuning. Future studies could examine the effectiveness of the post hoc tuning approach vis-à-vis the cost-incorporated approach for addressing regression problems.
Sinha and May [18] also proposed an approach for tuning data mining models post hoc, but it was restricted to binary classification problems. Our proposed approach, on the other hand, is for regression problems, where the dependent variable is continuous, not categorical. Sinha and May assigned unequal costs to false positives and false negatives. Similarly, we assigned different costs to underprediction and overprediction, but with the difference that our costs were a function of the prediction error, whereas Sinha and May used fixed costs for the two types of misclassification. A final difference between the two studies is in the method used for tuning the models post hoc. For model tuning, Sinha and May identified optimal decision thresholds of the classifiers by using the results of receiver operating characteristic (ROC) curves. In contrast, we adjust the predictions of regression models using a hill-climbing algorithm. In their study, tuning was accomplished by adjusting a parameter (the decision threshold), while in our study, it is achieved by adjusting the output (the prediction).
We used the linlin cost function to analyze the efficacy of our approach. Different cost ratios were used, and the results were compared to identify how the cost ratios could impact the misprediction costs for each method. The results held across all cost ratios and all three methods; tuning the models based on the proposed algorithm invariably resulted in better performance.
The findings of this study have interesting implications for both research and practice. Our study presents an approach to conducting misprediction cost analysis for regression problems. As discussed earlier, studies in cost-sensitive data mining have been largely confined to classification problems, which typically use misclassification cost as the performance measure. The findings of this research may be helpful not only to bank practitioners trying to forecast charge-offs with the minimum possible associated costs, but also to researchers, who could exploit the misprediction cost analysis technique discussed in this paper for future research. Practitioners would find the concepts, ideas, and techniques generated in this study to be attractive and potentially applicable for forecasting applications in other business domains.
In this study, we presented an empirical approach to evaluating and tuning data mining models for regression. While most studies have used R² as the sole measure, we presented a cost-sensitive approach to tuning and evaluating the methods. This paper opens up several avenues for future research. First, while we evaluated the proposed algorithm on one regression problem with the linlin cost function, future research could test our approach on other cost-sensitive regression problems, possibly incorporating different cost functions. Second, while we proposed a post hoc tuning method, other methods that incorporate cost information during training could be investigated and compared to post hoc methods. Finally, we used three data mining methods for studying the cost-sensitive regression problem. Future studies could be more comprehensive by including other data mining methods and examining whether the results hold across different problem domains.
In summary, our study proposes a new measure, average misprediction cost, which is optimized for a data mining model using a post hoc tuning algorithm. The tuning significantly brought down the costs for all the different models across all cost ratios and data sets, thus validating the efficacy of our approach.
Notes

1. https://wrds.wharton.upenn.edu.
2. Month in this case represents the quarter ending at that month. For example, March 2002 represents the quarter January 2002–March 2002.
3. Variable number as denoted in the WRDS data set.
References

1. Anandarajan, M. Profiling Web usage in the workplace: A behavior-based artificial intelligence approach. Journal of Management Information Systems, 19, 1 (Summer 2002), 243–266.
2. Cares, D.C. What's an adequate loan-loss reserve? ABA Banking Journal, 77, 3 (1985), 41–43.
3. Cavallo, M., and Majnoni, G. Do banks provision for bad loans in good times? Empirical evidence and policy implications. Policy Research Working Paper, World Bank, Washington, DC, 2001 (available at http://econ.worldbank.org/external/default/main?pagePK=64165259&theSitePK=469372&piPK=64165421&menuPK=64166322&entityID=000094946_01071204123124).
4. Crone, S.F.; Lessmann, S.; and Stahlbock, R. Utility based data mining for time series analysis: Cost-sensitive learning for neural network predictors. In G. Weiss, M. Saar-Tsechansky, and B. Zadrozny (eds.), Proceedings of the First International Workshop on Utility-Based Data Mining. New York: ACM Press, 2005, pp. 59–68.
5. Davids, L.E. Dictionary of Banking and Finance. Lanham, MD: Rowman and Littlefield, 1978.
6. Domingos, P. MetaCost: A general method for making classifiers cost sensitive. In U. Fayyad, S. Chaudhuri, and D. Madigan (eds.), Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press, 1999, pp. 155–164.
7. Downes, J., and Goodman, J.E. Dictionary of Finance and Investment Terms. Hauppauge, NY: Barron's Educational Series, 2006.
8. Elkan, C. The foundations of cost-sensitive learning. In B. Nebel (ed.), Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann, 2001, pp. 973–978.
9. Handorf, W.C., and Zhu, L. Credit risk management and bank size. Commercial Lending Review, 20, 1 (2005), 27–34.
10. Henderson, C.C. The economic performance of African-American-owned banks: The role of loan loss provisions. American Economic Review, 89, 2 (1999), 372–376.
11. Jain, B.A., and Nag, B.N. Performance evaluation of neural network decision models. Journal of Management Information Systems, 14, 2 (Fall 1997), 201–216.
12. Keeton, W.R., and Morris, C.S. Why do banks' loan losses differ? Economic Review, 72, 5 (1987), 3–21.
13. Kim, C.N., and McLeod, R. Expert, linear models, and nonlinear models of expert decision making in bankruptcy prediction: A lens model analysis. Journal of Management Information Systems, 16, 1 (Summer 1999), 189–206.
14. Provost, F.; Fawcett, T.; and Kohavi, R. The case against accuracy estimation for comparing induction algorithms. In J.W. Shavlik (ed.), Proceedings of the Fifteenth International Conference on Machine Learning. San Francisco: Morgan Kaufmann, 1998, pp. 445–453.
15. Quinlan, J.R. Learning with continuous classes. In N. Adams and L. Sterling (eds.), Proceedings of the Fifth Australian Joint Conference on Artificial Intelligence. Singapore: World Scientific, 1992, pp. 343–348.
16. Rumelhart, D.E.; Hinton, G.E.; and Williams, R.J. Learning internal representations by error propagation. In D.E. Rumelhart and J.L. McClelland (eds.), Parallel Distributed Processing. Cambridge, MA: MIT Press, 1986, pp. 318–362.
17. Sarkar, S., and Sriram, R.S. Bayesian models for early warning of bank failures. Management Science, 47, 11 (2001), 1457–1475.
18. Sinha, A.P., and May, J.H. Evaluating and tuning predictive data mining models using receiver operating characteristic curves. Journal of Management Information Systems, 21, 3 (Winter 2004–5), 249–280.
19. Sinkey, J.F. Identifying "problem" banks: How do the banking authorities measure a bank's risk exposure? Journal of Money, Credit and Banking, 10, 2 (1978), 184–193.
20. Sung, T.K.; Chang, N.; and Lee, G. Dynamics of modeling in data mining: Interpretive approach to bankruptcy prediction. Journal of Management Information Systems, 16, 1 (Summer 1999), 63–86.
21. Tam, K.Y., and Kiang, M.Y. Managerial applications of neural networks: The case of bank failure predictions. Management Science, 38, 7 (1992), 926–947.
22. Thompson, R.D., and Basu, A.P. Asymmetric loss functions for estimating system reliability. In D.A. Berry, K.M. Chaloner, and J.K. Geweke (eds.), Bayesian Analysis in Statistics and Econometrics. New York: John Wiley & Sons, 1996, pp. 471–482.
23. Ting, K.M. An instance-weighting method to induce cost-sensitive trees. IEEE Transactions on Knowledge and Data Engineering, 14, 3 (2002), 659–665.
24. Varian, H.R. A Bayesian approach to real estate assessment. In S.E. Fienberg and A. Zellner (eds.), Studies in Bayesian Econometrics and Statistics: In Honor of Leonard J. Savage. Amsterdam: North-Holland, 1974, pp. 195–208.
25. Witten, I.H., and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques, 2d ed. San Francisco: Morgan Kaufmann, 2005.
26. Zellner, A. Bayesian estimation and prediction using asymmetric loss functions. Journal of the American Statistical Association, 81, 394 (1986), 446–451.
27. Zhao, H. A multi-objective genetic programming approach to developing Pareto optimal decision trees. Decision Support Systems, 43, 3 (April 2007), 809–826.
28. Zhou, L.; Burgoon, J.K.; Twitchell, D.P.; Qin, T.; and Nunamaker, J. A comparison of classification methods for predicting deception in computer-mediated communication. Journal of Management Information Systems, 20, 4 (Spring 2004), 139–165.
Appendix: Propositions

Proposition 1

The average misprediction cost of an adjusted regression model, θ(δ), is a convex function with regard to the amount of adjustment, δ, if the cost function, C(e), is a convex function with regard to the prediction error, e.

Proof

Since the adjustment, δ, is applied on a regression model, f, after the model has been trained, the prediction of f on a given problem instance, f(x_i), i = 1, 2, ..., N, is constant irrespective of δ. The prediction error of f on a problem instance, f(x_i) − y_i, i = 1, 2, ..., N, is therefore also constant.

If the cost function, C(e), is convex with regard to the prediction error, e, the misprediction cost of the adjusted model, f′, on a problem instance, C(δ + (f(x_i) − y_i)), i = 1, 2, ..., N, is convex with regard to the adjustment, δ. The summation

Σ_{i=1}^{N} C(δ + (f(x_i) − y_i))

is then convex with regard to δ. The average misprediction cost of the adjusted model,

θ(δ) = (1/N) Σ_{i=1}^{N} C(δ + (f(x_i) − y_i)),

is therefore convex with regard to δ. Q.E.D.
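The proposition can be sanity-checked numerically; the toy predictions, actuals, and 50:1 slopes below are arbitrary choices of ours. For the convex linlin cost, the resulting θ(δ) satisfies the midpoint convexity inequality at every pair of adjustments tested:

```python
def theta(delta, preds, actuals, cost):
    """Average misprediction cost of the adjusted model: the adjustment
    delta is added to each fixed prediction error f(x_i) - y_i."""
    return sum(cost(delta + (p - a)) for p, a in zip(preds, actuals)) / len(preds)

linlin = lambda e: 50.0 * -e if e < 0 else e   # convex cost, 50:1 slopes
preds, actuals = [0.01, 0.03, 0.02], [0.02, 0.02, 0.04]

deltas = [i / 100 for i in range(-10, 11)]
for d1 in deltas:
    for d2 in deltas:
        mid = theta((d1 + d2) / 2, preds, actuals, linlin)
        avg = (theta(d1, preds, actuals, linlin) +
               theta(d2, preds, actuals, linlin)) / 2
        assert mid <= avg + 1e-12   # midpoint convexity holds
```

Convexity is what licenses the hill climbing in the tuning algorithm: any local minimum of θ(δ) is global.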
Proposition 2

The worst-case time complexity of the performance tuning algorithm listed in Figure 3 is O([log n]²), assuming there are n possible δ values that need to be evaluated by a brute force algorithm.

Proof

It is apparent that most of the computation time for finding the optimal δ value is spent on the hill-climbing procedure (lines 6 to 7), while the time spent on the initial determination of the climbing direction (lines 2 to 5) is negligible. The training of the original regression model (line 1) is out of the scope of the performance tuning.

The entire hill-climbing procedure consists of several search phases (the outer loop). During each search phase (an iteration of the outer loop), several trials are made (the inner loop, lines 6.3 to 6.4). The stride of each subsequent trial doubles that of the previous trial (line 6.3.3). In the worst case, the last trial reaches the outermost boundary of the current search phase. Similar to binary search, the number of trials during the first search phase is at most log₂ n, assuming there are n possible δ values that need to be evaluated by a brute force algorithm. Each search phase reduces the search range by at least half. Again, similar to binary search, the number of search phases is also at most log₂ n. The total number of trials over all search phases is at most

log₂ n + log₂(n/2) + ... + 1.

The worst-case time complexity of the hill-climbing procedure is therefore

O(log₂ n + log₂(n/2) + ... + 1)
= O([log₂ n][log₂ n + 1]/2)
= O([log₂ n]²) (by discarding constant and lower-order terms)
= O([log n]²) (by discarding the base of the log).

Q.E.D.
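The counting argument can be illustrated with a simplified reimplementation of a doubling-stride search (our own sketch; the bracketing rule is adapted so the code is self-contained and terminating, not a line-by-line copy of Figure 3). Each phase makes at most O(log n) doubling trials and shrinks the bracket around the minimizer, so the total number of cost evaluations stays polylogarithmic in the grid size:

```python
def convex_min_on_grid(f, lo, hi):
    """Find a minimizer of a convex function f on the integer grid
    lo..hi using doubling-stride probes, counting f evaluations.
    Each phase marches from the left end of the bracket with strides
    1, 2, 4, ... until a trial stops improving; convexity then lets
    us shrink the bracket and start the next phase."""
    evals = [0]
    memo = {}
    def g(x):
        if x not in memo:
            evals[0] += 1
            memo[x] = f(x)
        return memo[x]

    while hi - lo > 3:
        prev, stride = lo, 1
        while True:
            cur = min(prev + stride, hi)
            if cur == prev:                 # climbed all the way to hi
                lo = prev
                break
            if g(cur) >= g(prev):           # overshot the minimum:
                # convexity brackets it between the probe before the
                # last improving one and the first non-improving one
                lo, hi = max(lo, prev - stride // 2), cur
                break
            prev, stride = cur, stride * 2  # trial stride doubles
    best = min(range(lo, hi + 1), key=g)    # scan the tiny final bracket
    return best, evals[0]

argmin, n_evals = convex_min_on_grid(lambda x: (x - 700) ** 2, 0, 10**6)
# Finds x = 700 with far fewer evaluations than the 10**6 + 1 a
# brute-force scan over the grid would need.
```

On this example the search needs only a few dozen evaluations of f, consistent with the O([log n]²) bound, versus about a million for an exhaustive scan.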