Review of Fraud Classification Using PCA and RIDITS

Review of Fraud Classification Using Principal Components Analysis of RIDITS
By Louise A. Francis
Francis Analytics and Actuarial Data Mining, Inc.
Objectives
Address the question: why use a new method, PRIDIT?
Introduce other methods used in similar circumstances
Explain how PRIDIT adds to the methods available
Explain the limitations of PRIDIT/RIDIT
A Key Problem in Fraud Modeling

Most data mining methods need a target (dependent) variable
$Y = a + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$
Fraud (Yes/No or Fraud Score) = f(predictor variables)
Need sample of data where claims have been determined to be fraudulent or legitimate
Dependent variable hard to get

In a large sample of automobile insurance claims, perhaps 1/3 may have an element of abuse or fraud
Scarce resources are not expended on such large volumes of claims to determine their legitimacy
Only a small percentage are referred to SIU investigators or other investigations
There are time lags in determining the outcome of investigations
Unsupervised learning
Another approach, one that does not require a dependent variable
Two key kinds:
Cluster Analysis
Principal Components/Factor Analysis
PRIDIT uses the principal components approach
It is applied to ordered categorical variables
Cluster Analysis
Records are grouped into categories that have similar values on the variables
Examples:
Marketing: people with similar values on demographic variables (e.g., age, gender, income) may be grouped together for marketing
Text analysis: words that tend to occur together are used to classify documents
Note: no dependent variable used in analysis
Clustering
Common methods: k-means, hierarchical
No dependent variable: records are grouped into classes with similar values on the variables
Start with a measure of similarity or dissimilarity
Maximize dissimilarity between members of different clusters
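The k-means idea named above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the study: the toy records are hypothetical, and taking the first k records as starting centroids is an assumption made only so the run is reproducible (random starts are more common).

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    # Assumption: the first k records serve as the starting centroids.
    centroids = X[:k].astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every record to every centroid.
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centroids = centroids.copy()
        for c in range(k):
            members = X[labels == c]
            if len(members):  # keep the old centroid if a cluster empties
                new_centroids[c] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of records.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
```

No dependent variable appears anywhere in the sketch: the grouping comes only from the distances between records.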
Dissimilarity (Distance) Measure: Continuous Variables

Euclidean Distance
$d_{ij} = \left( \sum_{k=1}^{m} (x_{ik} - x_{jk})^2 \right)^{1/2}$, where i, j = records and k = variable
Manhattan Distance
$d_{ij} = \sum_{k=1}^{m} |x_{ik} - x_{jk}|$
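As a quick illustration of the two distance measures above, the following sketch computes both for a pair of records. The function names and the two example records are hypothetical; only the formulas come from the slide.

```python
import numpy as np

def euclidean(x_i, x_j):
    """Square root of the summed squared differences over the m variables."""
    return float(np.sqrt(np.sum((x_i - x_j) ** 2)))

def manhattan(x_i, x_j):
    """Sum of absolute differences over the m variables."""
    return float(np.sum(np.abs(x_i - x_j)))

# Two hypothetical records measured on m = 3 continuous variables.
x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([4.0, 6.0, 3.0])

print(euclidean(x_i, x_j))  # 5.0
print(manhattan(x_i, x_j))  # 7.0
```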
Binary Variables
                      Column Variable
                      1        0        Total
Row Variable   1      a        b        a+b
               0      c        d        c+d
Total                 a+c      b+d
Binary Variables
Simple Matching
$d_{ij} = \frac{b + c}{a + b + c + d}$
Rogers and Tanimoto

$d_{ij} = \frac{2(b + c)}{(a + d) + 2(b + c)}$
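The two binary dissimilarity measures above can be sketched as follows. The sketch assumes the 2x2 counts are taken over the binary variables of two records, with a = both 1, b = first record 1 and second 0, c = first 0 and second 1, d = both 0. The function names and example records are hypothetical.

```python
def binary_counts(u, v):
    """Return the 2x2 cell counts (a, b, c, d) for two 0/1 records."""
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(u, v) if x == 0 and y == 0)
    return a, b, c, d

def simple_matching(u, v):
    """Share of variables on which the two records disagree."""
    a, b, c, d = binary_counts(u, v)
    return (b + c) / (a + b + c + d)

def rogers_tanimoto(u, v):
    """Like simple matching, but disagreements carry double weight."""
    a, b, c, d = binary_counts(u, v)
    return 2 * (b + c) / ((a + d) + 2 * (b + c))

# Two hypothetical claims described by six yes/no indicators.
u = [1, 1, 0, 0, 1, 0]
v = [1, 0, 0, 1, 1, 0]
print(binary_counts(u, v))    # (2, 1, 1, 2)
print(rogers_tanimoto(u, v))  # 0.5
```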
Example: Fraud Data

Data from the 1993 closed claim study conducted by the Automobile Insurers Bureau of Massachusetts
Claim files often have variables that may be useful in assessing suspicion of fraud, but a dependent variable is often not available
Variables used for clustering:
Legal representation
Prior claim
SIU investigation
At fault
Police report
Number of providers
Statistics for Clusters

Based on descriptive statistics, Cluster 2 appears to have a higher likelihood of fraudulent claims (more about this later)

Cluster   Police Report   Medical Audit   At Fault   Legal Rep   SIU Investigation   Number Providers
          (% Yes)         (% Yes)         (% Yes)    (% Yes)     (% Yes)
1         46.7%           0.1%            42.2%      6.1%        0.0%                2
2         49.8%           5.9%            2.4%       96.0%       6.5%                4
Principal Components Analysis

A form of dimension (variable) reduction
Suppose we want to combine all the information related to the financial dimension of fraud:
Medical provider bill (indicative of padding claim)
Hospital bill
Number of providers
Economic losses
Claimed wages
Incurred losses
Principal Components
These variables are correlated but not perfectly correlated
We replace many variables with a weighted sum of the variables
Correlation Matrix for Variables

Correlations
                   Number      Medical   Provider   Economic              Hospital
                   Providers   Bill      Paid       Losses     Incurred   Pymt
Number Providers   1.000       0.387     0.571      0.382      0.382      0.168
Medical Bill       0.387       1.000     0.539      0.952      0.952      0.922
Provider Paid      0.571       0.539     1.000      0.531      0.531      0.327
Economic Losses    0.382       0.952     0.531      1.000      1.000      0.888
Incurred           0.382       0.952     0.531      1.000      1.000      0.888
Hospital Pymt      0.168       0.922     0.327      0.888      0.888      1.000
Finding Factor or Component

The correlation matrix is used to find the factor that explains the most variance (captures most of the correlation) for the set of variables
The component or factor extracted will be a weighted average of the variables
More than one component or factor may result from applying the method
Evaluating Importance of Variables

Use factor loadings

Component Matrix
Variable           Loading
Number Providers   0.497
Medical Bill       0.974
Provider Paid      0.646
Economic Losses    0.976
Incurred           0.976
Hospital Pymt      0.886
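One way to see where such loadings come from is to take the correlation matrix shown earlier and extract its largest eigenvalue and eigenvector. The sketch below does this with NumPy. It assumes the first component is defined as the top eigenvector of the correlation matrix, with loadings equal to that eigenvector scaled by the square root of its eigenvalue. This is the usual convention, but the slides do not say which software or options the author used, so the printed values are not guaranteed to match the table exactly.

```python
import numpy as np

names = ["Number Providers", "Medical Bill", "Provider Paid",
         "Economic Losses", "Incurred", "Hospital Pymt"]

# Correlation matrix as given on the slide.
R = np.array([
    [1.000, 0.387, 0.571, 0.382, 0.382, 0.168],
    [0.387, 1.000, 0.539, 0.952, 0.952, 0.922],
    [0.571, 0.539, 1.000, 0.531, 0.531, 0.327],
    [0.382, 0.952, 0.531, 1.000, 1.000, 0.888],
    [0.382, 0.952, 0.531, 1.000, 1.000, 0.888],
    [0.168, 0.922, 0.327, 0.888, 0.888, 1.000],
])

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(R)
top_value = eigvals[-1]
top_vector = eigvecs[:, -1]
if top_vector.sum() < 0:  # the sign of an eigenvector is arbitrary
    top_vector = -top_vector

# Loading = correlation between each variable and the component.
loadings = top_vector * np.sqrt(top_value)
share = top_value / R.shape[0]  # proportion of total variance explained

for name, loading in zip(names, loadings):
    print(f"{name:18s}{loading:6.3f}")
print(f"Variance explained by first component: {share:.1%}")
```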
Problem: Categorical Variables

It is not clear how to best perform Principal Components/Factor Analysis on categorical variables
The categories may be coded as a series of binary dummy variables
If the categories are ordered, dummy coding may lose important information about that order
This is the problem that PRIDIT addresses
RIDIT
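Variables are ordered so that the lowest value is associated with the highest probability of fraud
Use the cumulative distribution of claims at each value, i, to create the RIDIT statistic for variable t, value i:
$R_{ti} = \sum_{j < i} p_{tj} - \sum_{j > i} p_{tj}$
where $p_{tj}$ is the proportion of claims with value j on variable t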
Example: RIDIT for Legal Representation
Legal Representation

Value   Code   Number   Proportion   Proportion Below   Proportion Above   RIDIT
Yes     1      706      0.504        0.000              0.496              -0.496
No      2      694      0.496        0.504              0.000              0.504
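The legal representation example above can be reproduced with a short function. This is a minimal sketch that assumes the RIDIT for a category is the proportion of claims in lower-ranked categories minus the proportion in higher-ranked categories, which is what the Below and Above columns imply. The function name `ridit_scores` is hypothetical.

```python
def ridit_scores(counts):
    """RIDIT score for each category, given counts listed in rank order.

    Score = (proportion in lower-ranked categories)
          - (proportion in higher-ranked categories).
    """
    total = sum(counts)
    props = [c / total for c in counts]
    scores = []
    for i in range(len(props)):
        below = sum(props[:i])
        above = sum(props[i + 1:])
        scores.append(below - above)
    return scores

# Legal representation: 706 claims coded 1 (Yes), 694 coded 2 (No).
scores = ridit_scores([706, 694])
print([round(s, 3) for s in scores])  # [-0.496, 0.504]
```

The same function works for variables with more than two ordered categories, which is where the transformation preserves information that dummy coding would discard.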
PRIDIT
Use RIDIT statistics in Principal Components Analysis

Component Matrix (a)
Variable        Component 1
SIU             .248
Police Report   .220
At Fault        .709
Legal Rep       .752
Medical Audit   .341
Prior Claim     .406
Extraction Method: Principal Component Analysis.
a. 1 components extracted.
Scoring
Assign a score to each claim
The score can be used to sort claims
More effort is expended on claims more likely to be fraudulent or abusive
In the case of AIB data, we can use additional information to test how well PRIDIT did, using the PRIDIT score
A suspicion score was assigned to each claim by an expert
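The scoring step can be sketched end to end: convert each coded variable to RIDITs, extract the first principal component, and score each claim as a weighted sum. This is only an illustration under stated assumptions. The eight claims are made-up data, the function names are hypothetical, and an ordinary correlation-matrix PCA stands in for whatever software and options produced the component matrix in the slides.

```python
import numpy as np

def ridit_matrix(codes):
    """Replace each integer code by its RIDIT (share below minus share above)."""
    F = np.zeros(codes.shape)
    for t in range(codes.shape[1]):
        col = codes[:, t]
        for value in np.unique(col):
            below = np.mean(col < value)
            above = np.mean(col > value)
            F[col == value, t] = below - above
    return F

def pridit_scores(codes):
    """Score claims on the first principal component of their RIDITs."""
    F = ridit_matrix(codes)
    R = np.corrcoef(F, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)   # ascending eigenvalues
    w = eigvecs[:, -1]
    if w.sum() < 0:                        # eigenvector sign is arbitrary
        w = -w
    # Assumes every variable takes at least two values (non-zero std).
    Z = (F - F.mean(axis=0)) / F.std(axis=0)
    return Z @ w, w

# Hypothetical data: 8 claims x 3 indicators, code 1 = more suspicious.
codes = np.array([[1, 1, 1], [1, 1, 2], [1, 2, 1], [2, 2, 2],
                  [2, 2, 2], [2, 1, 2], [1, 1, 1], [2, 2, 2]])
scores, weights = pridit_scores(codes)

# With this sign convention, lower scores mean more code-1 responses.
for idx in np.argsort(scores):
    print(idx, codes[idx], round(float(scores[idx]), 3))
```

Sorting by score gives the ranking described above; in this toy run the claims coded 1 on every indicator come first and those coded 2 on every indicator come last.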
PRIDIT vs. Suspicion Score

[Figure: Suspicion Score vs PRIDIT Score. Horizontal axis: Suspicion Score, 0.00 to 10.00. Vertical axis: PRIDIT Score, (1.50) to 1.00.]
Clustering and Suspicion Score

Report: Mean Suspicion Level

TwoStep Cluster Number   Mean Suspicion Level
1                        .6445
2                        3.3737
Total                    1.9643
Result
There appears to be a strong relationship between the PRIDIT score and the suspicion that a claim is fraudulent or abusive
The clusters resulting from the cluster procedure also appeared to be effective in separating legitimate from fraudulent or abusive claims
Comparison: PRIDIT and Clustering

PRIDIT gives a score, which may be very useful for claims sorting
Clustering assigns claims to classes: they are either in or out of the assigned class
Clustering ignores information about the order of values for categorical variables
Clustering can accommodate both categorical and continuous variables
Comparison
Unordered categorical variables with many values (e.g., injury type):
Clustering has a procedure for measuring dissimilarity for these variables and can use them in clustering
If the values of the variables have no meaningful order, PRIDIT will not help in creating variables to use in Principal Components Analysis