
Data Mining for Business Intelligence

Lecture 1: Introduction to Data Mining
MIS 545
LECTURE OUTLINE:
Syllabus
Data Mining Overview
Definition
How does it relate to knowledge extraction and intelligence?
Origins of Data Mining
Tasks
Predictive vs. descriptive
Challenges

Steps in Data Mining


What is Data Mining?
Data mining is the process of discovering knowledge in
large data repositories
Many other definitions:
Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
Exploration & analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns
Misnomer?
Information Hierarchy
Intelligence

Knowledge

Information

Data
Data Mining Applications
Lots of data being collected and warehoused
Web data, e-commerce
Social networks
Purchases at department/grocery stores
Bank/credit card transactions
Government agencies

Computers have become cheaper and more powerful


Source: Davenport and Harris, Competing on Analytics, HBS Press
What is (not) Data Mining? (Extraction vs. Analytics)
What is not Data Mining?
Look up a phone number in a phone directory
Obtain the list of loans that went into default in the last 12 months
What is Data Mining?
Certain names are more prevalent in certain US locations (O'Brien, O'Reilly in the Boston area)
Predict whether the bank should approve a loan application or not (whether the loan will go into default or not)
Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional techniques may be unsuitable due to
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics, Machine Learning/AI/Pattern Recognition, and Database systems]
Data Mining Tasks
Predictive Tasks
Use some variables
(explanatory/independent/input variables) to
predict unknown or future values of a particular
variable (target/dependent variable)

Descriptive Tasks
Find general properties that describe the data
Data Mining Tasks
Classification [Predictive]
Regression [Predictive]
Visualization [Descriptive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Graph Mining / Social Networks [Descriptive]
Classification: Example
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers
likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided
otherwise. This {buy, not buy} binary decision forms the class
attribute.
Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
Use this information as input attributes to learn a classifier model.
Use the classifier to predict the class attribute value of new customers from
their known input attributes.
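The direct-marketing approach above can be sketched as follows. This is a minimal illustration, not the lecture's method: it assumes a k-nearest-neighbor classifier, and the attributes (age, monthly spend), the records, and the names history and predict are all hypothetical.

```python
from collections import Counter
import math

# Hypothetical records from the earlier, similar product:
# (age, monthly_spend) -> class attribute {buy, not buy}
history = [
    ((25, 40.0), "buy"),
    ((30, 55.0), "buy"),
    ((28, 60.0), "buy"),
    ((52, 15.0), "not buy"),
    ((47, 20.0), "not buy"),
    ((60, 10.0), "not buy"),
]

def predict(record, labeled, k=3):
    """Return the majority class among the k nearest labeled records."""
    nearest = sorted(labeled, key=lambda pair: math.dist(pair[0], record))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# New customers whose input attributes are known but whose class is not.
new_customers = [(27, 50.0), (55, 12.0)]
predictions = [predict(c, history) for c in new_customers]
print(predictions)
```

Only customers predicted as "buy" would be targeted, which is where the reduction in mailing cost comes from. In practice the attributes would be rescaled first so that no single one dominates the distance.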
Classification: Example
Customer Churn/Attrition:
Goal: To predict whether a customer is likely to be lost to a
competitor.
Approach:
Use detailed transaction records of past and present customers to derive
attributes.
How often the customer calls, where they call, what time of day they call
most, their financial status, marital status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.
Regression/Prediction: Example
Predict the value of a continuous-valued variable
based on the values of other variables, assuming a linear
or nonlinear model of dependency.
Widely studied in statistics, econometrics, and neural
networks.
Examples:
Predicting sales amounts of a new product based on advertising
expenditure.
Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
Time series prediction of stock market indices (forecasting).
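The sales-versus-advertising example can be sketched with a one-variable linear model fitted by ordinary least squares. The figures below are made up and deliberately noise-free so the fit is easy to check by hand; the names ad_spend, sales, and fit_line are illustrative only.

```python
# Hypothetical data, both in thousands of dollars (noise-free: sales = 2 * spend + 1).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(ad_spend, sales)

# Predict sales for an advertising budget not seen in the data.
forecast = intercept + slope * 6.0
print(slope, intercept, forecast)
```

Real data would scatter around the line, and a nonlinear model of dependency would need a different fitting procedure.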
Clustering: Example
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical
and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by comparing the buying patterns of
customers in the same cluster vs. those in different clusters.
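A rough sketch of the "find clusters of similar customers" step, assuming k-means as the clustering technique (the slide does not name one). The two attributes, the customer values, and the choice of starting centroids are hypothetical.

```python
import math

# Hypothetical customers: (annual_income_in_thousands, store_visits_per_month)
customers = [(20, 2), (22, 3), (21, 2), (80, 10), (82, 12), (79, 11)]

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep the old one if empty.
        centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    labels = [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]
    return labels, centroids

# Start from two of the customers as initial centroids (an arbitrary choice).
labels, centroids = kmeans(customers, centroids=[customers[0], customers[3]])
print(labels)
```

Each resulting label identifies a market segment. Buying patterns within and across segments can then be compared to judge the clustering quality.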
Clustering: Example
Document Clustering:
Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
Approach: Identify frequently occurring terms in each
document. Form a similarity measure based on the frequencies
of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a
new document or search terms to clustered documents.
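One possible similarity measure based on term frequencies is cosine similarity. The sketch below assumes that measure and whitespace tokenization; the three documents are made up for illustration.

```python
from collections import Counter
import math

# Hypothetical documents.
docs = {
    "d1": "loan default bank loan",
    "d2": "bank loan approval",
    "d3": "wind temperature humidity",
}

def term_frequencies(text):
    """Count how often each term occurs in a document."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

tf = {name: term_frequencies(text) for name, text in docs.items()}
sim_12 = cosine_similarity(tf["d1"], tf["d2"])  # documents sharing "loan" and "bank"
sim_13 = cosine_similarity(tf["d1"], tf["d3"])  # documents with no shared terms
print(sim_12, sim_13)
```

A clustering algorithm can then group documents whose pairwise similarity is high, so d1 and d2 would land together and d3 apart.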
Association Rule Mining: Example
Given a set of records, each of which contains some
number of items from a given collection,
produce dependency rules that predict the occurrence of an
item based on the occurrence of other items.
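A dependency rule such as {diapers} -> {beer} is commonly scored by its support and confidence. The slide does not name these measures, so treat them as an assumption here. The transactions below are hypothetical.

```python
# Hypothetical market-basket records: each record is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    return sum(itemset <= r for r in records) / len(records)

def confidence(antecedent, consequent, records):
    """Of the records containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, records) / support(antecedent, records)

rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

Here three of the five records contain both items, and three of the four records with diapers also contain beer. A rule is usually kept only when both measures exceed chosen thresholds.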
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
CRISP-DM
Cross Industry Standard Process for Data Mining
Steps in Data Mining
1. Develop an understanding of the purpose of the data mining project
2. Obtain the data set to be used in the analysis
Use random sampling from a large database to capture records
While data mining deals with very large databases, the analysis usually
requires only thousands or tens of thousands of records
3. Explore, clean, and preprocess the data
This involves verifying that the data are in reasonable condition
How should missing data be handled?
Are the values in a reasonable range, given what you would expect for each variable?
Are there obvious outliers?
The data are reviewed graphically - for example, a matrix of scatter plots showing the
relationship of each variable with each other variable
4. Reduce the data, if necessary
Where supervised training is involved, separate the data into training, validation, and test data sets
Eliminate unneeded variables
Transform variables
Create new variables
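Steps 2 and 4 above (random sampling, then partitioning for supervised training) can be sketched as below. The 60/20/20 split, the fixed seed, and the function name sample_and_partition are assumptions made for illustration, not values from the lecture.

```python
import random

def sample_and_partition(records, sample_size, fractions=(0.6, 0.2, 0.2), seed=42):
    """Draw a random sample, then split it into training, validation, and test sets."""
    rng = random.Random(seed)                  # fixed seed so the split is reproducible
    sample = rng.sample(records, sample_size)  # sampling without replacement
    n_train = round(fractions[0] * sample_size)
    n_valid = round(fractions[1] * sample_size)
    train = sample[:n_train]
    valid = sample[n_train:n_train + n_valid]
    test = sample[n_train + n_valid:]          # whatever remains
    return train, valid, test

# Stand-in for a very large database: record IDs only.
database = list(range(100_000))
train, valid, test = sample_and_partition(database, sample_size=10_000)
print(len(train), len(valid), len(test))
```

The three sets do not overlap. That separation is what allows the test set to give an honest error estimate after the validation set has been used for tuning.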
Steps in Data Mining cont.
5. Determine the data mining task
classification, prediction, clustering, etc.
6. Choose the data mining techniques to be used
Decision trees, Naive Bayes, Hierarchical Clustering, etc.
7. Use algorithms to perform the task
This is typically an iterative process
Choosing different variables or settings within the algorithm
8. Interpret the results of the algorithms
Each algorithm may also be tested on the validation data for tuning purposes
The validation data thereby becomes part of the fitting process!
The validation error is therefore likely to underestimate the error of the
finally chosen model in deployment
9. Deploy the model in the real world
For example, the model might be applied to a purchased list of possible
customers
The action might be "include in the mailing" if the predicted amount of purchase is
> $10
Review Questions
What is the difference between data and intelligence?

Is querying for information on the Internet considered data mining? What
constitutes a data mining activity?

What is the difference between predictive and descriptive data mining
activities?

What are the practical uses of data mining techniques: classification,
clustering, and association rule mining?
