
CS359 Introduction to Data


Course objectives

This course introduces the fundamental concepts of
data mining and knowledge discovery from
It focuses on the discussion and demonstration of
common data mining methods and how data mining
results become useful to businesses and

Grading scheme
Class Standing

10% Assignment
25% Cases
40% Quizzes
25% Long Quiz

Midterm = 70% Class Standing + 30% Midterm Exam

2nd Quarter = 70% 2nd Quarter Class Standing + 30% Final
Final Grade = 40% Midterm + 40% 2nd Quarter + 20% Project
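
As a worked illustration of the grading formulas above, the sketch below computes a final grade from made-up scores. It assumes that the second 40% term of the final grade is the 2nd Quarter grade and that "Final" means the final exam; the function names and sample scores are illustrative only.

```python
# Class-standing weights taken from the grading scheme (scores on a 0-100 scale).
CLASS_STANDING_WEIGHTS = {
    "assignment": 0.10,
    "cases": 0.25,
    "quizzes": 0.40,
    "long_quiz": 0.25,
}


def class_standing(scores):
    """Weighted average of the four class-standing components."""
    return sum(w * scores[name] for name, w in CLASS_STANDING_WEIGHTS.items())


def period_grade(standing, exam):
    """70% class standing + 30% exam (midterm exam or final exam)."""
    return 0.70 * standing + 0.30 * exam


def final_grade(midterm, second_quarter, project):
    """40% midterm + 40% 2nd quarter (assumed) + 20% project."""
    return 0.40 * midterm + 0.40 * second_quarter + 0.20 * project


# Hypothetical student scores for each grading period.
first_half = {"assignment": 90, "cases": 80, "quizzes": 85, "long_quiz": 70}
second_half = {"assignment": 100, "cases": 90, "quizzes": 80, "long_quiz": 80}

midterm = period_grade(class_standing(first_half), exam=75)
second_quarter = period_grade(class_standing(second_half), exam=85)
print(round(final_grade(midterm, second_quarter, project=90), 2))
```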

Attendance will be checked.
No make-up quizzes
Make-up long exam only for excused absence.
Set schedule within a week after the exam date

Late submissions will not be accepted (assignments, cases and project)

Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques,
2nd Edition. Morgan Kaufmann Publishers, Elsevier Inc., California.
Tan, P., Steinbach, M. & Kumar, V. (2006). Introduction to Data
Mining. Addison Wesley.

Software Links
Data Mining Software Links by Dr. Pang-Ning Tan :
RapidMiner : http://rapidi.com/content/view/26/84/lang,en/
Weka : http://www.cs.waikato.ac.nz/ml/weka/

Data Mining Processes and Knowledge Discovery

Define data mining and knowledge discovery in databases
Discuss some business applications of data mining
Identify the elements of the data mining process
Discuss the steps in CRISP-DM

What is Data Mining?

Also known as Knowledge Discovery in Databases: the
nontrivial extraction of implicit, previously unknown
and potentially useful information from databases
(Han et al., 1999)
Involves the use of analysis to detect patterns and
allow predictions (Olson & Shi, 2007)

Data Mining
Exploratory data analysis
Finds its roots along with the development in classical
statistics, artificial intelligence and machine learning
Looks for actionable information, or information that
can be utilized in a concrete way to improve

General Types of Data Mining

Hypothesis Testing
A theory about the relationship between actions and
outcomes is expressed and tested

Knowledge Discovery
Preconceived notion may not be present
Relationships can be identified by looking into the data

Data Mining requires the identification of a


Data Mining Applications

Affinity Positioning: based on the identification of
products that the same customer is likely to want
Cross-selling: knowledge of products that go together
can be used to market the complementary product


Customer Relationship Management: identify customer
value, develop programs to maximize revenue

Credit Card Management

Identify balance surfers, or credit card holders who pay
old balances with a new card
Lift: identify effective market segments
Churn: identify likely customer turnover
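
To make the idea of lift concrete, here is a minimal sketch. It assumes the common marketing definition of lift, the response rate inside a targeted segment divided by the response rate of the whole customer base; the slides do not give a formula, and the customer counts below are invented.

```python
def response_rate(responses):
    """Fraction of customers who responded (1 = responded, 0 = did not)."""
    return sum(responses) / len(responses)


def lift(segment_responses, all_responses):
    """How many times better the segment responds than the base as a whole."""
    return response_rate(segment_responses) / response_rate(all_responses)


# Hypothetical campaign: 1000 customers, 50 of whom responded overall.
# The targeted segment has 200 customers, 30 of whom responded.
segment = [1] * 30 + [0] * 170
everyone = [1] * 50 + [0] * 950

print(f"segment rate: {response_rate(segment):.2%}")
print(f"overall rate: {response_rate(everyone):.2%}")
print(f"lift: {lift(segment, everyone):.1f}")
```

A lift above 1 suggests the segment is a more effective target than a random sample of customers.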


Fraud detection identify fraud claims meriting

Churn: customer turnover or switching carriers

Cancer Cell Detection

Machine Vision
Pattern Recognition

CRISP-DM Process
Cross-Industry Standard Process for Data Mining
Business Understanding
Data Understanding
Data Preparation

Business Understanding
Knowing what the study is for
Identify business task

Data Understanding
Select the related data from many available
databases to correctly describe a given
business task

Identify relevant data for the problem description

Selected variables for the relevant data should be
independent of each other or do not contain
overlapping information
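
One rough way to screen for overlapping variables is to flag pairs with a high correlation. The sketch below assumes Pearson correlation with a 0.9 cutoff as the test, which only detects linear overlap; the variable names, values and threshold are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical customer variables; the bill is a linear function of minutes,
# so the two carry overlapping information.
columns = {
    "monthly_minutes": [100, 200, 300, 400, 500],
    "monthly_bill": [15, 25, 35, 45, 55],
    "age": [45, 23, 36, 52, 31],
}


def redundant_pairs(columns, threshold=0.9):
    """Pairs of variables whose absolute correlation reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]


print(redundant_pairs(columns))
```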

Types of data: geographic, socio-graphic,
transactional or quantitative and qualitative

Data Preparation
Also known as data preprocessing
Clean selected data for better quality
Filter, aggregate and fill in missing values (imputation)
Filter: remove outliers and redundancies
Aggregate: data is reduced to obtain aggregated
Filling-in or Smoothing: missing values are found and
replaced with reasonable values
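
The three cleaning operations above can be sketched on a tiny data set. This is only an illustration: the record layout, the median fill-in, and the "more than 10 times the median" outlier rule are assumptions chosen for the example, not rules given in the course material.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical call records: (record_id, customer, minutes); None = missing.
records = [
    (1, "A", 10), (2, "A", 12), (2, "A", 12),   # record 2 is duplicated
    (3, "B", 8), (4, "B", None), (5, "B", 500),  # a missing value and an outlier
    (6, "C", 9), (7, "C", 11),
]


def deduplicate(rows):
    """Filter: drop redundant rows that repeat an earlier record_id."""
    seen, unique = set(), []
    for row in rows:
        if row[0] not in seen:
            seen.add(row[0])
            unique.append(row)
    return unique


def fill_missing(rows):
    """Filling-in: replace missing minutes with the median of observed values."""
    fill = median(m for _, _, m in rows if m is not None)
    return [(i, c, fill if m is None else m) for i, c, m in rows]


def remove_outliers(rows, factor=10):
    """Filter: drop rows whose minutes exceed factor x the median (assumed rule)."""
    cutoff = factor * median(m for _, _, m in rows)
    return [row for row in rows if row[2] <= cutoff]


def aggregate(rows):
    """Aggregate: reduce the rows to one average value per customer."""
    groups = defaultdict(list)
    for _, customer, minutes in rows:
        groups[customer].append(minutes)
    return {customer: mean(values) for customer, values in groups.items()}


cleaned = remove_outliers(fill_missing(deduplicate(records)))
per_customer = aggregate(cleaned)
print(per_customer)
```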

Data transformation
Uses mathematical formulations to convert
different measurements into a unified numerical scale
Numerical to numerical scales
Shrink or enlarge the data

Categorical to numerical scales

Categorical values can be ordinal (less, moderate, strong)
or nominal (red, yellow, blue)
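
A minimal sketch of these transformations follows. It assumes min-max scaling for the numerical case, integer ranks for ordinal values, and one-hot indicator columns for nominal values; these are common choices, and the slides do not prescribe a specific method.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Numerical to numerical: shrink or enlarge values into [new_min, new_max].

    Assumes the values are not all identical (otherwise the range is zero).
    """
    low, high = min(values), max(values)
    return [new_min + (v - low) * (new_max - new_min) / (high - low) for v in values]


# Ordinal categories have a natural order, so integer ranks preserve it.
ORDINAL_RANKS = {"less": 1, "moderate": 2, "strong": 3}


def encode_ordinal(values):
    """Categorical (ordinal) to numerical: map each level to its rank."""
    return [ORDINAL_RANKS[v] for v in values]


def one_hot(values, categories):
    """Categorical (nominal) to numerical: one 0/1 column per category,
    so that no artificial order is imposed on the categories."""
    return [[1 if v == c else 0 for c in categories] for v in values]


print(min_max_scale([20000, 35000, 50000]))
print(encode_ordinal(["strong", "less", "moderate"]))
print(one_hot(["red", "blue"], ["red", "yellow", "blue"]))
```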

Data mining software is used to generate results for
various situations
Data is divided into:
Training set: used for the development of the model
Test set: used to test the model that is built
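
The split into training and test sets can be sketched as below. The 70/30 proportion, the fixed random seed and the function name are assumptions for illustration; the slides do not specify how the data is divided.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Ten hypothetical record ids.
rows = list(range(10))
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Keeping the test rows out of model development is what makes the test result a fair estimate of how the model behaves on new data.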

Data Modeling Techniques
Association: the relationship of a particular item in a
data transaction to other items in the same transaction
is used to predict patterns
Classification: learning functions that map
each item of the selected data into one of a predefined
set of classes

Clustering: takes ungrouped data and uses automatic
techniques to put this data into groups
Prediction Analysis: discovers the relationship between
the dependent and independent variables
Sequential Pattern Analysis: seeks to find similar
patterns in data transactions over a business period
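
As an illustration of the association technique listed above, the sketch below computes the two standard measures of an association rule, support and confidence, on a handful of invented market-basket transactions. The item names and the rule examined are hypothetical.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


# Rule under test: customers who buy bread also buy milk.
print("support:", support({"bread", "milk"}))
print("confidence:", round(confidence({"bread"}, {"milk"}), 2))
```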

Data interpretation stage
Two things to consider:
How to recognize business value from knowledge
patterns discovered
How to visualize the results to properly interpret

The results are reported to project sponsors
The result is applied to business task or data mining

Knowledge Discovery Process

Data Cleaning
Data Integration
Data Selection
Data Transformation
Data Mining
Pattern Evaluation
Knowledge Presentation

Data Mining System Architecture

User Interface

Pattern Evaluation
Data Mining Engine

Database or Data Warehouse Server

Data cleaning, Integration and Selection





Data Mining on what data?

Relational Databases
Data Warehouses
Transactional Databases
Object-Relational Databases
Temporal, Sequence or Time-Series Database
Spatial Databases and Spatiotemporal Databases

Data Mining - what patterns?

Descriptive characterize the general properties of
Data characterization, Data discrimination, Association,

Predictive: performs inference on the current data in
order to make predictions
Classification and Prediction, Evolution analysis

Are all patterns interesting?

A pattern is interesting if
(1) it is easily understood by humans,
(2) valid on new or test data with some degree of
(3) potentially useful, and
(4) novel.

A pattern is also interesting if it validates a
hypothesis that the user sought to confirm.

Can a data mining system generate all interesting patterns?
This refers to the completeness of a data mining algorithm
It is unrealistic and inefficient for data mining systems
to generate all of the possible patterns.
A focused search which makes use of interestingness
measures should be used to control pattern
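
The difference between generating every possible pattern and running a focused search can be sketched as follows. The example assumes support as the interestingness measure and a level-wise search that only extends itemsets that already meet the threshold; both are illustrative choices, and the transactions are invented.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
items = sorted(set().union(*transactions))


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)


def frequent_itemsets(min_support):
    """Level-wise search: an itemset is extended only if it is itself frequent.

    Returns the frequent itemsets and the number of candidates examined.
    """
    frequent, examined = [], 0
    level = [(item,) for item in items]
    while level:
        examined += len(level)
        level = [c for c in level if support(c) >= min_support]
        frequent.extend(level)
        # Grow each surviving itemset by one item that sorts after its last item,
        # so every larger itemset is produced at most once.
        level = [c + (item,) for c in level for item in items if item > c[-1]]
    return frequent, examined


found, examined = frequent_itemsets(min_support=0.6)
print("all possible itemsets:", 2 ** len(items) - 1)
print("candidates examined:", examined)
print("interesting itemsets:", found)
```

With only four items the saving is small, but the number of possible itemsets doubles with every added item, which is why an exhaustive search quickly becomes unrealistic.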

Case study: Telephone company


What is the business task or data mining objective?

What are the relevant data and their sources?
How was the data prepared? What were the
What was the data mining technique used?
How was the model used to address the business