
DATA MINING

Data Mining:
Intelligent methods are applied to extract useful information or patterns from data
Data Mining: A KDD Process:
Data mining is the core of the knowledge discovery in databases (KDD) process.
Steps of a KDD Process
Data Cleaning
Handles noisy, inconsistent, and incomplete data
Missing values
Noisy data: binning, clustering, etc.
Inconsistencies: tools, functional dependencies
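As an illustration of the cleaning step, the sketch below fills missing values with the attribute mean and smooths noisy values by bin means over equal-frequency bins. The function names, the choice of mean imputation, and the sample prices are assumptions made for this example, not part of the original slides.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values (one simple strategy)."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values only
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin medians or bin boundaries follows the same pattern; only the value written back into each bin changes.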

Data Integration
Schema Integration: Entity Identification problem
Redundancy: Correlation Analysis
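Correlation analysis for redundancy can be sketched with the Pearson correlation coefficient: two numeric attributes whose coefficient is close to +1 or -1 carry nearly the same information, so one of them may be redundant after integration. The attribute names and values below are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences.

    Assumes both sequences have non-zero variance; otherwise the division fails.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical attributes coming from two different sources
price_usd = [1, 2, 3, 4]
price_cents_scaled = [2, 4, 6, 8]
print(pearson(price_usd, price_cents_scaled))  # a value of 1.0 (up to rounding) flags a redundant attribute
```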

Data Selection
Select only the task relevant data

Data Transformation
Transform or consolidate data
Smoothing, Normalization, Feature Construction
Data Reduction, Compression
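Normalization, one of the transformations listed above, can be sketched in two common forms: min-max scaling to a target range and z-score standardization. The function names and default range are assumptions for illustration.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values to [new_min, new_max]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation (must be non-zero)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


print(min_max([10, 20, 30]))  # [0.0, 0.5, 1.0]
```

Min-max scaling preserves the shape of the distribution but is sensitive to extreme values; z-scores are often preferred when the minimum and maximum are unknown or dominated by outliers.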

Pattern Evaluation
Interestingness Measures

Knowledge Presentation
Visualization

Data Mining Functionalities:

Descriptive: characterizes general properties of the data
Predictive: performs inference on the data
Mining in parallel, at various granularities

Concept/class description
Association Analysis
Classification and Prediction
Cluster Analysis
Outlier Analysis
Evolution Analysis

Concept/ Class Description:

Data can be associated with Classes / Concepts
Classes: Computers, Printers
Concepts: BigSpenders vs. BudgetSpenders

Class / Concept Description

Classes and Concepts can be summarized in concise and precise terms


Data Characterization
Data Discrimination

Data Characterization:

Summarization of the general characteristics of a target class
Data collected and aggregated
OLAP roll-up operation
Attribute-Oriented Induction
Results: charts, cubes, rules
Example: characteristics of customers
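The roll-up operation mentioned above summarizes data by climbing a concept hierarchy. A minimal sketch, assuming a hypothetical sales table with a city-to-country hierarchy:

```python
from collections import defaultdict

# Hypothetical fact rows: (country, city, sales amount)
sales = [
    ("Canada", "Toronto", 120),
    ("Canada", "Vancouver", 80),
    ("USA", "Chicago", 200),
    ("USA", "New York", 150),
]


def roll_up(rows):
    """Aggregate city-level sales up to the country level by summing amounts."""
    totals = defaultdict(int)
    for country, _city, amount in rows:
        totals[country] += amount
    return dict(totals)


print(roll_up(sales))  # {'Canada': 200, 'USA': 350}
```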

Data Discrimination:

Compares the target class with one or more contrasting classes
Classes may be user specified
Examples:
Products whose sales increased vs. decreased
Regular Shoppers vs. Occasional Shoppers
Output includes comparative measures


Association Analysis:

Discovery of association rules
Form: X ⇒ Y
Multi-dimensional
age(X, "20-29") ⇒ buys(X, "Laptop")
Single-dimensional
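A rule of the form X ⇒ Y is commonly scored by its support and confidence. The slides do not name these measures in this section, so treating them as the relevant ones is an assumption. The sketch below computes both over a small, hypothetical list of transactions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent); assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market-basket data
transactions = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "bag"},
    {"laptop"},
    {"mouse"},
]

print(support(transactions, {"laptop", "mouse"}))  # 0.5
print(confidence(transactions, {"laptop"}, {"mouse"}))
```

Here the rule buys(X, "laptop") ⇒ buys(X, "mouse") holds in two of four transactions, and in two of the three transactions that contain a laptop.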

Classification and Prediction:

Classification
Finds models that describe and differentiate classes or concepts
Predicts the class of objects whose label is unknown
Models are derived from training data
Model forms: rules, decision trees, neural networks, formulae
Preceded by relevance analysis (to eliminate irrelevant attributes)
Prediction
Derived model is used for prediction
Data value prediction
Class label prediction (Classification)
Trend identification
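To make the train-then-predict cycle concrete, here is a minimal sketch of a one-level decision tree (a decision stump) learned from labeled training data. The stump, the single numeric attribute, and the sample values are assumptions chosen for brevity; the slides do not prescribe a particular model.

```python
def train_stump(samples):
    """Learn (threshold, label_below, label_above) with the fewest training errors.

    samples is a list of (value, label) pairs with at least two distinct values.
    """
    values = sorted({v for v, _ in samples})
    labels = sorted({label for _, label in samples})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # candidate split halfway between neighbors
        for below in labels:
            for above in labels:
                errors = sum(
                    1 for v, label in samples
                    if (below if v <= threshold else above) != label
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, below, above)
    return best[1:]


def predict(model, value):
    """Apply the learned rule to an unlabeled value."""
    threshold, below, above = model
    return below if value <= threshold else above


# Hypothetical training data: (monthly spend, class label)
training = [
    (22, "BudgetSpender"), (25, "BudgetSpender"), (31, "BudgetSpender"),
    (45, "BigSpender"), (52, "BigSpender"), (60, "BigSpender"),
]
model = train_stump(training)
print(model)               # (38.0, 'BudgetSpender', 'BigSpender')
print(predict(model, 50))  # BigSpender
```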

Cluster Analysis:

Unsupervised
Class labels are not present in the training data
Objects are grouped to maximize intra-class similarity
and minimize inter-class similarity
Clusters can be organized into a hierarchy of classes
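One way to group unlabeled objects is k-means, shown here in one dimension as a sketch. The choice of k-means, the initial centroids, and the sample points are assumptions for illustration; the slides do not name a specific clustering algorithm.

```python
def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    return clusters


def k_means_1d(points, centroids, iterations=10):
    """Alternate assignment and centroid update for a fixed number of iterations."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)


points = [1, 2, 3, 10, 11, 12]  # illustrative values only
centroids, clusters = k_means_1d(points, centroids=[1, 12])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```

Points inside each resulting cluster are close to one another, while the two clusters are far apart, which is the similarity objective stated above.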

Outlier Analysis:

Objects that do not comply with the general behavior of the data
Noise vs. rare events
Application: fraud detection
Detected with statistical tests
Deviation-based methods
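A simple statistical test for outliers flags values whose z-score exceeds a threshold. The threshold of 2.0 and the transaction amounts below are assumptions for this sketch; a real fraud-detection setting would need a test suited to its data distribution.

```python
import statistics


def z_score_outliers(values, threshold=2.0):
    """Return values lying more than threshold population standard deviations from the mean.

    Assumes the values are not all identical (non-zero standard deviation).
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical card transaction amounts; 250 is far from the usual range
amounts = [20, 22, 19, 21, 20, 23, 18, 250]
print(z_score_outliers(amounts))  # [250]
```

Whether a flagged value is noise to discard or a rare event worth investigating depends on the application.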

Evolution Analysis:

Trend detection
Time series data
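Involves other functionalities (characterization, discrimination, association, classification, clustering) applied to time-related data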
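Trend detection on time series data can be sketched with a moving average that smooths short-term fluctuation, followed by a comparison of the first and last smoothed values. The window size, the trend rule, and the sample series are assumptions for illustration.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive observations."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


def trend(series, window=3):
    """Classify the overall direction from the smoothed series; needs len(series) >= window."""
    smoothed = moving_average(series, window)
    if smoothed[-1] > smoothed[0]:
        return "upward"
    if smoothed[-1] < smoothed[0]:
        return "downward"
    return "flat"


monthly_sales = [10, 12, 11, 14, 15, 17]  # illustrative values only
print(moving_average([1, 2, 3, 4], 2))  # [1.5, 2.5, 3.5]
print(trend(monthly_sales))             # upward
```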