
CS5550 Data Management and Business Intelligence

Session 7: Data Mining (II) Advanced DM

Session Learning Outcomes


The learning outcomes for this session are that you can:

Understand how to use state-of-the-art DM techniques
Discuss the strengths and weaknesses of such methods
Discuss the use of key DM tools for business intelligence

CS5550 Session 05

Slide 2

Recap on Data Mining


What it is
  Definition
  IDA, AI, KDD, etc.
  Data to knowledge

Some typical tools
  Correlation
  Regression
  Clustering
  Visualisation



More Advanced Techniques


Classifiers (e.g. Decision Trees)
Association Rules
Time-Series Models
Bayesian Networks
Principal Components Analysis to plot multidimensional data
Graph Based Methods to explore multiple relationships
Optimisation

Classification
What sort of data is this?
Similar to Clustering, but this is Supervised Learning: we have sample classes to learn from:

  Fraudulent Financial Reporting: Y = {fraudulent, truthful}
  Predicting Delayed Flights: Y = {delayed, on time}

Classification

Supervised method, unlike clustering

[Figure: scatter plot of the data; horizontal axis x from -0.25 to 0.2, vertical axis from -0.3 to 0.6]

Decision Trees

Established method for classifying data
Originated from biology
Easy to understand
Commonly used for:
  Fraud detection
  Credit Rating

[Figure: scatter plot of Salary against Age showing Class 0 and Class 1 points, plus an unclassified point marked "?"]


Decision Trees
Credit Rating example


Decision Trees
Advantages

Transparent & Interpretable
Can also perform Feature Selection
Can model complex relationships (non-linear)

Disadvantages

Risk over-fitting data (but can prune trees)
Require lots of data
Cannot model diagonal relationships (splits always on one predictor, not combinations)
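
The point that a split is always on one predictor can be illustrated with a minimal sketch of how a single tree node might pick its test. This assumes Gini impurity as the split criterion (the slides do not name one), and the Age/Salary rows are made up for illustration; the function names `gini` and `best_split` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all one class)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every 'feature <= threshold' test on one predictor at a time and
    return the best as (weighted impurity, feature index, threshold)."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send data both ways
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

# Illustrative data only: (age, salary in thousands) and a class label
rows = [(25, 20), (30, 25), (35, 50), (45, 30), (50, 55), (60, 45)]
labels = [0, 0, 1, 0, 1, 1]
impurity, feature, threshold = best_split(rows, labels)
```

On this toy data the chosen test is on salary alone (feature index 1, threshold 30), which separates the two classes completely. A boundary that depends on a combination of age and salary would need several such axis-aligned splits to approximate.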

K Nearest Neighbour

Find K observations in the data that are similar to the new observation we wish to classify
Requires:

  Distance Metric
  Voting Mechanism
  Weighting Function


K-Nearest Neighbour

Distance Metric, e.g. Euclidean
Weighting function
Neighbour voting mechanism, e.g. Maximum

[Figure: three panels for k=1, k=3 and k=5, each showing a new observation "o" among neighbours labelled "+" and "-"]
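
As a rough sketch of the three ingredients, the snippet below uses Euclidean distance, gives every neighbour equal weight (an assumption, since the slides do not fix a weighting function) and takes the most common class among the k nearest as the vote. The training points and the name `knn_classify` are made up for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k):
    """train is a list of (point, label) pairs. Returns the most common
    label among the k training points closest to query (unweighted vote)."""
    neighbours = sorted(train, key=lambda pl: euclidean(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Illustrative data only
train = [((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.0), "+"),
         ((3.2, 3.0), "-"), ((5.0, 4.5), "-"), ((6.0, 5.5), "-")]
query = (3.0, 3.0)
predictions = {k: knn_classify(train, query, k) for k in (1, 3, 5)}
```

With these points the single nearest neighbour is a "-", but for k=3 and k=5 the "+" class wins the vote, so the choice of k changes the answer. Every call sorts the whole training set, which is why the method slows down as the number of datapoints grows.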

K-Nearest Neighbour
Advantages

Simplicity
Few assumptions about data

Disadvantages

Slow when large number of datapoints
Need lots of data when lots of predictors

Other Classifiers

Linear Classifiers
Artificial Neural Network Classifiers
Support Vector Machines
Bayesian Classifiers


Association Rules
What data goes with what
Large amount of basket data
  e.g. Supermarket purchases

Looks for associations between items
Builds If-Then Rules
  e.g. Nappies => Beer

Uses notion of support and confidence
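
Support and confidence can be made concrete with a small sketch. It uses the usual definitions: support is the fraction of baskets containing all the items in a rule, and confidence is the fraction of baskets containing the "if" part that also contain the "then" part. The five baskets below are invented for illustration.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in items."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent: support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

# Illustrative data only
baskets = [
    {"nappies", "beer", "crisps"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"milk", "bread"},
]
rule_support = support(baskets, {"nappies", "beer"})          # 2 of 5 baskets
rule_confidence = confidence(baskets, {"nappies"}, {"beer"})  # 2 of the 3 nappies baskets
```

A rule miner would typically keep only rules whose support and confidence exceed chosen thresholds, which is also why rare combinations tend to be dropped.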



Association Rules
Example:

Rule 1. If the quality of the management is medium, then the company may have a profit or a loss (C3, C4).
Rule 2. If the quality of the management is (at least) high and the number of employees is similar to 700, then the company makes a profit (C1).
Rule 3. If the quality of the management is (at most) low, then the company has a loss (C5, C6).
Rule 4. If the number of employees is similar to 420 and the localization is B, then the company has a loss (C2).

Association Rules
Advantages

Disadvantages
Profusion of rules generated
Ignores rare (but potentially interesting) combinations


Neural Networks

Map inputs to outputs using weights
Back-propagation algorithm to learn weights

[Figure: network diagram with an Input Layer (inputs Ii), a Hidden Layer and an Output Layer]
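
A minimal sketch of both ideas follows: a forward pass that maps inputs to an output through weighted sums, and one back-propagation update that nudges the weights to reduce the error. It assumes sigmoid units, a squared-error loss and no bias terms, none of which are specified in the slides; the starting weights and the single training example are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Return (hidden activations, output) for one input vector."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """One gradient-descent update of all weights for E = 0.5 * (output - target)^2."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output unit
    delta_out = (output - target) * output * (1.0 - output)
    # Error signal passed back to each hidden unit through its outgoing weight
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    new_w_out = [w_out[j] - lr * delta_out * hidden[j] for j in range(len(hidden))]
    new_w_hidden = [[w_hidden[j][i] - lr * delta_hidden[j] * x[i]
                     for i in range(len(x))]
                    for j in range(len(hidden))]
    return new_w_hidden, new_w_out

# Arbitrary example: two inputs, two hidden units, one output
x, target = [1.0, 0.0], 1.0
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.2, -0.1]
_, before = forward(x, w_hidden, w_out)
for _ in range(100):
    w_hidden, w_out = backprop_step(x, target, w_hidden, w_out)
_, after = forward(x, w_hidden, w_out)
```

After the updates the output sits closer to the target than it did with the starting weights. Fitting a single example this way says nothing about how the network behaves on new data.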

Neural Networks

Forecasting Markets
Predicting Stock collapse
Classifying exceptional behaviour in customers
  Fraudulent credit card usage
  Online monitoring


Neural Networks
Advantages

Can model complex relationships
Versatile

Disadvantages

Suffers very badly from Over-fitting
Need lots of data
Do not select features automatically
Black Box model

Time-Series Models
Predicting the Stock-Market
Long been the goal of:

mathematicians
statisticians
computer scientists
philosophers

[Image: Pi, Harvest Filmworks, 1999]



Time-Series Models

Statistical Models
AI Models such as Neural Networks


Bayesian Networks

Overcome black box nature of NNs
Model data using probabilities and graphs
No hidden layers or weights
Models a joint distribution: the probability of any event is calculable
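
To show what "the probability of any event is calculable" means in practice, here is a sketch of a three-node network in which Rain influences Sprinkler and both influence Wet grass. This textbook-style structure and all the probabilities are illustrative assumptions, not taken from the slides. The joint distribution is the product of each node's conditional probability given its parents, and any query is answered by summing joint probabilities.

```python
from itertools import product

# Illustrative probabilities only
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: {True: 0.01, False: 0.99},    # outer key: rain
               False: {True: 0.4, False: 0.6}}
P_WET = {(True, True): 0.99, (True, False): 0.9,   # key: (sprinkler, rain)
         (False, True): 0.8, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) as a product of the three local tables."""
    pw = P_WET[(sprinkler, rain)]
    return P_RAIN[rain] * P_SPRINKLER[rain][sprinkler] * (pw if wet else 1.0 - pw)

def probability(event):
    """Sum the joint over every assignment for which event(...) is true."""
    return sum(joint(r, s, w)
               for r, s, w in product([True, False], repeat=3)
               if event(r, s, w))

p_wet = probability(lambda r, s, w: w)
p_rain_given_wet = probability(lambda r, s, w: r and w) / p_wet
```

Enumerating every assignment is only practical for small networks, since the number of assignments doubles with each extra binary variable.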


Bayesian Networks
Advantages?

Disadvantages?


Optimisation
For searching through huge numbers of possible solutions:

Scheduling processes
  Manufacturing
  Deliveries

Bin Packing of objects
  Efficient loading of crates prior to shipping

Routing for efficient delivery


Optimisation
Well known Techniques:

Greedy Searches
Hill Climb
Simulated Annealing
Genetic Algorithms
Gradient Descent
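
As one example from this list, a hill climb can be sketched in a few lines: start somewhere, look at the neighbouring solutions, move to the best one, and stop when no neighbour is better. The objective and neighbourhood below are toy assumptions chosen so the search has a single peak.

```python
def hill_climb(score, start, neighbours, max_steps=1000):
    """Repeatedly move to the best-scoring neighbour; stop when no
    neighbour improves on the current solution (a local optimum)."""
    current = start
    for _ in range(max_steps):
        best = max(neighbours(current), key=score)
        if score(best) <= score(current):
            break
        current = best
    return current

# Toy problem: integer solutions, single peak at x = 7
def score(x):
    return -(x - 7) ** 2

def neighbours(x):
    return [x - 1, x + 1]

result = hill_climb(score, start=0, neighbours=neighbours)
```

On an objective with several peaks the same code stops at whichever peak is nearest the start. Simulated annealing addresses this by occasionally accepting a worse neighbour.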


Optimisation
For example: Travelling Salesman Problem

Famous NP-hard problem
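
The sketch below contrasts an exact search with a greedy heuristic on a made-up set of six cities. Trying every ordering is guaranteed to find the shortest tour but needs (n-1)! orderings for n cities, which is why heuristics are used in practice. The nearest-neighbour rule is one simple greedy choice, not the only one, and the function names are hypothetical.

```python
import math
from itertools import permutations

def tour_length(cities, order):
    """Length of the closed tour that visits cities in the given order."""
    return sum(math.dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def brute_force_tour(cities):
    """Exact answer by trying every ordering that starts at city 0."""
    rest = range(1, len(cities))
    return min(((0,) + p for p in permutations(rest)),
               key=lambda order: tour_length(cities, order))

def nearest_neighbour_tour(cities):
    """Greedy heuristic: always travel to the closest unvisited city."""
    order, unvisited = [0], set(range(1, len(cities)))
    while unvisited:
        nxt = min(unvisited, key=lambda c: math.dist(cities[order[-1]], cities[c]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

# Illustrative coordinates only
cities = [(0, 0), (1, 5), (5, 2), (6, 6), (8, 3), (2, 1)]
exact = brute_force_tour(cities)
greedy = nearest_neighbour_tour(cities)
```

The greedy tour is found almost instantly but can be longer than the exact one. With only six cities the brute-force search checks 120 orderings, and each extra city multiplies that number.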



Travelling Salesman Problem

Optimisation: Bin Packing


Trucks with capacity of 10
How few are required to store objects of size: {3, 6, 2, 1, 5, 7, 2, 4, 1, 9}?


Optimisation: Bin Packing


Search techniques for finding the best allocation for objects within fixed size containers
Potential Heuristic Approaches:

  First Fit
  Next Fit
  Best Fit
  Worst Fit
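
Two of these heuristics can be sketched on the truck example (capacity 10, object sizes 3, 6, 2, 1, 5, 7, 2, 4, 1, 9). The function names are hypothetical, and sorting the objects largest-first before packing is an extra variant that the slides do not list.

```python
def first_fit(sizes, capacity):
    """Put each object into the first bin that still has room for it,
    opening a new bin only when none does."""
    bins = []
    for size in sizes:
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            bins.append([size])
    return bins

def next_fit(sizes, capacity):
    """Only ever consider the most recently opened bin."""
    bins = []
    for size in sizes:
        if bins and sum(bins[-1]) + size <= capacity:
            bins[-1].append(size)
        else:
            bins.append([size])
    return bins

sizes = [3, 6, 2, 1, 5, 7, 2, 4, 1, 9]
ff = first_fit(sizes, 10)
nf = next_fit(sizes, 10)
ffd = first_fit(sorted(sizes, reverse=True), 10)  # largest objects first
```

In the order given, both heuristics need five trucks. Packing the largest objects first brings First Fit down to four, which cannot be beaten here because the sizes add up to 40.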


Optimisation: Bin Packing


Also 2D and 3D approaches


Business Intelligence
Data Integration + Data Mining + Human Expertise => Business Intelligence:

Improved Decision Making
Quicker Response Times
Better Broadcasting / Marketing


Weaknesses of Data Mining


Data Quality
Spurious Correlations
Over-fitting
Black Box Modelling
Over-reliance: slave to the data
Can't see the wood for the trees

Session Summary
This session has examined:
  Advanced Data Mining Techniques with examples
  Advantages and Disadvantages


Next Session: Guest Lecture

Case Study of the application of BI
