Understand how to use state-of-the-art DM techniques
Discuss the strengths and weaknesses of such methods
Discuss the use of key DM tools for business intelligence
CS5550 Session 05
Classifiers (e.g. Decision Trees)
Association Rules
Time-Series Models
Bayesian Networks
Principal Components Analysis to plot multidimensional data
Graph-Based Methods to explore multiple relationships
Optimisation
Classification
What sort of data is this? Similar to Clustering but
Fraudulent Financial Reporting: Y = {fraudulent, truthful}
Predicting Delayed Flights: Y = {delayed, on time}
Classification
[Figure: scatter plot, axis labelled x]
Decision Trees
Established method for classifying data
Originated from biology
Easy to understand
Commonly used for
[Figure: plot with a Salary axis]
Decision Trees
Credit Rating example
Decision Trees
Advantages
Transparent and interpretable
Can also perform feature selection
Can model complex (non-linear) relationships
Disadvantages
Risk over-fitting data (but trees can be pruned)
Require lots of data
Cannot model diagonal relationships (splits are always on one predictor, not on combinations)
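The point that each split uses a single predictor can be illustrated with a search for one split. This is a minimal sketch, assuming Gini impurity as the split criterion (the slides do not name one); the function names and the age/salary data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_impurity) for the best
    axis-parallel split of the form 'feature <= threshold'."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        # Every observed value of the feature is a candidate threshold.
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send data both ways
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


# Hypothetical data: (age, salary in thousands) -> credit rating
rows = [(25, 20), (30, 45), (35, 30), (40, 55), (45, 50), (50, 25)]
labels = ["bad", "good", "bad", "good", "good", "bad"]
split = best_split(rows, labels)
print(split)
```

On this made-up data the search picks salary (feature 1) with threshold 30, which separates the two classes cleanly. A full tree would repeat the search inside each branch; a boundary that depends on a combination such as age + salary can only be approximated by many such single-predictor steps.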
K Nearest Neighbour
Find K observations in the data that are similar to the new observation we wish to classify
Requires:
K-Nearest Neighbour
Distance metric, e.g. Euclidean
Weighting function
Neighbour voting mechanism, e.g. Maximum
[Figure: three panels, k=1, k=3 and k=5, showing a new observation (o) among + and - neighbours]
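The three ingredients listed above can be put together in a few lines. This is a minimal sketch, assuming Euclidean distance, equal weights for all neighbours and a simple majority vote; the training points and the query are made up for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest neighbours.

    `train` is a list of (point, label) pairs; all neighbours get equal
    weight (an assumption, the slides leave the weighting function open).
    """
    neighbours = sorted(train, key=lambda pl: euclidean(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training data with two classes, "+" and "-"
train = [
    ((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.0), "+"),
    ((3.2, 3.0), "-"), ((5.0, 5.0), "-"), ((6.0, 5.5), "-"),
]
query = (3.0, 2.6)

for k in (1, 3, 5):
    print(k, knn_classify(train, query, k))
```

With this data the single nearest neighbour is a "-", but for k=3 and k=5 the "+" points outvote it, so the predicted class depends on the choice of k. The full sort over all training points also shows why the method slows down as the data grows.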
K-Nearest Neighbour
Advantages
Disadvantages
Slow when there is a large number of data points
Needs lots of data when there are many predictors
Other Classifiers
Linear Classifiers
Artificial Neural Network Classifiers
Support Vector Machines
Bayesian Classifiers
Association Rules
What data goes with what
Large amount of basket data
Association Rules
Example:
Rule 1. If the quality of the management is medium, then the company may have a profit or a loss (C3, C4).
Rule 2. If the quality of the management is (at least) high and the number of employees is similar to 700, then the company makes a profit (C1).
Rule 3. If the quality of the management is (at most) low, then the company has a loss (C5, C6).
Rule 4. If the number of employees is similar to 420 and the localization is B, then the company has a loss (C2).
Association Rules
Advantages
Disadvantages
Profusion of rules generated
Ignores rare (but potentially interesting) combinations
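Both disadvantages follow from how rules are usually scored. This is a minimal sketch, assuming the standard support and confidence measures (the slides do not name them) and a hypothetical minimum-support threshold of 0.3; the baskets are invented.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimate of P(consequent | antecedent) taken from the baskets."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


# Hypothetical basket data
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "caviar", "champagne"},
]

min_support = 0.3  # assumed threshold, chosen for illustration only

common = support(baskets, {"bread", "milk"})
rare = support(baskets, {"caviar", "champagne"})
rare_rule_kept = rare >= min_support

print(common, confidence(baskets, {"bread"}, {"milk"}))
print(rare, confidence(baskets, {"caviar"}, {"champagne"}), rare_rule_kept)
```

The rule caviar -> champagne holds in every basket that contains caviar, yet it appears in only one basket in five, so the support threshold discards it. Lowering the threshold keeps such rules but increases the number of rules generated.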
Neural Networks
[Figure: network diagram with an Input Layer, a Hidden Layer and an Output Layer]
Neural Networks
Neural Networks
Advantages
Disadvantages
Suffer very badly from over-fitting
Need lots of data
Do not select features automatically
Black-box model
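The input, hidden and output layers of such a network can be sketched as a forward pass. This is a minimal illustration, assuming fully connected layers with a sigmoid activation (the slides do not specify one); all weights and biases are arbitrary made-up values, not trained ones.

```python
import math


def sigmoid(z):
    """Logistic activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: each row of `weights` feeds one unit."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, w_hidden, b_hidden)
    return layer(hidden, w_out, b_out)


# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output
w_hidden = [[0.5, -0.4], [0.3, 0.8]]
b_hidden = [0.0, -0.1]
w_out = [[1.2, -0.7]]
b_out = [0.05]

output = forward([1.0, 0.5], w_hidden, b_hidden, w_out, b_out)
print(output)
```

The prediction is determined entirely by the numbers in the weight lists, which carry no obvious meaning on their own. That is one way to read the "black-box" label above.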
Time-Series Models
Predicting the Stock-Market
Long been the goal of:
Time-Series Models
Bayesian Networks
Overcome the black-box nature of NNs
Model data using probabilities and graphs
No hidden layers or weights
Model a joint distribution, so the probability of any event is calculable
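The claim that the probability of any event can be calculated from the joint distribution can be made concrete with a tiny network. This is a minimal sketch, assuming a hypothetical three-node graph (Rain -> WetGrass <- Sprinkler) with invented probabilities; it computes answers by brute-force enumeration, which is only practical for very small networks.

```python
from itertools import product

# Hypothetical network: Rain -> WetGrass <- Sprinkler (all numbers invented)
P_RAIN = 0.2
P_SPRINKLER = 0.1
P_WET = {  # P(wet | rain, sprinkler)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) as the product of each node's probability
    given its parents in the graph."""
    p = P_RAIN if rain else 1 - P_RAIN
    p *= P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
    p_wet = P_WET[(rain, sprinkler)]
    return p * (p_wet if wet else 1 - p_wet)


def probability(event):
    """Sum the joint over every assignment for which `event` is true."""
    return sum(
        joint(r, s, w)
        for r, s, w in product([True, False], repeat=3)
        if event(r, s, w)
    )


p_wet = probability(lambda r, s, w: w)
p_rain_given_wet = probability(lambda r, s, w: r and w) / p_wet
print(p_wet, p_rain_given_wet)
```

Every quantity here is a readable probability attached to a node in the graph, in contrast to the weights of a neural network.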
Bayesian Networks
Bayesian Networks
Bayesian Networks
Advantages?
Disadvantages?
Optimisation
For searching through huge numbers of possible solutions:
Scheduling processes
Optimisation
Well known Techniques:
Greedy Searches
Hill Climb
Simulated Annealing
Genetic Algorithms
Gradient Descent
Optimisation
For example: Travelling Salesman Problem
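One of the techniques named earlier, hill climbing, can be sketched on a small Travelling Salesman instance. This is a minimal sketch, assuming a segment-reversal move as the neighbourhood and a fixed iteration budget; the city coordinates, the function names and the seed are made up for illustration.

```python
import math
import random


def tour_length(tour, cities):
    """Length of the closed tour that visits `cities` in the order `tour`."""
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def hill_climb(cities, iterations=2000, seed=0):
    """Repeatedly reverse a random segment of the tour and keep the
    change only if it makes the tour strictly shorter."""
    rng = random.Random(seed)
    tour = list(range(len(cities)))
    best = tour_length(tour, cities)
    for _ in range(iterations):
        i, j = sorted(rng.sample(range(len(cities)), 2))
        candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
        length = tour_length(candidate, cities)
        if length < best:  # accept improvements only
            tour, best = candidate, length
    return tour, best


# Hypothetical city coordinates
cities = [(0, 0), (3, 4), (1, 0), (3, 0), (0, 4), (1, 4)]
start_length = tour_length(list(range(len(cities))), cities)
tour, length = hill_climb(cities)
print(start_length, length, tour)
```

Because only improving moves are accepted, the search can stall in a local optimum. Simulated annealing addresses this by occasionally accepting a worse tour, with a probability that shrinks as the search proceeds.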
Business Intelligence
Data Integration + Data Mining + Human Expertise => Business Intelligence:
Session Summary
This session has examined:
Advanced Data Mining Techniques with examples
Advantages and Disadvantages