
Data Mining and

Business Intelligence
1st Lecture

Iraklis Varlamis
1
About the course

• Lectures: Monday 5:00-7:30 (plus visits to the lab on the 2nd floor)
• Office: 5.1
• Email: varlamis@hua.gr
• Eclass: http://eclass.hua.gr/courses/DIT161/

2
Reading

• Tan, Steinbach & Kumar. Introduction to data mining, Addison-Wesley.
http://www-users.cs.umn.edu/~kumar/dmbook/index.php

• Jiawei Han, Micheline Kamber, and Jian Pei. Data mining – Concepts and techniques (3rd edition)
These presentations are based on the book and on the presentations of Jiawei Han
http://www.cs.uiuc.edu/~hanj/

3
Course Outline

Organize and Manage Data


• 1st: Basic concepts of DM & BI. Solutions and Architectures. The
stages of DM process. Application examples.
• 2nd: The process of Data Mining. Classification techniques and
algorithms. Evaluation techniques and metrics.
• 3rd: Data preparation. Dimensionality reduction. Regression.
Core Data Mining Techniques
• 4th: Classification techniques. Practical application of classification
techniques (Lab)
...
• 9th: Clustering techniques and algorithms. Evaluation metrics.
Cluster description.
• 10th: Association rules extraction. Techniques and algorithms.
• 11th: Data warehouses. Data quality. Cubes and multidimensional
data analysis. Concept hierarchies and data projection in
dimensions.

4
Course Outline

Case studies
• 5th: Introduction to Graph/Network Mining
• 6th: Measuring networks and the random graph model.
• 7th: A graph processing library
• 8th: Social Recommender systems
...
• 12th: Presentation of assignments

5
Grading system

• What is graded:

– Final written exam: 60%
– Group assignment: 40% [compulsory]
• Interim report
• Final presentation and documentation

• Written exams: open book, lecture notes allowed

6
Definitions

7
Definition and concepts

• Business Intelligence (BI) refers to applications and technologies that access the appropriate data and information in order to make the correct business decision at the correct moment.

• Two types of BI Systems:


– Those that provide data analysis tools
• Multidimensional data analysis (or online analytical
processing)
• Data mining
• Decision support systems
– Those that provide information in structured format
• Dashboards

8
9
Multidimensional Data Analysis

• Multidimensional analysis provides users with an excellent view of what is happening or what has happened.
• It allows users to analyze data in such a way that they can quickly answer business questions.
• To accomplish this, multidimensional analysis tools allow users to “slice and dice” the data in any desired way.
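
A minimal sketch of what “slice and dice” means, in plain Python. The sales records, the dimension names (region, product, quarter) and the helper functions are made up for illustration; a real BI tool would run equivalent operations against a data cube.

```python
from collections import defaultdict

# Hypothetical fact table: one row per (region, product, quarter) with a sales measure.
SALES = [
    {"region": "North", "product": "Laptop", "quarter": "Q1", "amount": 120},
    {"region": "North", "product": "Phone", "quarter": "Q1", "amount": 80},
    {"region": "South", "product": "Laptop", "quarter": "Q1", "amount": 60},
    {"region": "South", "product": "Phone", "quarter": "Q2", "amount": 90},
    {"region": "North", "product": "Laptop", "quarter": "Q2", "amount": 150},
]


def slice_cube(rows, dimension, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[dimension] == value]


def dice_cube(rows, **allowed):
    """Dice: keep rows whose dimensions fall inside the allowed value sets."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]


def roll_up(rows, dimension):
    """Aggregate the measure along one dimension."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[dimension]] += r["amount"]
    return dict(totals)


q1 = slice_cube(SALES, "quarter", "Q1")
north_laptops = dice_cube(SALES, region={"North"}, product={"Laptop"})
sales_by_region_q1 = roll_up(q1, "region")
sales_by_product = roll_up(SALES, "product")
```

Each business question ("how did each region do in Q1?") becomes a combination of a filter and an aggregation over the same records.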

10
Data mining

• Searching for valuable business information in a large database or data warehouse
• Data mining performs two basic operations:
– Predicting trends and behaviors
– Identifying previously unknown patterns and relationships
• Data mining: The process that combines techniques from statistics, artificial intelligence and machine learning in order to process data and extract implicit, non-obvious, interesting and potentially useful knowledge that can support decision making

11
Decision support systems

• Decision support systems


• DSS capabilities
– Sensitivity analysis
– What-if analysis
– Goal-seeking analysis
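
The three capabilities can be illustrated on a toy model. The sketch below assumes a made-up profit formula and made-up base figures; the function names are hypothetical and only show the idea behind each kind of analysis.

```python
def profit(units, price, unit_cost, fixed_cost):
    """Toy business model: profit as a function of four inputs."""
    return units * (price - unit_cost) - fixed_cost


# Assumed base scenario (illustrative numbers only).
BASE = {"units": 1000, "price": 20.0, "unit_cost": 12.0, "fixed_cost": 5000.0}


def what_if(**changes):
    """What-if analysis: re-evaluate the model under changed assumptions."""
    return profit(**{**BASE, **changes})


def sensitivity(name, pct=0.10):
    """Sensitivity analysis: change in profit when one input rises by pct."""
    bumped = {name: BASE[name] * (1 + pct)}
    return what_if(**bumped) - profit(**BASE)


def goal_seek(target, low=0.0, high=1_000_000.0, tol=1e-6):
    """Goal-seeking analysis: find the units needed for a target profit (bisection)."""
    while high - low > tol:
        mid = (low + high) / 2
        if what_if(units=mid) < target:
            low = mid
        else:
            high = mid
    return high


base_profit = profit(**BASE)
profit_at_higher_price = what_if(price=22.0)
price_sensitivity = sensitivity("price")
units_for_target = goal_seek(10_000.0)
```

What-if analysis changes the inputs and observes the output, sensitivity analysis measures how strongly one input moves the output, and goal seeking works backwards from a desired output to the required input.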

12
Digital Dashboards

• Dashboards:
– Provide rapid access to timely information.
– Provide direct access to management reports.
– Are very user-friendly and supported by graphics.

13
The management cockpit

• A strategic management room that enables top-level decision makers to pilot their businesses better
• The environment encourages more efficient management
meetings and boosts team performance via effective
communication
• Key performance indicators and information relating to
critical success factors are displayed graphically on the
walls of the meeting room
• External information can be easily imported to the room to
allow competitive analysis

14
From Data to Knowledge

15
Evolution of sciences

• Before 1600, empirical science


• 1600-1950, theoretical science
– Dominated by theoretical models, which often motivate
experiments for better understanding the world
• 1950-1990, computational science
– We try to understand complex mathematical models through
simulation
• 1990-now, data science
– Abundance of data
– Ability to process huge data sets
– Data resources interconnection (through internet)
– The need for collection, management, querying and visualization of data grows with the volume of data

16
The Data Gap

• Information is usually hidden in the data and is not obvious
• Analysts need weeks to locate it through hypothesis testing
• Most of the data are never analysed
• We have the data! Now what?
…and we don’t know what to do with them.

17
Why do we need data analysts

• Explosive data growth (from terabytes to petabytes)
• Automated data collection
• Abundant sources
– Businesses : web, e-transactions, stock market
– Sciences: sensors, bioinformatics, scientific
experiments
– Society: news, digital cameras, social networks
“We are drowning in Data but starving for Knowledge”. John Naisbitt in “Megatrends”.
• We need automated analysis of large data sets

18
What else is data mining

• Knowledge Discovery in Databases


• Knowledge extraction
• Data and models analysis
• Data surveying over time and in varying degrees
of detail
• Collection of information and creation of business
intelligence

• NOT data search


• NOT query processing
• NOT a smart system that reacts to the rule base

19
Business intelligence (BI)

Source: http://decision-quality.com/

20
Stages of BI creation

Collection → Storage → Analysis → Delivery

“Business Intelligence-The Missing Link.” http://www.ittoolbox.com/peer/bi.pdf, Viewed 4/12/2006.

21
It is not simple
Layers, from the top (highest potential to support business decisions) to the bottom:

• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

22
What is Business Intelligence?

1. Tables and charts
– Tables (scorecards) present the key performance indicators (e.g. ROI)
– Charts (dashboards) present the performance in a condensed and simple “dials and gauges” format

http://www.microstrategy.com/Solutions/5Styles/
23
What is Business Intelligence?

1. Tables and charts


2. Company reports
– Extended reports, adapted to the needs of each user
group

24
What is Business Intelligence?

1. Tables and charts


2. Company reports
3. Data analytics (OLAP: On-line Analytical Processing)
– Associates data subsets (e.g. temporal data,
customer data, income data) in a multidimensional
analysis (“cube” analysis)
– Selectively provides access to the initial (raw) data

25
What is Business Intelligence?

1. Tables and charts


2. Company reports
3. Data analytics (OLAP: On-line Analytical Processing)
4. Composite analysis and prediction
– Training and evaluation of data mining techniques on past data.
– Application to new data and forecasting (i.e. predictions of how data will evolve, what-if scenarios, etc.)

26
What is Business Intelligence?

1. Tables and charts


2. Company reports
3. Data analytics (OLAP: On-line Analytical Processing)
4. Composite analysis and prediction
5. Alerts
– Automatic generation of reports and notifications about problems and opportunities

27
Market interest

• Income from BI software development
• Companies that develop BI software

Source: http://apandre.wordpress.com/market/

28
Scientific interest – Big Data

• Volume
– Scalable algorithms
• Variety
– Multidimensional data, e.g. microarray DNA data contain tens of thousands of features
– Spatial, spatiotemporal data, time-series data
– Web data, multimedia
– Graphs and hypergraphs in social networks
• Velocity
– Data streams, sensor data
• Veracity
– We are not always sure about data accuracy (e.g. GPS
data)
VALUE

29
Data mining for BI

30
Data mining steps

Steps of the process, from the bottom of the figure to the top:

1. Databases
2. Data Cleaning and Data Integration
3. Data Warehouse
4. Selection
5. Task-relevant Data
6. Data Mining
7. Pattern Evaluation
31
Main principles

• Select useful data
• Determine the format of the extracted knowledge (rules, predicted values, groupings, etc.)
• Capture the domain knowledge (e.g. in concept hierarchies or ontologies)
• Determine metrics for the evaluation of the patterns found (simplicity, certainty, utility, novelty)
• Visualize the results (interactive, abstract; choose the visualization depending on the knowledge format)
32
Example - Bank

• Business objective: Give residential loans that can be paid back
• Existing knowledge:
– Clients whose children are studying use these loans to pay tuition fees
– Customers with variable income use these loans to offset their income
• Lots of data:
– Large stores that continuously collect data from multiple active sources (data warehouse)

33
Sampling

• Choose a sample of the data on customers who have been granted a loan in the past
– Some paid it back
– Others did not
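
One possible way to draw such a sample is stratified sampling, which keeps both outcomes represented in the same proportions as in the full history. The sketch below uses made-up records (70 customers who repaid, 30 who did not); the function name and the 20% fraction are assumptions for illustration.

```python
import random

# Hypothetical loan history: (customer_id, repaid) pairs; 70 repaid, 30 did not.
HISTORY = [(i, i % 10 < 7) for i in range(100)]


def stratified_sample(records, fraction, seed=0):
    """Draw the same fraction from each outcome class, so that both
    'paid back' and 'did not pay back' stay represented."""
    rng = random.Random(seed)
    sample = []
    for outcome in (True, False):
        group = [r for r in records if r[1] == outcome]
        k = round(len(group) * fraction)
        sample.extend(rng.sample(group, k))
    return sample


sample = stratified_sample(HISTORY, fraction=0.2)
repaid = sum(1 for _, ok in sample if ok)
not_repaid = len(sample) - repaid
```

A purely random sample would also work on a large history; stratifying matters mostly when one outcome is rare.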

34
Pattern mining

• Find the rules that predict whether a customer will be able to pay the loan back
IF (Salary < 40k) and
(numChildren > 0) and
(ageChild1 > 18 and ageChild1 < 22)
THEN YES
• Group customers and describe each group by its predominant characteristics
– Among the many groups that have no special meaning, we find a group of customers that take a loan using their payroll or savings account
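
The IF/THEN rule above can be written as a predicate and checked against past data. In this sketch the dictionary keys mirror the attribute names of the rule, while the customer records, the outcome flags and the two quality measures (coverage and accuracy) are assumptions added for illustration.

```python
def predicts_repayment(customer):
    """The rule from the slide: IF salary < 40k AND has children
    AND the first child is between 18 and 22 THEN YES."""
    return (
        customer["salary"] < 40_000
        and customer["num_children"] > 0
        and 18 < customer["age_child1"] < 22
    )


def rule_quality(customers, repaid_flags):
    """Coverage = share of customers the rule fires on;
    accuracy = share of those customers who actually repaid."""
    fired = [ok for c, ok in zip(customers, repaid_flags) if predicts_repayment(c)]
    coverage = len(fired) / len(customers)
    accuracy = sum(fired) / len(fired) if fired else 0.0
    return coverage, accuracy


# Made-up loan history, for illustration only.
customers = [
    {"salary": 35_000, "num_children": 1, "age_child1": 20},
    {"salary": 38_000, "num_children": 2, "age_child1": 19},
    {"salary": 55_000, "num_children": 1, "age_child1": 20},
    {"salary": 30_000, "num_children": 0, "age_child1": 0},
]
repaid_flags = [True, False, True, True]

coverage, accuracy = rule_quality(customers, repaid_flags)
```

A mined rule is only useful if it fires on enough customers and is right often enough, which is why both numbers are reported.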

35
Common architecture for DW and DM
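Layers of the architecture, from top to bottom:

• Layer 4, User Interface: User GUI API (receives the mining query, returns the mining result)
• Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, on top of the Data Cube API
• Layer 2, MDDB: multidimensional database (MDDB) and Meta Data, built through the Database API by Filtering & Integration
• Layer 1, Data Repository: Databases and Data Warehouse, populated through Data cleaning and Data integration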
36
Discussion

• Briefly describe the following, emphasizing the data they contain, their differences and their possible applications:
– object-relational databases,
– spatial databases,
– text databases,
– multimedia databases,
– World Wide Web

37
Discussion

• Describe the following data mining concepts:
– association and correlation analysis, classification, prediction, clustering, data evolution analysis, characterization, discrimination.
• What are the scientific challenges for data mining in:
– Data streams, spatio-temporal data, bioinformatics?

38
Data mining tasks (1)

• Characterization and Discrimination
– Generalize, summarize, compare and contrast features of the data, e.g. given data collected from meteorological sensors, how can I distinguish dry from wet areas of the country?
• Association, correlation vs causation
– Correlation does not imply causation. When ice cream sales increase, the number of drownings increases too
• Classification and prediction
– Models (functions) that describe and distinguish classes or concepts for future prediction, e.g. prediction of unknown or missing values
– Classification of countries based on climate
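
The ice cream example can be made concrete with the Pearson correlation coefficient. All figures below are invented for illustration, and temperature is introduced here as an assumed common driver of both series; the sketch only shows that two variables can be strongly correlated without one causing the other.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up monthly observations.
temperature = [10, 15, 20, 25, 30]           # degrees Celsius
ice_cream_sales = [200, 250, 300, 350, 400]  # units sold, driven by temperature
drownings = [2, 3, 5, 6, 8]                  # incidents, also higher in hot months

r_ice_drown = pearson(ice_cream_sales, drownings)
r_temp_drown = pearson(temperature, drownings)
r_temp_ice = pearson(temperature, ice_cream_sales)
```

Ice cream sales and drownings are almost perfectly correlated in this toy data, yet both simply follow the temperature, so banning ice cream would not be expected to reduce drownings.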

39
Data mining tasks (2)

• Cluster analysis
– Group samples into new, previously unknown groups, e.g. group homes that are for sale and study the characteristics of the groups
– Aim to maximize the similarity within groups and the dissimilarity between groups
• Outlier analysis
– Outliers are samples whose behavior is completely different from that of all other samples
– Is it noise, an error or an exception?
• Trend analysis
– Trends and variations: e.g. regression analysis (find a function that describes the data, then find data that deviate far from it)
– Mining sequential patterns: e.g. searching for cameras → searching for a memory card
– Periodicity analysis
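
Trend analysis and outlier analysis can be combined in a few lines: fit a line by least squares, then flag the points that lie far from it. The data, the threshold of two residual standard deviations and the function names are assumptions chosen for this sketch; other thresholds and robust fitting methods are equally possible.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def find_outliers(xs, ys, k=2.0):
    """Return the points whose residual exceeds k times the residual spread."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > k * spread]


# Made-up series following y = 2x + 1, with one injected deviation at x = 5.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] += 20

intercept, slope = fit_line(xs, ys)
outliers = find_outliers(xs, ys)
```

Note that the deviating point also pulls the fitted line towards itself, so the estimated slope is slightly above 2. Whether the flagged point is noise, an error or a genuine exception still has to be decided by the analyst.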
40
Discussion

• Give an example of the usefulness of data mining in a business (or a sector in general) that you are familiar with
• Describe the data, the knowledge to be produced, the domain knowledge, the evaluation measures and the visualizations

41
Market analysis

• Where are the data:
– transactions with credit cards, coupons, customer complaints, market research
• Targeted advertising
– Find groups of customers with common characteristics: interests, income, buying habits
– Define buyers' patterns (over time)
• Cross-market analysis: correlated products (diapers, beer), bundle sales
• Analysis of customer needs
– What are the best products for each group of customers
– What factors attract new customers
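
Correlated products such as diapers and beer are usually quantified with support, confidence and lift. The baskets below are invented for illustration, and the helper functions are a minimal sketch of these three measures, not a full association rule miner.

```python
# Made-up market baskets, one set of products per transaction.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"diapers", "beer", "chips"},
    {"milk", "bread"},
]


def support(items, baskets=BASKETS):
    """Share of baskets that contain all the given items."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)


def confidence(antecedent, consequent, baskets=BASKETS):
    """Share of baskets with the antecedent that also contain the consequent."""
    return support(set(antecedent) | set(consequent), baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets=BASKETS):
    """Confidence relative to how often the consequent is bought anyway;
    values above 1 suggest the products are bought together more than by chance."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


support_both = support({"diapers", "beer"})
conf_diapers_beer = confidence({"diapers"}, {"beer"})
lift_diapers_beer = lift({"diapers"}, {"beer"})
```

In this toy data, three of the four baskets with diapers also contain beer, and the lift is slightly above 1, which is the kind of signal that motivates bundle sales.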

42
