
Introduction to Data Warehousing and Business Intelligence

Torben Bach Pedersen Aalborg University

Overview
Why Business Intelligence?
Data analysis problems
Data Warehouse (DW) introduction
Analysis technologies that use the DW
  OLAP
  Data mining
  Visualization
  A good DW is a prerequisite for using these technologies

Loosely covers [Jarke et al.] chapter 1
  Plus some of my own observations
  1.4 + 1.5 treated in more detail in a later lecture

Torben Bach Pedersen 2006 - DWML course

What is Business Intelligence?


Combination of technologies
  Data Warehousing (DW)
  On-Line Analytical Processing (OLAP)
  Data Mining (DM)
  Data Visualization (VIS)
  Decision Analysis (what-if)
  Customer Relationship Management (CRM)
  Vertical solutions composed of the base technologies

Buzzword compliant (still?)
  Extension/integration of the technologies above

The opposite of Artificial Intelligence (AI)
  AI systems make decisions for the users
  BI systems help the users make the right decisions, based on available data
  Many BI techniques have roots in AI, though

BI Is Important
Palo Alto Management Group: BI = $113 billion in 2002
The Web makes BI more necessary
  Customers do not appear physically in the store
  Customers can change to other stores more easily

Thus:
  Know your customers using data and BI!
  Web logs make it possible to analyze customer behavior in more detail than before (what was not bought?)
  Combine web data with traditional customer data

Next step is the Wireless Internet
  Customers are always online
  Customers' position is known
  Combine position and customer knowledge => very valuable!


Data Analysis Problems


The same data found in many different systems
  Example: customer data in 14 (now 23) systems!
  The same concept is defined differently (Nykredit)

Data is suited for operational systems (OLTP)
  Accounting, billing, etc.
  Do not support analysis across business functions

Data quality is bad
  Missing data, imprecise data, different use of systems

Data are volatile
  Data deleted in operational systems (6 months)
  Data change over time: no historical information


Data Warehousing
Solution: new analysis environment (DW) where data are
  Subject oriented (versus function oriented)
  Integrated (logically and physically)
  Stable (data not deleted, several versions)
  Time variant (data can always be related to time)
  Supporting management decisions (different organization)

Data from the operational systems are
  Extracted
  Cleansed
  Transformed
  Aggregated?
  Loaded into DW

Getting multidimensional data into the DW
A good DW is a prerequisite for successful BI


DW: Purpose and Definition


The purpose of a data warehouse is to support decision making
Data is collected from a number of different sources
  Finance, billing, web logs, personnel, ...

It is made easy to perform advanced analyses
  Ad-hoc analyses and reports
  Data mining: identification of trends
  Management Information Systems

A data warehouse is a store of information organized in a unified data model.



DW Architecture: Data as Materialized Views

[Figure: existing databases and systems (OLTP), shown as applications (Appl.) with their databases (DB), feed through transformations (Trans.) into a global data warehouse (DW). Data marts (DM) serve the new databases and systems (OLAP): OLAP, data mining, and visualization.]


OLTP vs. OLAP


On-Line Transaction Processing
  Many, small queries
  Frequent updates
  The system is always available for both updates and reads
  Smaller data volume (little historical data)
  Complex data model (normalized)

On-Line Analytical Processing
  Fewer, but bigger queries
  Frequent reads, infrequent updates (daily)
  2-phase operation: either reading or updating
  Larger data volumes (collection of historical data)
  Simple data model (multidimensional/de-normalized)
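
The difference in query style can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the `sales` table, its columns, and the amounts are hypothetical, and SQLite is used because it ships with Python, not because it is a DW product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, year INTEGER, city TEXT, amount REAL)"
)
# Invented rows, for illustration only.
conn.executemany(
    "INSERT INTO sales (year, city, amount) VALUES (?, ?, ?)",
    [(2000, "Aalborg", 100.0), (2000, "Copenhagen", 250.0),
     (2001, "Aalborg", 150.0), (2001, "Copenhagen", 300.0)],
)

# OLTP style: a small query that touches one row, followed by an update.
row = conn.execute("SELECT amount FROM sales WHERE id = ?", (1,)).fetchone()
conn.execute("UPDATE sales SET amount = ? WHERE id = ?", (row[0] + 10.0, 1))

# OLAP style: a read-only query that scans and aggregates many rows.
totals = conn.execute(
    "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year"
).fetchall()
print(totals)
conn.close()
```

In a real setup the two workloads would run against separate systems: the operational database for the first kind of statement, the DW for the second.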


Function- vs. Subject Orientation


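[Figure: function-oriented systems, shown as applications (Appl.) with their own databases (DB), feed through transformations (Trans.) into subject-oriented systems: a DW holding all subjects, integrated, and data marts (DM) holding selected subjects, which serve the decision applications (D-Appl.).]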


n x m versus n + m
[Figure: applications (Appl.) with their databases (DB) connected through transformations (Trans.) to data marts (DM) and decision applications (D-App).]
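
One possible reading of the slide title, stated here as an assumption because the slide gives only the title and a figure: with n operational source systems and m data marts, transforming every source separately for every data mart requires n x m transformations, whereas loading all sources into a central DW and deriving the data marts from it requires n + m. A minimal sketch of that arithmetic, with hypothetical function names and sizes:

```python
def point_to_point(n_sources: int, m_marts: int) -> int:
    """Transformations needed when each source feeds each data mart directly."""
    return n_sources * m_marts


def via_warehouse(n_sources: int, m_marts: int) -> int:
    """Transformations needed when sources load a DW and marts are derived from it."""
    return n_sources + m_marts


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    for n, m in [(2, 2), (5, 3), (23, 4)]:
        direct = point_to_point(n, m)
        shared = via_warehouse(n, m)
        print(f"n={n:2d}, m={m}: direct={direct:3d}, via DW={shared:3d}")
```

Under this reading, the gap widens quickly as either the number of sources or the number of data marts grows.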


Architecture Alternative
[Figure: alternative architecture in which applications (Appl.) with their databases (DB) feed data marts (DM), a DW, and decision applications (D-Appl.) through transformations (Trans.).]

Top-down vs. Bottom-up


[Figure: applications (Appl.) with their databases (DB) feed a DW and data marts (DM) through transformations (Trans.); decision applications (D-Appl.) use the DW and DMs.]

Top-down:
  1. Design of DW
  2. Design of DMs

In-between:
  1. Design of DW for DM1
  2. Design of DM2 and integration with DW
  3. Design of DM3 and integration with DW
  4. ...

Bottom-up:
  1. Design of DMs
  2. Maybe integration of DMs in DW
  3. Maybe no DW



Data's Way To The DW

Extraction
  Extract from many heterogeneous systems

Staging area
  Large, sequential bulk operations => flat files best?

Cleansing
  Data checked for missing parts and erroneous values
  Default values provided and out-of-range values marked

Transformation
  Data transformed to decision-oriented format
  Data from several sources merged, optimized for querying

Aggregation?
  Are individual business transactions needed in the DW?

Loading into DW
  Large bulk loads rather than SQL INSERTs
  Fast indexing (and pre-aggregation) required
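
The cleansing and aggregation steps on this slide can be sketched in a few lines. This is an illustrative sketch only: the record layout, the field names (`customer`, `city`, `amount`), the default city, and the valid amount range are all hypothetical, and a real load would use the DBMS's bulk loader rather than Python lists.

```python
from collections import defaultdict

# Hypothetical raw rows as they might arrive from the source systems.
RAW_ROWS = [
    {"customer": "C1", "city": "Aalborg", "amount": 120.0},
    {"customer": "C2", "city": None, "amount": 80.0},          # missing city
    {"customer": "C3", "city": "Copenhagen", "amount": -5.0},  # out of range
    {"customer": "C1", "city": "Aalborg", "amount": 30.0},
]

DEFAULT_CITY = "Unknown"        # assumed default value
VALID_AMOUNT = (0.0, 10_000.0)  # assumed valid range


def cleanse(row):
    """Fill in defaults and mark out-of-range values instead of dropping rows."""
    clean = dict(row)
    if clean["city"] is None:
        clean["city"] = DEFAULT_CITY
    low, high = VALID_AMOUNT
    clean["suspect"] = not (low <= clean["amount"] <= high)
    return clean


def aggregate_by_city(rows):
    """Sum the amounts per city, skipping rows marked as suspect."""
    totals = defaultdict(float)
    for row in rows:
        if not row["suspect"]:
            totals[row["city"]] += row["amount"]
    return dict(totals)


cleansed = [cleanse(r) for r in RAW_ROWS]
totals = aggregate_by_city(cleansed)
print(totals)
```

Marking suspect rows rather than deleting them keeps the decision about what to do with them (correct, exclude, report back to the source system) outside the cleansing step.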

Getting Multidimensional Data Out of DW


Large data volumes, e.g., sales, telephone calls
  Giga-, Tera-, Peta-, Exa-byte

OLAP = On-Line Analytical Processing
  Interactive analysis
  Explorative discovery
  Fast response times required

OLAP operations
  Aggregation of data
  Standard aggregation operators, e.g., SUM
  Starting level: (Quarter, City)
  Roll Up: Less detail, Quarter->Year
  Drill Down: More detail, Quarter->Month
  Slice/Dice: Selection, Year=1999
  Drill Across: Join
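
The operations listed on this slide can be illustrated on a tiny in-memory fact table. This is a sketch, not a description of how an OLAP server works internally: the sales figures are invented, and `aggregate` and `slice_rows` are hypothetical helper names introduced here.

```python
from collections import defaultdict

# Invented facts: (year, quarter, city, sales).
FACTS = [
    (1999, "Q1", "Aalborg", 10), (1999, "Q1", "Copenhagen", 20),
    (1999, "Q2", "Aalborg", 15), (1999, "Q2", "Copenhagen", 25),
    (2000, "Q1", "Aalborg", 12), (2000, "Q1", "Copenhagen", 30),
]
COLUMNS = {"year": 0, "quarter": 1, "city": 2}


def aggregate(facts, dims):
    """SUM the sales measure grouped by the given dimension levels."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[COLUMNS[d]] for d in dims)
        totals[key] += fact[3]
    return dict(totals)


def slice_rows(facts, dim, value):
    """Slice: keep only the facts where one dimension has a fixed value."""
    return [f for f in facts if f[COLUMNS[dim]] == value]


# Starting level (Quarter, City); the year is kept so quarters stay distinct.
start = aggregate(FACTS, ["year", "quarter", "city"])
# Roll up Quarter -> Year: less detail.
by_year_city = aggregate(FACTS, ["year", "city"])
# Slice on Year = 1999, then aggregate per city.
sales_1999 = aggregate(slice_rows(FACTS, "year", 1999), ["city"])

print(by_year_city)
print(sales_1999)
```

Drill down is the reverse direction of roll up: going from `by_year_city` back to `start` is only possible because the detailed facts are still available.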

Cube Example

[Figure: bar chart of total Sales (axis from 0 to 350) by Year (2000, 2001) and City (Aalborg, Copenhagen).]


OLAP example
Millions of clicks
Still fast query response due to specialized DBMS technology


OLAP applications
Reporting and querying
Problem and opportunity analysis
  I (and most) use Business Intelligence to mean more than this

Planning applications
Specialized data mining / analysis projects


DW Applications: Visualization
Graphical presentation of complex results
Color, size, and form help to give a better overview


DW Applications: Data Mining


Data mining is automatic knowledge discovery
Roots in AI and statistics

Classification
  Partition data into pre-defined classes

Prediction
  Predict/estimate unknown value based on similar cases

Clustering
  Partition data into groups so the similarity within individual groups is greatest and the similarity between groups is smallest

Affinity grouping/associations
  Find associations/dependencies between data that occur together
  Rules: A -> B (c%, s%): if A occurs, B occurs with confidence c and support s

Important to choose the granularity for mining
  Too fine a granularity does not give any useful results (shirt brand, ...)
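
The rule notation A -> B (c%, s%) can be made concrete with a small calculation. The sketch below uses the common definitions: support is the fraction of all transactions that contain both A and B, and confidence is the fraction of the transactions containing A that also contain B. The tickets are invented for illustration.

```python
# Invented sales tickets; each is the set of items bought together.
TICKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(tickets, items):
    """Fraction of tickets that contain every item in `items`."""
    hits = sum(1 for t in tickets if items <= t)
    return hits / len(tickets)


def confidence(tickets, antecedent, consequent):
    """Fraction of tickets with the antecedent that also contain the consequent.

    Assumes the antecedent occurs in at least one ticket.
    """
    return support(tickets, antecedent | consequent) / support(tickets, antecedent)


s = support(TICKETS, {"diapers", "beer"})
c = confidence(TICKETS, {"diapers"}, {"beer"})
print(f"diapers -> beer (c={c:.0%}, s={s:.0%})")
```

Here two of the five tickets contain both items, and two of the three tickets with diapers also contain beer. Granularity matters in the same way as on the slide: mining on individual brands instead of product categories would spread these few tickets over many more distinct items and push every support value toward zero.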

Data Mining Examples


Wal-Mart: the USA's largest supermarket chain
  Has DW with all ticket item sales for the last 5 years (big!)
  Uses DW and mining intensively to gain business advantages
  Analysis of association within sales tickets
    Discovery: Beer and diapers on the same ticket
    Men buy diapers, and must just have a beer
    Put the expensive beers next to the diapers
    Put beer at some distance from diapers with chips, videos in-between!

Wal-Mart's suppliers use the DW to optimize delivery
  The supplier puts the product on the shelf
  The supplier only gets paid when the product is sold

Web log mining
  What is the association between time of day and requests?
  What user groups use my site?
  How many requests does my site get in a month? (Yahoo)
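
The first web log question, the association between time of day and requests, starts with grouping requests by hour. A minimal sketch, assuming log lines that begin with an ISO-style timestamp followed by the requested path; this log format and the sample lines are hypothetical, and real server logs differ.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines: "<timestamp> <path>".
LOG_LINES = [
    "2006-02-01T08:15:00 /index.html",
    "2006-02-01T08:47:10 /products",
    "2006-02-01T13:05:42 /index.html",
    "2006-02-02T08:01:03 /cart",
]


def requests_per_hour(lines):
    """Count requests per hour of day (0-23) across all days in the log."""
    counts = Counter()
    for line in lines:
        timestamp, _path = line.split(" ", 1)
        counts[datetime.fromisoformat(timestamp).hour] += 1
    return counts


per_hour = requests_per_hour(LOG_LINES)
for hour, count in sorted(per_hour.items()):
    print(f"{hour:02d}:00  {count} request(s)")
```

The monthly request count from the third question is the same grouping with the month, rather than the hour, as the key.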


Common DW Issues
Metadata management
  Need to understand data = metadata needed
  Greater need than in OLTP applications as raw data is used
  Need to know about:
    Data definitions, dataflow, transformations, versions, usage, security

DW project management
  DW projects are large and different from ordinary SW projects
    12-36 months and US$ 1+ million per project
  Data marts are smaller and safer (bottom-up approach)

Reasons for failure
  Lack of proper design methodologies
  High HW+SW cost (not so much anymore)
  Deployment problems (lack of training)
  Organizational change is hard (new processes, data ownership, ...)
  Ethical issues (security, privacy, ...)

Summary
Why Business Intelligence?
Data analysis problems
Data Warehouse (DW) introduction
Analysis technologies that use the DW
  OLAP
  Data mining
  Visualization

BI can provide many advantages to your organization
  A good DW is a prerequisite for BI
  But a DW is a means rather than a goal: it is only when it is heavily used that success is achieved


DWML Mini Project and Exam


Basis for discussion at the oral exam (20 mins per person)
  Maximum 3 persons at a time in exam

Exam also covers literature
  Not just mini project
  Questions in theoretical background, too

Performed in groups of ~4 persons
  2 focus on DW tools, 2 on data mining tools
  Workload estimated based on this

Documented in report of ca. 20 pages
Deadline: April 20
  But every part should be done when indicated on home page

Challenge: privacy versus knowledge!!!
  Are personal identifiers interesting and/or needed?

DWML Software
DW software
  MS SQL Server 2005 RDBMS
  MS Analysis Services
  MS Integration Services
  MS Reporting Services

Data mining software
  Presented by Manfred Jaeger

MS software via MSDNAA
  Talk to junta/WAP about accounts

Groups to be formed today!
  Inform TBP about the groups at 12.00
