
One Day Workshop on
Design and Architecture of ETL

Data Warehouse Concepts

By
Mr. Umar Frauq & Mr. C. Divakar
17th November 2012

Agenda

Introduction to Data Warehouse

Operational Systems

Why a separate Data Warehouse ?

Data Warehouse Architecture

Data Marts

Operational Data Store

Data Warehouse Building Approach

Dimensional Modeling

Database Design Methodology for DWH

On-line Analytical Processing (OLAP)


What is a Data Warehouse?
A data warehouse is a decision-support database that is maintained separately from the
organization's operational database.
It supports information processing by providing a platform of consolidated, historical
data for analysis.
It is a concept, not a tool.

Data warehousing:
The process of constructing and using data warehouses.

A repository providing access to integrated, enterprise-wide data for management and
decision-support analysts. The information is subject-oriented, recorded over time, and
may be stored at various degrees of summarization.

Evolution of Data Warehousing

Since the 1970s, organizations have focused most new investment on computer systems
that automate business processes.
More recently, organizations have been looking for ways to use operational data to
support decision-making, as a means of gaining a competitive edge.
However, operational systems were never designed to support such business activities.
Using these systems for decision making is rarely an easy solution.
The data warehouse concept emerged as the answer: a system capable of supporting
decision making by receiving data from multiple operational data sources.

Benefits of a Data Warehouse

High ROI
  High implementation cost (anywhere from $50,000 to over $10 million, due to the variety of solutions available)
  Avg. 3 yr return (ROI) reached 401% (study conducted by International Data Corporation (IDC) in 1996)
    90% of companies achieved over 40% ROI.
    Half of the companies achieved over 160% ROI.
    A quarter achieved more than 600% ROI.

Increased Productivity
  Data warehousing improves the productivity of corporate decision-makers by creating an integrated database of consistent, subject-oriented, historical data.
  It integrates data from multiple incompatible systems into a form that provides one consistent view of the organization.
  By transforming data into meaningful information, a data warehouse allows corporate decision makers to perform more substantive, accurate and consistent analysis.

Competitive Advantage
  The huge returns on investment for companies that have successfully implemented a data warehouse are evidence of the enormous competitive advantage that accompanies this technology.
  Competitive advantage is gained by allowing decision-makers access to data that can reveal previously unavailable, unknown and untapped information on, for example, customers, trends and demands.

Data Warehouse Definition (as defined by the Data Warehouse gurus)

A subject-oriented, integrated, time-variant and non-volatile collection of data in
support of management's decision making process - W.H. Inmon

A copy of transaction data, specifically structured for query and analysis - Ralph Kimball

Characteristics of a Data Warehouse

Subject Oriented
  Organized around major subjects, such as customer, product, sales.
  Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
  Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Integrated
  Integrates multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records.
  Data cleaning and data integration techniques are applied.
  Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources, e.g., Hotel price: currency, tax, breakfast covered, etc.
  When data is moved to the warehouse, it is converted.

Time Variant
  The time horizon for the data warehouse is significantly longer than that of operational systems.
  Operational database: current value data.
  Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years).

Non-Volatile
  A physically separate store of data transformed from the operational environment.
  Operational update of data does not occur in the data warehouse environment.
  Does not require transaction processing, recovery, and concurrency control mechanisms.
  Requires only two operations in data accessing: initial loading of data and access of data.
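
The conversion that happens when data moves into the warehouse (the hotel price example under "Integrated") could be sketched roughly as follows. This is only an illustration: the two source layouts, the field names, the exchange rate and the tax rate are all assumptions made up for the example, not part of the workshop material.

```python
from dataclasses import dataclass

# Hypothetical conversion constants, chosen only for illustration.
EUR_TO_USD = 1.10   # assumed exchange rate
TAX_RATE = 0.10     # assumed tax rate


@dataclass(frozen=True)
class HotelPrice:
    """Warehouse format: price in USD, tax included, explicit breakfast flag."""
    hotel: str
    price_usd: float
    breakfast_included: bool


def from_source_a(rec: dict) -> HotelPrice:
    # Source A (assumed): price in USD with tax included, 'bf' is 'Y' or 'N'.
    return HotelPrice(rec["hotel_name"], round(rec["price"], 2), rec["bf"] == "Y")


def from_source_b(rec: dict) -> HotelPrice:
    # Source B (assumed): price in EUR before tax, 'breakfast' is 0 or 1.
    gross_usd = rec["net_eur"] * (1 + TAX_RATE) * EUR_TO_USD
    return HotelPrice(rec["name"], round(gross_usd, 2), bool(rec["breakfast"]))


rows = [
    from_source_a({"hotel_name": "Alpha", "price": 121.0, "bf": "Y"}),
    from_source_b({"name": "Beta", "net_eur": 100.0, "breakfast": 0}),
]
```

After conversion both records share one naming convention, one currency and one tax treatment, so they can be compared directly.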

Applications of Data Warehouse
Three kinds of data warehouse applications:

Information processing
  Supports querying, basic statistical analysis, and reporting using cross-tabs, tables, charts and graphs.
Analytical processing
  Multidimensional analysis of data warehouse data.
  Supports basic OLAP operations: slice-dice, drilling, pivoting.
Data mining
  Knowledge discovery from hidden patterns.
  Supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.

Operational Systems (OLTP)
They are systems which help us run day-to-day enterprise operations
They are the back-bone systems of any enterprise
Operational systems create a high volume of transactions, such as sales,
purchases, deposits, withdrawals, returns, refunds, phone calls, toll
roads, web site hits, etc.
Transactions are the base level of data: the raw material for
understanding customer behavior
Unfortunately, operational systems change due to changing business
needs
Fortunately, operational systems can usually be changed to support
changing business needs
Data warehousing strategies need to be aware of operational system
changes

Operational Systems (OLTP) Vs Data Warehouse (OLAP)

Operational Systems | Data Warehouse Systems
Holds current data. | Holds historical data.
Stores detailed data. | Stores detailed, lightly, and highly summarized data.
Data is dynamic. | Data is largely static.
Repetitive processing. | Ad hoc, unstructured, and heuristic processing.
High level of transaction throughput. | Medium to low level of transaction throughput.
Predictable pattern of data usage. | Unpredictable pattern of usage.
Transaction driven. | Analysis driven.
Application oriented. | Subject oriented.
Supports day to day decisions. | Supports strategic decisions.
Serves large number of clerical/operation users. | Serves relatively low number of managerial users.

Why a separate Warehouse?

Performance
  Special data organisation, access methods, and implementation methods are needed to support multidimensional views and operations typical of OLAP.
  Complex OLAP queries degrade performance for operational transactions.
  Concurrency control and recovery modes of OLTP are not compatible with OLAP analysis.

Functions
  Missing Data: Decision support requires historical data which operational DBs do not typically maintain.
  Data Consolidation: Decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBs, external sources.
  Data Quality: Different sources typically use inconsistent data representations, codes and formats which have to be reconciled.

Advantages of a Data Warehouse

High query performance


Queries not visible outside warehouse
Local processing separate from Operational Systems
No dependency on source for availability of data
Maintains information at various granular levels
Has historical information

Data Warehouse Architecture

Data Warehouse Architectural Components

Operational Data:
Source of Data for the Data Warehouse.

Operational Data Store:
A repository of current and integrated Operational Data used for
analysis.

Load Manager:
Performs all the operations associated with the extraction and
loading of data into the warehouse.


Warehouse Manager:
Performs operations such as:
  Analysis of data to ensure consistency.
  Transformation and merging of source data from temporary storage into data warehouse tables.
  Creation of indexes and views on base tables.
  Generation of denormalizations.
  Generation of aggregations.
  Backing up and archiving data.
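
The division of work between the load manager and the warehouse manager could be sketched as below, using an in-memory SQLite database. All table names, column names and sample rows are assumptions invented for this illustration; a real warehouse would use dedicated ETL tooling.

```python
import sqlite3

# Hypothetical operational extract: (order_id, product, qty, unit_price).
source_rows = [
    (1, " tv ", 2, 300.0),
    (2, "PC", 1, 800.0),
    (3, "TV", 1, 300.0),
]

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staging_sales (order_id INTEGER, product TEXT, qty INTEGER, unit_price REAL);
    CREATE TABLE sales_detail  (order_id INTEGER PRIMARY KEY, product TEXT, qty INTEGER, amount REAL);
""")

# Load manager: extract from the source and load into temporary (staging) storage.
con.executemany("INSERT INTO staging_sales VALUES (?, ?, ?, ?)", source_rows)

# Warehouse manager: transform (clean product codes, derive the amount) and
# merge the staged rows into the warehouse table.
con.execute("""
    INSERT INTO sales_detail (order_id, product, qty, amount)
    SELECT order_id, UPPER(TRIM(product)), qty, qty * unit_price
    FROM staging_sales
""")

# Warehouse manager: create an index and an aggregation on the base table.
con.execute("CREATE INDEX idx_sales_product ON sales_detail (product)")
con.execute("""
    CREATE TABLE sales_by_product AS
    SELECT product, SUM(qty) AS total_qty, SUM(amount) AS total_amount
    FROM sales_detail
    GROUP BY product
""")

summary = con.execute(
    "SELECT product, total_qty, total_amount FROM sales_by_product ORDER BY product"
).fetchall()
```

Here `summary` holds one row per cleaned product code, which is the kind of pre-built aggregation that later speeds up queries.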

Query Manager
  Performs all operations associated with the management of user queries.
  Examples:
    Directing queries to the appropriate tables.
    Scheduling the execution of queries.
Detailed Data
  The area that stores all the detailed data in the database schema.
  Detailed data is often not kept online; instead it is made available by aggregating it to the next level of detail.
  On a regular basis, detailed data is added to the warehouse to supplement the aggregated data.


Lightly and Highly Summarized Data
  Stores all the predefined lightly and highly summarized data generated by the warehouse manager.
  This area of the warehouse is transient, so that it can respond to changing query profiles.
  Used to speed up performance of queries.
  Continuously updated as new data is loaded into the warehouse.


Archive/Backup Data
  This area of the data warehouse stores detailed and summarized data for the purposes of archiving and backup.
  Data is transferred to storage devices such as magnetic tape or optical disk.

Metadata
  This area stores all the metadata (data about data) definitions used by all the processes in the warehouse.
  Used for a variety of purposes:
    Extraction and loading processes.
    Warehouse management process.
    As part of the query management process.


End-User Access Tools
  Users interact with the warehouse using end-user access tools. For example:
    Reporting and query tools.
    Application development tools.
    Executive Information System (EIS) tools.
    Online Analytical Processing (OLAP) tools.
    Data mining tools.

Data Marts

A decentralized subset of data, found either within a data warehouse or as a standalone
subset, designed to support the unique requirements of a specific business unit's
decision-support system.
A subset of the data warehouse that supports the requirements of a particular department
or business function.
It is designed for a particular line of business, such as sales, marketing, or finance.
In a dependent data mart, data is derived from an enterprise-wide data warehouse.
In an independent data mart, data is collected directly from sources.

Data Marts - Advantages

Single subject area


Limited source systems (Reduced data volume)
Simpler to maintain
Focused on user needs
Limited scope
Optimum model for DW construction
Demonstrates ROI
Allows prototyping

Data Marts - Disadvantages

Does not provide overall view of business

Data redundancy / repetition due to multiple data marts

Complex maintenance

Scalability issues

Data Warehouse Vs Data Marts

Data Warehouse | Data Marts
Focuses on the entire enterprise | Focuses only on the requirements of users associated with one department/business function
Contains detailed Operational Data | Contains only summary information
More complex and tough to navigate | Easily understood and navigated

Operational Data Store (ODS)

It is a type of database often used as an interim area for building a data warehouse.
Contents of the ODS are updated through the course of business
operations.
Designed to quickly perform relatively simple queries on small amounts
of data (such as finding the status of a customer order), rather than the
complex queries on large amounts of data typical of the data
warehouse.
An ODS is similar to your short term memory in that it stores only very
recent information; in comparison, the data warehouse is more like long
term memory in that it stores relatively permanent information.

Approaches to building a Data Warehouse

Top-Down
Bottom-Up

Essential Components of a Data Warehouse

FACT Tables
  They hold the Key Performance Indicators (KPIs) of an enterprise.
  Data in the FACT tables is usually numeric.
  A FACT table is the central table in a data warehouse schema; it contains numerical measures and keys relating facts to dimension tables.
  They contain data that describes specific events within a business, such as bank transactions or product sales.
  It is a table that contains the information (measures) that the business users wish to analyze to find new trends or to understand the success or failure of the organization.
  It consists of the measurements, metrics or facts of a business process. It is often located at the centre of a star schema, surrounded by dimension tables.

DIMENSION Tables
  Dimension tables are the set of tables which provide a perspective on the data maintained in the FACT tables.
  They are the WHEN, WHAT, WHERE qualifiers to the measures.
  They represent the subject area of the Data Warehouse.
  They hold descriptive information about a particular business perspective.
  They contain relatively static data.
  They are joined to a fact table through a foreign key reference.

Dimensional Modeling

A logical design technique that aims to present the data in a standard, intuitive form
that allows for high-performance access.
A Data Mart uses the concepts of ER modeling with some restrictions.
Every Data Mart is composed of one table, called the Fact Table, with a composite primary key.
Additionally, it contains a set of smaller tables called Dimension Tables, each with a
simple (non-composite) primary key.

Features of Dimensional Modeling
Efficiency: The consistency of the underlying database structure allows more efficient
access to the data by various tools, such as report writers and query tools.

Ability to handle changing environments: This design is better able to support ad hoc
user queries.

Extensibility: The model is extensible with respect to adding new facts, dimensions and
attributes, and breaking existing dimension records.

Ability to model common business situations: Using report writers, query tools and
other user interfaces, every situation has a well-understood set of alternatives.

Predictable query processing: Even though the overall data model in the enterprise is
complex, query processing is very predictable, as every fact table has to be queried
independently.

Dimensional Modeling Techniques: Star Schema
Star Schema: A single object (fact table) in the middle connected to a
number of dimension tables

sale (fact table): orderId, date, custId, prodId, storeId, qty, amt
product: prodId, name, price
customer: custId, name, address, city
store: storeId, city
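
The star schema in the diagram (sale, product, customer, store) could be built and queried along these lines with SQLite. The column types, the sample rows and the choice of query are assumptions for illustration only; the table and column names follow the diagram.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension tables: one simple primary key each.
    CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE customer (custId TEXT PRIMARY KEY, name TEXT, address TEXT, city TEXT);
    CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);

    -- Fact table: foreign keys to each dimension plus the measures qty and amt.
    CREATE TABLE sale (
        orderId TEXT PRIMARY KEY,
        date    TEXT,
        custId  TEXT REFERENCES customer(custId),
        prodId  TEXT REFERENCES product(prodId),
        storeId TEXT REFERENCES store(storeId),
        qty     INTEGER,
        amt     REAL
    );

    -- Made-up sample data.
    INSERT INTO product  VALUES ('p1', 'bolt', 10.0), ('p2', 'nut', 5.0);
    INSERT INTO customer VALUES ('c1', 'Ann', '1 Main St', 'Pune'),
                                ('c2', 'Raj', '2 Hill Rd', 'Delhi');
    INSERT INTO store    VALUES ('s1', 'Pune'), ('s2', 'Delhi');
    INSERT INTO sale VALUES
        ('o1', '2012-11-01', 'c1', 'p1', 's1', 1, 10.0),
        ('o2', '2012-11-01', 'c2', 'p2', 's1', 2, 10.0),
        ('o3', '2012-11-02', 'c1', 'p1', 's2', 3, 30.0);
""")

# Star join: the fact table joins directly to each dimension it needs.
amount_by_city_and_product = con.execute("""
    SELECT st.city, p.name, SUM(s.amt)
    FROM sale s
    JOIN product p ON p.prodId = s.prodId
    JOIN store st  ON st.storeId = s.storeId
    GROUP BY st.city, p.name
    ORDER BY st.city, p.name
""").fetchall()
```

Every dimension is one join away from the fact table, which is what keeps queries against a star schema simple and predictable.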

Dimensional Modeling Techniques: Snowflake Schema
Snowflake Schema: A refinement of star schema where the dimensional
hierarchy is represented explicitly by normalizing the dimension tables

Sales Fact Table: Date, Product, Store, Customer; Measurements: unit_sales, dollar_sales, schilling_sales
Product: ProductNo, ProdName, ProdDesc, Category, QOH
Cust: CustId, CustName, CustCity, CustCountry
Date hierarchy: Date (Date, Month) -> Month (Month, Year) -> Year (Year)
Store hierarchy: Store (StoreID, City) -> City (City, State) -> State (State, Country) -> Country (Country, Region)
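
A small part of the snowflake diagram, the Store -> City -> State hierarchy, could be sketched in SQLite as follows. The lower-case table names, the placeholder values (S1, C1, X and so on) and the sample sales are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- The store dimension is normalized into one table per hierarchy level.
    CREATE TABLE state (state TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE city  (city TEXT PRIMARY KEY, state TEXT REFERENCES state(state));
    CREATE TABLE store (storeId INTEGER PRIMARY KEY, city TEXT REFERENCES city(city));
    CREATE TABLE sales_fact (store INTEGER REFERENCES store(storeId), unit_sales INTEGER);

    -- Made-up placeholder data.
    INSERT INTO state VALUES ('S1', 'X'), ('S2', 'X');
    INSERT INTO city  VALUES ('C1', 'S1'), ('C2', 'S1'), ('C3', 'S2');
    INSERT INTO store VALUES (1, 'C1'), (2, 'C2'), (3, 'C3');
    INSERT INTO sales_fact VALUES (1, 5), (2, 7), (3, 11), (1, 2);
""")

# Reaching the state level takes one join per normalized level of the hierarchy.
units_by_state = con.execute("""
    SELECT c.state, SUM(f.unit_sales)
    FROM sales_fact f
    JOIN store s ON s.storeId = f.store
    JOIN city  c ON c.city = s.city
    GROUP BY c.state
    ORDER BY c.state
""").fetchall()
```

Compared with a star schema, the hierarchy is stored without redundancy, at the cost of extra joins when querying higher levels.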

Nine-Step Design Methodology (by Ralph Kimball)

1. Choosing the process
2. Choosing the grain
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. Tracking slowly changing dimensions
9. Deciding the query priorities and the query modes

Step 1: Choosing the process

The process (function) refers to the subject matter of a particular data mart.
The first data mart to be built should be the one that is most likely to be delivered
on time, within budget, and to answer the most commercially important business questions.
The best choice for the first data mart tends to be the one that is related to sales.

Step 2: Choosing the grain

Choosing the grain means deciding exactly what a fact table record represents.
For example, the entity Sales may represent the facts about each property sale.
Therefore, the grain of the Property_Sales fact table is an individual property sale.
Choosing the grain of the fact table helps in identifying the dimensions of the table.
The grain decision for the fact table also determines the grain of each of the
dimension tables.
For example, if the grain for Property_Sales is an individual property sale, then the
grain of the Client dimension is the detail of the client who bought a particular property.

Step 3: Identifying and conforming the dimensions

Dimensions set the context for formulating queries about the facts in the fact table.
We identify dimensions in sufficient detail to describe things such as clients and
properties at the correct grain.
If any dimension occurs in two data marts, they must be exactly the same dimension, or
one must be a subset of the other (this is the only way that two data marts can share
one or more dimensions in the same application).
When a dimension is used in more than one data mart, the dimension is referred to as
being conformed.

Step 4: Choosing the facts

The grain of the fact table determines which facts can be used in the data mart: all
facts must be expressed at the level implied by the grain.
In other words, if the grain of the fact table is an individual property sale, then all
the numerical facts must refer to this particular sale (the facts should be numeric and
additive).

FACT Table Types: Additive FACT Tables

SALES_FACT: GEOGRAPHY_KEY, PRODUCT_KEY, DATE_KEY, CUSTOMER_KEY, QUANTITY_SOLD, PRICE
GEOGRAPHY_DIM: GEOGRAPHY_KEY
TIME_DIM: DATE_KEY
PRODUCT_DIM: PRODUCT_KEY
CUSTOMER_DIM: CUSTOMER_KEY

Additive FACT Columns: QUANTITY_SOLD, PRICE

FACT Table Types: Semi-Additive FACT Tables

INVENTORY_BALANCE: GEOGRAPHY_KEY, PRODUCT_KEY, DATE_KEY, CUSTOMER_KEY, QUANTITY_BALANCE
GEOGRAPHY_DIM: GEOGRAPHY_KEY
TIME_DIM: DATE_KEY
PRODUCT_DIM: PRODUCT_KEY
CUSTOMER_DIM: CUSTOMER_KEY

Semi-Additive FACT Column: QUANTITY_BALANCE (can be summed across some dimensions, but not across time)
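
Why a balance column behaves differently from a sales quantity can be seen in a short sketch. The snapshot rows below are invented for illustration, and the tuple layout is an assumption rather than the schema from the slide.

```python
from collections import defaultdict

# Hypothetical daily inventory snapshots: (date_key, product_key, quantity_balance).
inventory_balance = [
    ("2012-11-01", "TV", 10), ("2012-11-01", "PC", 4),
    ("2012-11-02", "TV", 8),  ("2012-11-02", "PC", 6),
]

# Valid: summing balances across products for one date gives the total stock that day.
stock_per_day = defaultdict(int)
for date_key, _product, qty in inventory_balance:
    stock_per_day[date_key] += qty

# Not meaningful: summing one product's balance across dates (10 + 8 is not a stock level).
# Across time, use an average or the closing balance instead.
tv_balances = [qty for _date, product, qty in sorted(inventory_balance) if product == "TV"]
tv_average = sum(tv_balances) / len(tv_balances)
tv_closing = tv_balances[-1]
```

A fully additive column such as QUANTITY_SOLD could be summed along the date dimension as well; a balance cannot, which is what makes it semi-additive.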

FACT Table Types: Factless FACT Table

PRODUCT_SALES_FACT: GEOGRAPHY_KEY, PRODUCT_KEY, DATE_KEY, CUSTOMER_KEY (no measure columns)
GEOGRAPHY_DIM: GEOGRAPHY_KEY
TIME_DIM: DATE_KEY
PRODUCT_DIM: PRODUCT_KEY
CUSTOMER_DIM: CUSTOMER_KEY
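
A factless fact table carries only dimension keys, so analysis is done by counting rows. A minimal sketch, with made-up key values:

```python
from collections import Counter

# Hypothetical factless fact rows: only dimension keys, no numeric measure.
# Layout: (geography_key, product_key, date_key, customer_key)
product_sales_fact = [
    ("G1", "P1", "2012-11-01", "C1"),
    ("G1", "P2", "2012-11-01", "C2"),
    ("G2", "P1", "2012-11-02", "C1"),
]

# The "measure" is the number of rows: how many sales events occurred per product.
events_per_product = Counter(
    product for _geo, product, _date, _cust in product_sales_fact
)
```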

Step 5: Storing pre-calculations in FACT tables

Once the facts have been selected, each should be re-examined to determine whether
there are opportunities to use pre-calculations.
For example, profit or loss amounts.
These are useful facts, since they are additive quantities from which we can derive
valuable information.
Pre-calculation is important for a value that is fundamental to an enterprise, or if
there is any chance of a user calculating the value incorrectly.
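
Storing a pre-calculated value could look like the sketch below, where profit is derived once at load time so that no user has to recompute it. The `SalesFact` type, the `load_fact` helper and the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesFact:
    revenue: float
    cost: float
    profit: float  # pre-calculated once, at load time


def load_fact(revenue: float, cost: float) -> SalesFact:
    # The derivation lives in one place, so every query sees the same definition.
    return SalesFact(revenue, cost, revenue - cost)


facts = [load_fact(120.0, 100.0), load_fact(80.0, 95.0)]

# Because profit is additive, it can simply be summed across fact rows.
total_profit = sum(f.profit for f in facts)
```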

Step 6: Rounding out the dimension tables

This step involves adding as many text / descriptive columns to the dimension table as
possible.
Text descriptions should be as intuitive and understandable to the users as possible.

Step 7: Choosing the duration of database

Decides the time period for which records are to be maintained in the table.
For example, for some companies (e.g. insurance companies) there may be a legal
requirement to retain data extending back five or more years.
Very large fact tables raise at least two very significant data warehouse design issues:
  The older the data, the more likely there will be problems in reading and interpreting the old files.
  It is mandatory that the old versions of the important dimensions be used, not the most current versions.

Step 8: Tracking slowly changing dimensions

The changing dimension problem means that the proper description of the old client and
the old branch must be used with the old transaction data.
Usually, the data warehouse must assign a generalized key to these important dimensions
in order to distinguish multiple snapshots of clients and branches over a period of time.
There are different types of changes in dimensions:
  A dimension attribute is overwritten.
  A dimension attribute causes a new dimension record to be created, etc.

Three types of Slowly Changing Dimensions (SCD)

Type 1: Where a changed dimension attribute is overwritten
Type 2: Where a changed dimension attribute causes a new dimension record to be created
Type 3: Where a changed dimension attribute causes an alternate attribute to be created
so that both the old and new values of the attribute are simultaneously accessible (in
the same dimension)
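
The three SCD types can be contrasted with a minimal in-memory sketch. The row layout, the `current` flag used for Type 2 and the `previous_city` column used for Type 3 are assumptions chosen for illustration; real implementations often add effective dates as well.

```python
from copy import deepcopy

# Hypothetical client dimension; 'client_key' is the generalized (surrogate) key.
original = [{"client_key": 1, "client_id": "C42", "city": "Pune", "current": True}]


def scd_type1(dim, client_id, new_city):
    """Type 1: overwrite the attribute; the old value is lost."""
    for row in dim:
        if row["client_id"] == client_id:
            row["city"] = new_city
    return dim


def scd_type2(dim, client_id, new_city):
    """Type 2: expire the current row and add a new row with a new key."""
    next_key = max(row["client_key"] for row in dim) + 1
    for row in dim:
        if row["client_id"] == client_id and row["current"]:
            row["current"] = False
    dim.append({"client_key": next_key, "client_id": client_id,
                "city": new_city, "current": True})
    return dim


def scd_type3(dim, client_id, new_city):
    """Type 3: keep the old value in an alternate attribute on the same row."""
    for row in dim:
        if row["client_id"] == client_id:
            row["previous_city"] = row["city"]
            row["city"] = new_city
    return dim


t1 = scd_type1(deepcopy(original), "C42", "Delhi")
t2 = scd_type2(deepcopy(original), "C42", "Delhi")
t3 = scd_type3(deepcopy(original), "C42", "Delhi")
```

With Type 2, old fact rows keep pointing at `client_key` 1 (the old description) while new facts use `client_key` 2, which is how the old data stays paired with the old description.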

Step 9: Deciding the query priorities and query modes

In this step we consider physical design issues:
  The presence of pre-stored summaries and aggregates
  Indices
  Materialized views
  Security issues
  Backup issues
  Archive issues

On-line Analytical Processing (OLAP)

The dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data.
Provides a multi-dimensional view of aggregate data, giving quick access to strategic
information for advanced analysis.
Users gain a deeper understanding of their corporate data through fast, consistent,
interactive access to a wide variety of views of the data.
OLAP can easily answer "who?" and "what?" questions; the ability to answer "what if?"
and "why?" questions distinguishes OLAP from general-purpose query tools.
Types of analysis include:
  basic navigation/browsing
  calculations
  more complex analyses such as time series and complex modeling.

OLAP Server Architectures

Relational OLAP (ROLAP)
  Extended relational DBMS that maps operations on multidimensional data to standard relational operations
  Stores all information, including fact tables, as relations

Multidimensional OLAP (MOLAP)
  Special-purpose server that directly implements multidimensional data and operations
  Stores multidimensional datasets as arrays
  Fast indexing to pre-summarized data

Hybrid OLAP (HOLAP)
  Gives users/system administrators freedom to select different partitions

Typical OLAP Operations

Roll up (drill-up): summarize data
  By climbing up a hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up
  From higher level summary to lower level summary or detailed data, or introducing new dimensions

Slice and dice:
  Project and select

Pivot (rotate):
  Reorient the cube, visualization, 3D to series of 2D planes.
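
These operations can be illustrated on a tiny cube held in a Python dictionary. The cell values, dimension names and helper functions below are all made up for this sketch; an OLAP server would implement the same ideas far more efficiently.

```python
from collections import defaultdict

# Hypothetical sales cells: (product, quarter, country) -> units sold.
cube = {
    ("TV", "Q1", "USA"): 10, ("TV", "Q2", "USA"): 20,
    ("PC", "Q1", "USA"): 5,  ("TV", "Q1", "Canada"): 7,
    ("PC", "Q2", "Canada"): 3,
}
DIMS = ("product", "quarter", "country")


def roll_up(cells, keep):
    """Roll up by dimension reduction: sum over every dimension not in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


def slice_(cells, dim, value):
    """Slice: select a single value on one dimension."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}


def dice(cells, **allowed):
    """Dice: select a sub-cube by restricting several dimensions at once."""
    return {k: v for k, v in cells.items()
            if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())}


def pivot(table_2d):
    """Pivot: swap the two axes of a two-dimensional view."""
    return {(b, a): v for (a, b), v in table_2d.items()}


by_product_country = roll_up(cube, ("product", "country"))
usa_only = slice_(cube, "country", "USA")
tv_q1 = dice(cube, product={"TV"}, quarter={"Q1"})
by_country_product = pivot(by_product_country)
```

Drill-down is the reverse direction: moving from `by_product_country` back to the full `cube` reintroduces the quarter dimension.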

Multi-Dimensional Analysis
Sales volume as a function of product, month, and region

[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum) and Country (U.S.A, Canada, Mexico, sum); the highlighted cell is the total annual sales of TV in U.S.A.]
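
The "sum" members of such a cube could be computed as in the sketch below: every base cell contributes to each combination of "keep this coordinate" or "replace it with sum". The sales figures are invented for illustration and `with_totals` is a hypothetical helper.

```python
from itertools import product as cartesian

# Hypothetical base cells: (product, quarter, country) -> sales volume.
sales = {
    ("TV", "1Qtr", "U.S.A"): 100, ("TV", "2Qtr", "U.S.A"): 120,
    ("TV", "3Qtr", "U.S.A"): 90,  ("TV", "4Qtr", "U.S.A"): 150,
    ("PC", "1Qtr", "U.S.A"): 40,  ("TV", "1Qtr", "Canada"): 30,
}


def with_totals(cells):
    """Add a 'sum' member on every dimension of the cube."""
    out = {}
    for key, value in cells.items():
        # Each base cell feeds 2**3 cells: keep or replace each coordinate by 'sum'.
        for mask in cartesian((False, True), repeat=len(key)):
            agg_key = tuple("sum" if m else k for k, m in zip(key, mask))
            out[agg_key] = out.get(agg_key, 0) + value
    return out


cube = with_totals(sales)
tv_usa_annual = cube[("TV", "sum", "U.S.A")]   # total annual sales of TV in U.S.A.
grand_total = cube[("sum", "sum", "sum")]
```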

Benefits of OLAP

Increased productivity of end-users.
Retention of organizational control over the integrity of corporate data.
Reduced query drag and network traffic on OLTP systems or on the data warehouse.
Improved potential revenue and profitability.
Potentially reduced backlog of applications development for IT staff.
