
UNIT I

Data Warehousing

Dr Harleen Kaur
Databases
 Databases are developed on the IDEA that
DATA is one of the critical materials of the
Information Age
 Information, which is created by data,
becomes the basis for decision making
Decision Support Systems
 Created to facilitate the decision making
process
 So much information that it is difficult to
extract it all from a traditional database
 Need for a more comprehensive data
storage facility
– Data Warehouse
Decision Support Systems
 Extract Information from data to use as the basis
for decision making
 Used at all levels of the Organization
 Tailored to specific business areas
 Interactive
 Ad Hoc queries to retrieve and display information
 Combines historical operation data with business
activities
4 Components of DSS
 Data Store – The DSS Database
– Business Data
– Business Model Data
– Internal and External Data
 Data Extraction and Filtering
– Extract and validate data from the operational
database and the external data sources
4 Components of DSS
 End-User Query Tool
– Create Queries that access either the
Operational or the DSS database
 End User Presentation Tools
– Organize and Present the Data
Differences with DSS
 Operational
– Stored in Normalized Relational Database
– Support transactions that represent daily
operations (Not Query Friendly)
 3 Main Differences
– Time Span
– Granularity
– Dimensionality
Time Span
 Operational
– Real Time
– Current Transactions
– Short Time Frame
– Specific Data Facts
 DSS
– Historic
– Long Time Frame (Months/Quarters/Years)
– Patterns
Granularity
 Operational
– Specific Transactions that occur at a given time
 DSS
– Shown at different levels of aggregation
– Different Summary Levels
– Decompose (drill down)
– Summarize (roll up)
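The drill-down and roll-up operations above can be sketched in a few lines of Python. This is only an illustration: the `transactions` rows, the `roll_up` helper, and the year and quarter labels are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical rows: (year, quarter, month, units)
transactions = [
    (1998, "Q1", "Jan", 14),
    (1998, "Q1", "Feb", 41),
    (1998, "Q1", "Mar", 33),
    (1998, "Q2", "Apr", 25),
]

def roll_up(rows, level):
    """Summarize units at a time level: 1 = year, 2 = quarter, 3 = month."""
    totals = defaultdict(int)
    for row in rows:
        key = row[:level]      # keep only the leading time attributes
        totals[key] += row[3]  # units is the last field
    return dict(totals)

by_year = roll_up(transactions, 1)     # most summarized view
by_quarter = roll_up(transactions, 2)  # drill down one level
by_month = roll_up(transactions, 3)    # drill down to the finest level here
```

Moving from `by_year` to `by_month` is a drill down; moving the other way is a roll up.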
Dimensionality
 Most distinguishing characteristic of DSS
data
 Operational
– Represents atomic transactions
 DSS
– Data is related in Many ways
– Develop the larger picture
– Multi-dimensional view of data
DSS Database Requirements
 DSS Database Schema
– Support Complex and Non-Normalized data
 Summarized and Aggregate data
 Multiple Relationships
 Queries must extract multi-dimensional time slices
 Redundant Data
DSS Database Requirements
 Data Extraction and Filtering
– DSS databases are created mainly by extracting data
from operational databases combined with data
imported from external sources
 Need for advanced data extraction & filtering tools
 Allow batch / scheduled data extraction
 Support different types of data sources
 Check for inconsistent data / data validation rules
 Support advanced data integration / data formatting conflicts
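A batch extraction step with validation rules might look roughly like the sketch below. The source names, record fields, and the two rules in `validate` are hypothetical; real extraction tools apply far richer rule sets.

```python
def validate(record):
    """Return a list of rule violations for one extracted record."""
    problems = []
    if not record.get("city"):
        problems.append("missing city")
    if record.get("units", 0) < 0:
        problems.append("negative units")
    return problems

def extract_and_filter(sources):
    """Merge records from several sources, keeping only the valid ones."""
    clean, rejected = [], []
    for source_name, records in sources.items():
        for record in records:
            problems = validate(record)
            if problems:
                rejected.append((source_name, record, problems))
            else:
                clean.append({**record, "source": source_name})
    return clean, rejected

# Hypothetical operational and external sources
sources = {
    "operational_db": [{"city": "Mumbai", "units": 3}, {"city": "", "units": 4}],
    "external_feed": [{"city": "Pune", "units": -1}, {"city": "Pune", "units": 7}],
}
clean, rejected = extract_and_filter(sources)
```

Rejected records are kept with their violations so that they can be inspected rather than silently dropped.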
DSS Database Requirements
 End User Analytical Interface
– Must support advanced data modeling and data
presentation tools
– Data analysis tools
– Query generation
– Must Allow the User to Navigate through the DSS
 Size Requirements
– VERY Large – Terabytes
– Advanced Hardware (Multiple processors, multiple disk
arrays, etc.)
Inmon’s definition

A data warehouse is a
-subject-oriented,
-integrated,
-time-variant,
-nonvolatile
collection of data in support of management’s
decision making process.
Data Warehouse
 The DSS-friendly data repository for the DSS
is the DATA WAREHOUSE
 Definition: Integrated, Subject-Oriented,
Time-Variant, Nonvolatile database that
provides support for decision making
Subject-oriented

 Data warehouse is organized around subjects
such as sales, product, customer.
 It focuses on modeling and analysis of data
for decision makers.
 Excludes data not useful in decision support
process.
Integration
 Data Warehouse is constructed by integrating
multiple heterogeneous sources.
 Data preprocessing is applied to ensure
consistency.
[Figure: data from RDBMS, legacy system, and flat file sources passes through data processing and data transformation into the data warehouse]
Integrated
 The data warehouse is a centralized,
consolidated database that integrates data
derived from the entire organization
– Multiple Sources
– Diverse Sources
– Diverse Formats
Integration
 Sources must be made consistent in terms of their data:
– Encoding structures
– Measurement of attributes
– Physical attributes of data
– Naming conventions
– Data type formats
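As a hedged illustration of these integration conflicts, the sketch below maps two source records onto one convention. The field names, the gender encodings, and the choice of centimetres as the warehouse unit are assumptions for the example only.

```python
# Hypothetical source records with conflicting conventions
source_a = {"cust_sex": "m", "height_cm": 180.0}   # naming + encoding style A
source_b = {"gender": 1, "height_in": 70.0}        # naming + encoding style B

GENDER_CODES = {"m": "M", "f": "F", 1: "M", 0: "F"}  # assumed encodings
INCH_TO_CM = 2.54

def integrate(record):
    """Map one source record onto a single warehouse convention."""
    # Naming conflict: the same attribute appears under two names
    raw_gender = record.get("cust_sex", record.get("gender"))
    # Measurement conflict: convert inches to centimetres when needed
    if "height_cm" in record:
        height_cm = record["height_cm"]
    else:
        height_cm = record["height_in"] * INCH_TO_CM
    return {"gender": GENDER_CODES[raw_gender], "height_cm": round(height_cm, 1)}

integrated = [integrate(source_a), integrate(source_b)]
```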


Time-variant

 Provides information from a historical
perspective, e.g. the past 5-10 years
 Every key structure contains either implicitly
or explicitly an element of time
Time-Variant
 The Data Warehouse represents the flow of
data through time
 Can contain projected data from statistical
models
 Data is periodically uploaded, then time-dependent
data is recomputed
Nonvolatile

 Data once recorded cannot be updated.
 Data warehouse requires two operations in
data accessing
– Initial loading of data
– Access of data
[Figure: data is loaded into the warehouse, then accessed]
Nonvolatile
 Once data is entered it is NEVER removed
 Represents the company’s entire history
– Near term history is continually added to it
– Always growing
– Must support terabyte databases and
multiprocessors
 Read-Only database for data analysis and
query processing
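The time-variant and nonvolatile properties can be sketched with Python's built-in `sqlite3` module. The `sales_history` table, its columns, the `load_snapshot` helper, and the dates are assumptions for illustration; the point is that each load only appends rows carrying an explicit time element.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_history (load_date TEXT, city TEXT, units INTEGER)")

def load_snapshot(load_date, rows):
    """Append one periodic batch; existing rows are never updated or deleted."""
    conn.executemany(
        "INSERT INTO sales_history VALUES (?, ?, ?)",
        [(load_date, city, units) for city, units in rows],
    )

# Hypothetical monthly loads
load_snapshot("1998-01-31", [("Mumbai", 7), ("Pune", 7)])
load_snapshot("1998-02-28", [("Mumbai", 32), ("Pune", 9)])

# Access is read-only: history stays queryable through its time element
history = conn.execute(
    "SELECT load_date, SUM(units) FROM sales_history "
    "GROUP BY load_date ORDER BY load_date"
).fetchall()
```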
Need for Data Warehousing

 Industry has huge amounts of operational data
 Knowledge workers want to turn this data into
useful information.
 They use this information to support
strategic decision making.
Need for Data Warehousing (contd..)

 It is a platform for consolidated historical data for
analysis.
 It stores data of good quality so that knowledge
workers can make correct decisions.
Need for Data Warehousing (contd..)

 From a business perspective
– it is the latest marketing weapon
– it helps to keep customers by learning more about
their needs
– it is a valuable tool in today’s competitive, fast-evolving
world
Data Warehousing Tools

 Data Warehouse
– SQL Server 2000 DTS
– Oracle 8i Warehouse Builder
 OLAP tools
– SQL Server Analysis Services
– Oracle Express Server
 Reporting tools
– MS Excel Pivot Chart
– VB Applications
Data Warehouse Implementation
 An Active Decision Support Framework
– Not a Static Database
– Always a Work in Progress
– Complete Infrastructure for Company-Wide
decision support
– Hardware / Software / People / Procedures /
Data
– Data Warehouse is a critical component of the
Modern DSS – But not the Only critical
component
Operational v/s Information System
Features          Operational                          Information
Characteristics   Operational processing               Informational processing
Orientation       Transaction                          Analysis
User              Clerk, DBA, database professional    Knowledge workers
Function          Day-to-day operation                 Decision support
Data              Current                              Historical
View              Detailed, flat relational            Summarized, multidimensional
DB design         Application oriented                 Subject oriented
Unit of work      Short, simple transaction            Complex query
Access            Read/write                           Mostly read
Operational v/s Information System

Features                     Operational                            Information
Focus                        Data in                                Information out
Number of records accessed   Tens                                   Millions
Number of users              Thousands                              Hundreds
DB size                      100 MB to GB                           100 GB to TB
Priority                     High performance, high availability    High flexibility, end-user autonomy
Metric                       Transaction throughput                 Query throughput
Data Marts
 Small Data Stores
 More manageable data sets
 Targeted to meet the needs of small groups
within the organization
 Small, Single-Subject data warehouse
subset that provides decision support to a
small group of people
OLAP
 Online Analytical Processing Tools
 DSS tools that use multidimensional data
analysis techniques
– Support for a DSS data store
– Data extraction and integration filter
– Specialized presentation interface
12 Rules of a Data Warehouse
 Data Warehouse and Operational
Environments are Separated
 Data is integrated
 Contains historical data over a long period
of time
 Data is snapshot data captured at a given
point in time
 Data is subject-oriented
12 Rules of a Data Warehouse
 Mainly read-only with periodic batch
updates
 Development Life Cycle has a data-driven
approach versus the traditional process-driven
approach
 Data contains several levels of detail
– Current, Old, Lightly Summarized, Highly
Summarized
12 Rules of a Data Warehouse
 Environment is characterized by Read-only
transactions to very large data sets
 System that traces data sources, transformations,
and storage
 Metadata is a critical component
– Source, transformation, integration, storage,
relationships, history, etc
 Contains a chargeback mechanism for resource
usage that enforces optimal use of data by end
users
Multidimensional Data Analysis
Techniques
 Advanced Data Presentation Functions
– 3-D graphics, Pivot Tables, Crosstabs, etc.
– Compatible with Spreadsheets & Statistical
packages
– Advanced data aggregations, consolidation and
classification across time dimensions
– Advanced computational functions
– Advanced data modeling functions
UNIT II

Data Warehousing Architecture

Dr Harleen Kaur
 Data warehouses and their architectures vary
depending upon the specifics of an organization's
situation. Three common architectures are:
 Data Warehouse Architecture (Basic)
 Data Warehouse Architecture (with a Staging Area)
 Data Warehouse Architecture (with a Staging Area and
Data Marts)
Data Warehouse Architecture
(Basic)
 Figure shows a simple architecture for a
data warehouse. End users directly access
data derived from several source systems
through the data warehouse.

 Summaries are very valuable in data
warehouses because they pre-compute
long operations in advance. For example, a
typical data warehouse query is to retrieve
something like August sales.
 A summary in Oracle is called a
materialized view.
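The idea of pre-computing a summary can be sketched with Python's `sqlite3` module. SQLite has no materialized views, so a plain summary table built with `CREATE TABLE ... AS` stands in for one here; the table names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, product TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("July", "Cheese", 5),
    ("August", "Cheese", 6),
    ("August", "Swiss Rolls", 9),
])

# Pre-compute the aggregate once, at load time
conn.execute("""
    CREATE TABLE monthly_sales_summary AS
    SELECT month, SUM(units) AS total_units FROM sales GROUP BY month
""")

# The "August sales" query now reads a single pre-computed row
august_units = conn.execute(
    "SELECT total_units FROM monthly_sales_summary WHERE month = ?", ("August",)
).fetchone()[0]
```

Unlike a real materialized view, this summary table is not refreshed automatically; it would have to be rebuilt after each load.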
Data Warehouse Architecture
(with a Staging Area)
 You need to clean and process
your operational data before putting it into
the warehouse. You can do this
programmatically, although most data
warehouses use a staging area instead.
 A staging area simplifies building
summaries and general warehouse
management. Figure illustrates this typical
architecture.
Architecture of a Data Warehouse
with a Staging Area

Data Warehouse Architecture
(with a Staging Area and Data
Marts)
 Although the architecture in Figure is quite
common, you may want to customize your
warehouse's architecture for different groups
within your organization. You can do this by
adding data marts, which are systems designed
for a particular line of business.
 The figure illustrates an example where purchasing,
sales, and inventories are separated. In this example, a
financial analyst might want to analyze historical
data for purchases and sales.
Data Warehousing Architecture
[Figure: data sources (operational DBs, external sources) are extracted, transformed, loaded, and refreshed into the data warehouse of reconciled data and its data marts, supported by a metadata repository and monitoring & administration; OLAP servers serve the tools for analysis, query/reporting, and data mining]
Data Warehouse Architecture
 Data Warehouse server
– almost always a relational DBMS, rarely flat
files
 OLAP servers
– to support and operate on multi-
dimensional data structures
 Clients
– Query and reporting tools
– Analysis tools
– Data mining tools
Data Warehousing Schemas

 A schema is a collection of database
objects, including tables, views,
indexes, and synonyms.
 You can arrange schema objects in the
schema models designed for data
warehousing in a variety of ways.
Multidimensional Data Schema
Support
 Decision Support Data tends to be
– Nonnormalized
– Duplicated
– Preaggregated
 Star Schema
– Special Design technique for multidimensional
data representations
– Optimize data query operations instead of data
update operations
Data Warehouse Schema
 Star Schema
 Fact Constellation Schema
 Snowflake Schema
Star Schema
 The star schema is the simplest data warehouse schema.
 A single, large, central fact table and one table for each
dimension.
 Every fact points to one tuple in each of the dimensions
and has additional attributes.
 Does not capture hierarchies directly.
 It is called a star schema because the diagram resembles
a star, with points radiating from a center. The center of the
star consists of one or more fact tables and the points of
the star are the dimension tables.
 The most natural way to model a data
warehouse is as a star schema: only one
join establishes the relationship between the
fact table and any one of the dimension
tables.
 A star schema optimizes performance by
keeping queries simple and providing fast
response time. All the information about
each level is stored in one row.
Star Schema
[Figure: a common example of a sales fact table with the dimension tables customers, products, promotions, times, and channels]
Star Schema
 4 Components
– Facts
– Dimensions
– Attributes
– Attribute Hierarchies
Facts
 Numeric measurements (values) that represent a
specific business aspect or activity
 Stored in a fact table at the center of the star
schema
 Contains facts that are linked through their
dimensions
 Can be computed or derived at run time
 Updated periodically with data from operational
databases
Dimensions
 Qualifying characteristics that provide
additional perspectives to a given fact
– DSS data is almost always viewed in relation to
other data
 Dimensions are normally stored in
dimension tables
Attributes
 Dimension Tables contain Attributes
 Attributes are used to search, filter, or classify
facts
 Dimensions provide descriptive characteristics
about the facts through their attributes
 Must define common business attributes that will
be used to narrow a search, group information, or
describe dimensions. (ex.: Time / Location /
Product)
 No mathematical limit to the number of dimensions
(3-D makes it easy to model)
Attribute Hierarchies
 Provides a Top-Down data organization
– Aggregation
– Drill-down / Roll-Up data analysis
 Attributes from different dimensions can be
grouped to form a hierarchy
Star Schema for Sales
[Figure: dimension tables surrounding the central fact table]
Star Schema Representation
 Fact and Dimensions are represented by physical
tables in the data warehouse database
 Fact tables are related to each dimension table in a
Many to One relationship (Primary/Foreign Key
Relationships)
 Fact Table is related to many dimension tables
– The primary key of the fact table is a composite primary key
from the dimension tables
 Each fact table is designed to answer a specific DSS
question
Star Schema
 The fact table is always the largest table in
the star schema
 Each dimension record is related to
thousands of fact records
 Star Schema facilitates data retrieval
functions
 DBMS first searches the Dimension Tables
before the larger fact table
Star Schema (contd..)
Fact Table: Store Key, Product Key, Period Key, Units, Price
Store Dimension: Store Key, Store Name, City, State, Region
Time Dimension: Period Key, Year, Quarter, Month
Product Dimension: Product Key, Product Desc

Benefits: Easy to understand, easy to define hierarchies,
reduces the number of physical joins.
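The star schema above can be sketched with Python's `sqlite3` module. The table and column names follow the slide; the snake_case spelling, the sample rows, and the store name are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT,
                          city TEXT, state TEXT, region TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE time_dim    (period_key INTEGER PRIMARY KEY, year INTEGER,
                          quarter TEXT, month TEXT);
-- The fact table's primary key is a composite of the dimension keys
CREATE TABLE sales_fact (
    store_key   INTEGER REFERENCES store_dim(store_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    period_key  INTEGER REFERENCES time_dim(period_key),
    units INTEGER, price REAL,
    PRIMARY KEY (store_key, product_key, period_key)
);
-- Hypothetical sample rows
INSERT INTO store_dim VALUES (1, 'Store A', 'Mumbai', 'Maharashtra', 'West');
INSERT INTO product_dim VALUES (10, 'Cheese'), (11, 'Swiss Rolls');
INSERT INTO time_dim VALUES (100, 1998, 'Q1', 'January');
INSERT INTO sales_fact VALUES (1, 10, 100, 3, 7.95), (1, 11, 100, 4, 7.32);
""")

# One join per dimension reaches every descriptive attribute
rows = conn.execute("""
    SELECT s.region, t.month, SUM(f.units)
    FROM sales_fact f
    JOIN store_dim s ON s.store_key = f.store_key
    JOIN time_dim  t ON t.period_key = f.period_key
    GROUP BY s.region, t.month
""").fetchall()
```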
SnowFlake Schema
 Variant of star schema model.
 A single, large, central fact table and one
or more tables for each dimension.
 Dimension tables are normalized i.e. split
dimension table data into additional tables
SnowFlake Schema (contd..)
Fact Table: Store Key, Product Key, Period Key, Units, Price
Store Dimension: Store Key, Store Name, City Key
City Dimension: City Key, City, State, Region
Time Dimension: Period Key, Year, Quarter, Month
Product Dimension: Product Key, Product Desc

Drawbacks: Time-consuming joins, slow report generation
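The extra join that normalization introduces can be seen in a small `sqlite3` sketch. Only the store and city dimensions are modeled here, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- City attributes are split out of the store dimension
CREATE TABLE city_dim  (city_key INTEGER PRIMARY KEY, city TEXT,
                        state TEXT, region TEXT);
CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_name TEXT,
                        city_key INTEGER REFERENCES city_dim(city_key));
CREATE TABLE sales_fact (store_key INTEGER REFERENCES store_dim(store_key),
                         units INTEGER);
INSERT INTO city_dim VALUES (1, 'Mumbai', 'Maharashtra', 'West');
INSERT INTO store_dim VALUES (1, 'Store A', 1), (2, 'Store B', 1);
INSERT INTO sales_fact VALUES (1, 3), (2, 4);
""")

# Reaching "region" now takes two joins instead of one
units_by_region = conn.execute("""
    SELECT c.region, SUM(f.units)
    FROM sales_fact f
    JOIN store_dim s ON s.store_key = f.store_key
    JOIN city_dim  c ON c.city_key  = s.city_key
    GROUP BY c.region
""").fetchall()
```

The city attributes are stored once instead of once per store, at the cost of the additional join noted under the drawbacks.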
Fact Constellation

 Multiple fact tables share dimension tables.
 This schema is viewed as a collection of stars,
hence called galaxy schema or fact
constellation.
 Sophisticated applications require such a
schema.
Fact Constellation (contd..)
Sales Fact Table: Store Key, Product Key, Period Key, Units, Price
Shipping Fact Table: Shipper Key, Store Key, Product Key, Period Key, Units, Price
Product Dimension: Product Key, Product Desc
Store Dimension: Store Key, Store Name, City, State, Region
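A fact constellation can be sketched the same way: two fact tables that reference one shared dimension. Only the product dimension is modeled, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key), units INTEGER);
CREATE TABLE shipping_fact (
    shipper_key INTEGER,
    product_key INTEGER REFERENCES product_dim(product_key), units INTEGER);
INSERT INTO product_dim VALUES (10, 'Cheese');
INSERT INTO sales_fact VALUES (10, 6), (10, 16);
INSERT INTO shipping_fact VALUES (1, 10, 20);
""")

# Aggregate each fact table separately, then meet at the shared dimension
sold_vs_shipped = conn.execute("""
    SELECT p.product_desc,
           (SELECT SUM(units) FROM sales_fact    WHERE product_key = p.product_key),
           (SELECT SUM(units) FROM shipping_fact WHERE product_key = p.product_key)
    FROM product_dim p
""").fetchall()
```

Aggregating each fact table before combining avoids the row multiplication that a direct join between the two fact tables would cause.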
Building Data Warehouse
 Data Selection
 Data Preprocessing
– Fill missing values
– Remove inconsistency
 Data Transformation & Integration
 Data Loading
Data in the warehouse is stored in the form of fact tables
and dimension tables.
Case Study
 Afco Foods & Beverages is a new company
which produces dairy, bread and meat products
with its production unit located at Baroda.
 Their products are sold in the North, North West
and Western regions of India.
 They have sales units at Mumbai, Pune,
Ahmedabad, Delhi and Baroda.
 The President of the company wants sales
information.
Sales Information

Report: The number of units sold.

113

Report: The number of units sold over time

January  February  March  April
14       41        33     25
Sales Information
Report: The number of items sold for each product over time

Product       Jan  Feb  Mar  Apr
Wheat Bread     -    -    6   17
Cheese          6   16    6    8
Swiss Rolls     8   25   21    -
Sales Information
Report: The number of items sold in each city for each product over time

City    Product       Jan  Feb  Mar  Apr
Mumbai  Wheat Bread     -    -    3   10
Mumbai  Cheese          3   16    6    -
Mumbai  Swiss Rolls     4   16    6    -
Pune    Wheat Bread     -    -    3    7
Pune    Cheese          3    -    -    8
Pune    Swiss Rolls     4    9   15    -
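The successive reports are roll-ups of one another, which a short Python sketch can check. The month placement of the empty cells is an assumption: it is the only placement consistent with the monthly totals of 14, 41, 33, and 25 and the overall total of 113.

```python
from collections import defaultdict

# (city, product) -> units for Jan, Feb, Mar, Apr; 0 marks an empty cell
city_sales = {
    ("Mumbai", "Wheat Bread"): [0, 0, 3, 10],
    ("Mumbai", "Cheese"):      [3, 16, 6, 0],
    ("Mumbai", "Swiss Rolls"): [4, 16, 6, 0],
    ("Pune", "Wheat Bread"):   [0, 0, 3, 7],
    ("Pune", "Cheese"):        [3, 0, 0, 8],
    ("Pune", "Swiss Rolls"):   [4, 9, 15, 0],
}

# Roll up over the city dimension: units per product per month
by_product = defaultdict(lambda: [0, 0, 0, 0])
for (city, product), units in city_sales.items():
    for i, u in enumerate(units):
        by_product[product][i] += u

# Roll up over the product dimension: units per month
by_month = [sum(row[i] for row in by_product.values()) for i in range(4)]

# Roll up over time: the single grand total
total_units = sum(by_month)
```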
Sales Information

Report: The number of items sold (U) and income (Rs) in each region for
each product over time.

                       Jan          Feb          Mar          Apr
City    Product        Rs     U     Rs     U     Rs     U     Rs     U
Mumbai  Wheat Bread     -     -      -     -    7.44    3   24.80   10
Mumbai  Cheese        7.95    3   42.40   16   15.90    6      -     -
Mumbai  Swiss Rolls   7.32    4   29.98   16   10.98    6      -     -
Pune    Wheat Bread     -     -      -     -    7.44    3   17.36    7
Pune    Cheese        7.95    3      -     -      -     -   21.20    8
Pune    Swiss Rolls   7.32    4   16.47    9   27.45   15      -     -
Sales Measures & Dimensions

 Measures – Units sold, Amount.
 Dimensions – Product, Time, Region.
Sales Data Warehouse Model
Fact Table

City    Product      Month     Units  Rupees
Mumbai  Wheat Bread  January       3    7.95
Mumbai  Cheese       January       4    7.32
Pune    Wheat Bread  January       3    7.95
Pune    Cheese       January       4    7.32
Mumbai  Swiss Rolls  February     16   42.40
Sales Data Warehouse Model

City_ID  Prod_ID  Month     Units  Rupees
1        589      1/1/1998      3    7.95
1        1218     1/1/1998      4    7.32
2        589      1/1/1998      3    7.95
2        1218     1/1/1998      4    7.32
1        589      2/1/1998     16   42.40
Sales Data Warehouse Model

Product Dimension Tables

Prod_ID  Product_Name     Product_Category_ID
589      Wheat Bread      1
590      White Bread      1
288      Coconut Cookies  2

Product_Category_ID  Product_Category
1                    Bread
2                    Cookies
Sales Data Warehouse Model

Region Dimension Table

City_ID  City    Region     Country
1        Mumbai  West       India
2        Pune    NorthWest  India
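As a rough sketch, the keyed fact table and the region dimension can be loaded into SQLite and queried by region. The snake_case names and the ISO date strings are assumptions (the source dates are read as 1 January and 1 February 1998); the row values are copied from the tables as given.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region_dim (city_id INTEGER PRIMARY KEY, city TEXT,
                         region TEXT, country TEXT);
CREATE TABLE sales_fact (city_id INTEGER REFERENCES region_dim(city_id),
                         prod_id INTEGER, month TEXT,
                         units INTEGER, rupees REAL);
INSERT INTO region_dim VALUES
    (1, 'Mumbai', 'West', 'India'),
    (2, 'Pune', 'NorthWest', 'India');
INSERT INTO sales_fact VALUES
    (1, 589,  '1998-01-01', 3, 7.95),
    (1, 1218, '1998-01-01', 4, 7.32),
    (2, 589,  '1998-01-01', 3, 7.95),
    (2, 1218, '1998-01-01', 4, 7.32),
    (1, 589,  '1998-02-01', 16, 42.40);
""")

# The fact table stores only keys and measures; the join supplies the region
units_by_region = conn.execute("""
    SELECT r.region, SUM(f.units)
    FROM sales_fact f
    JOIN region_dim r ON r.city_id = f.city_id
    GROUP BY r.region
    ORDER BY r.region
""").fetchall()
```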
Sales Data Warehouse Model

[Figure: the Sales Fact table linked to the Time, Product, and Region dimensions, with Product linked to Product Category]
THANK YOU
Q/A
