Data Warehouse Concepts by Ramesh

Ramesh Kutumbaka
8/24/2011
OLTP systems are meant for day-to-day business operations, do not maintain historical data, and are highly normalized.
You can query an operational system for information about specific instances of business objects.
For example: you may want just the name and address of a single customer, or you may need to look at a single invoice and the items billed on it.
You cannot expect a single query to run across different databases, internal data, external data, and so on. The reasons are:
A term like "account" may have different meanings in different systems.
The disparate data from the various production systems first has to be standardized, transformed, converted, and integrated.
DW Provides Insight into all Components of Enterprise Business
Contd
This means that there is no conformance of data among the various operational (OLTP) systems of an enterprise.
So, what do we need to do?
Decision Maker
Building DW/DSS/OLAP/IDS is necessary.
We don't need systems that are only pretty good at transaction processing and not pretty good at querying.
Ralph Kimball
A data warehouse is an information delivery system (IDS) for strategic decisions. Basically, it is a decision support system (DSS).
What do we need to do to build the IDS/DSS/DW?
Integrate all the historical data from the various operational systems, combine this internal data with any relevant data from outside sources, and pull them together into the DW.
Resolve any conflicts in the way data resides in the different source systems, and transform, derive, and integrate the data content into a format suitable for providing information to the various categories of users.
Finally, implement the IDS.
[Diagram: source systems SS1 to SS8 feeding the DW]
We need to have different components or building blocks.
These building blocks are arranged together in the most optimal way to serve the intended purpose.
Building blocks are arranged in a suitable Architecture.
Bill Inmon
Bill Inmon is universally recognized as the "father of the data warehouse."
Inmon defined a DW as "a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process".
Ralph Kimball
Ralph is a leading proponent of the dimensional approach to designing large data warehouses.
A Data Warehouse is "a copy of transaction data specifically structured for query and analysis".
This definition provides less insight and depth than Mr. Inmon's, but is no less accurate.
Bill Inmon's paradigm:
Data warehouse is one part of the overall business intelligence system.
An enterprise has one data warehouse, and data marts source their information from the data warehouse.
Ralph Kimball's paradigm:
The data warehouse is the conglomerate of all data marts within the enterprise.
Information is always stored in the dimensional model.
There is no right or wrong between these two ideas, as they represent different data warehousing philosophies.
In reality, the data warehouse in most enterprises is closer to Ralph Kimball's idea.
This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart.
Only when more data marts are built later do they evolve into a data warehouse.
Sean Kelly is another leading data warehousing practitioner.
The data in the Data warehouse is:
Separate
Available
Integrated
Time Stamped
Subject Oriented
Nonvolatile
Accessible
For proper decision making, we need to pull together all the relevant data from the various applications.
The data in the data warehouse comes from several operational systems.
Source data are in different databases, files, and data segments.
These are disparate applications, so the operational platforms and operating systems could be different.
The file layouts, character code representations, and field naming conventions could all be different.
In addition to data from internal operational systems, for many enterprises data from outside sources is likely to be very important, and this is one more variation in the mix of source data for a data warehouse.
[Diagram: source systems SS1 to SS8 feeding the DW]
From these 3 different source systems (Savings Account, Checking Account, Loans Account):
Naming conventions would be different.
Attributes for data items could be different.
The account number in the savings account application could be eight bytes long, but only six bytes in the checking account application.
[Diagram: Integration of different source systems: Savings Account, Checking Account, and Loans Account feed the subject area Account]
Before moving the Data into the data warehouse, you have to go through a process of transformation, consolidation, and integration of the source data.
Example:
Consider a bank that has several branches in several countries and millions of customers, and whose lines of business are savings and loans.
To store data, over the years, many application designers in each branch have made their individual decisions as to how an application and database should be built.
So source systems will differ in naming conventions, variable measurements, encoding structures, and physical attributes of data.
The following example explains how the data is integrated from source systems to target systems.
System Name      Source System 1              Source System 2              Source System 3
Attribute Name   Customer Application Date    Customer Application Date    Application Date
Column Name      CUSTOMER_APPLICATION_DATE    CUST_APPLICATION_DATE        APPLICATION_DATE
Datatype         NUMERIC(8, 0)                DATE                         DATE
Values           11012005                     11012005                     01NOV2005
In the example above, the attribute names, column names, data types, and values differ from one source system to another.
This inconsistency in data can be avoided by integrating the data into a data warehouse with good standards.
Target System    Record #1                    Record #2                    Record #3
Attribute Name   Customer Application Date    Customer Application Date    Customer Application Date
Column Name      CUSTOMER_APPLICATION_DATE    CUSTOMER_APPLICATION_DATE    CUSTOMER_APPLICATION_DATE
Datatype         DATE                         DATE                         DATE
Values           01112005                     01112005                     01112005
In the above example of target data, attribute names, column names, and data types are consistent throughout the target system.
This is how data from various source systems is integrated and accurately stored into the data warehouse.
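The conversion from the three source representations to the single target date can be sketched in a few lines of Python. This is only an illustration: the source-system keys, the function name, and the format strings are assumptions inferred from the sample values (source values read as MMDDYYYY or DDMONYYYY, target values as DDMMYYYY), not something the slides specify.

```python
from datetime import date, datetime

# Hypothetical per-source parsing rules; the format strings are assumptions
# inferred from the sample values in the source and target tables.
SOURCE_FORMATS = {
    "source_system_1": "%m%d%Y",  # NUMERIC(8, 0) holding MMDDYYYY, e.g. 11012005
    "source_system_2": "%m%d%Y",  # DATE rendered as MMDDYYYY, e.g. 11012005
    "source_system_3": "%d%b%Y",  # DATE rendered as DDMONYYYY, e.g. 01NOV2005
}


def to_customer_application_date(source: str, raw) -> date:
    """Convert one source value into the single target DATE representation."""
    text = str(raw).strip()
    if SOURCE_FORMATS[source] == "%m%d%Y":
        # A numeric column loses a leading zero (1012005), so pad back to 8 digits.
        text = text.zfill(8)
    return datetime.strptime(text, SOURCE_FORMATS[source]).date()


# All three source values describe the same day and load as one consistent value.
for source, raw in [
    ("source_system_1", 11012005),
    ("source_system_2", "11012005"),
    ("source_system_3", "01NOV2005"),
]:
    target = to_customer_application_date(source, raw)
    print(source, "->", target.strftime("%d%m%Y"))
```

Keeping the per-source rules in one lookup table makes it easy to add a further source system without touching the conversion logic.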
In Online Transaction Processing (OLTP) systems:
We capture and store the data by individual application.
Example: Order Processing

The application users enter orders, check stock, verify the customer's credit, and assign the order for shipment.
We capture and store the data related to this particular application. Here, we will have data about individual orders, customers, stock status, and detailed transactions, but all of these are structured around the processing of orders.
Operational Applications: Order Processing, Consumer Loans, Account Receivables, Claims Processing, Savings Accounts
Data Warehouse Subjects: Sales, Product, Customer, Account, Claims, Policy
In a data warehouse, data is stored not by operational application, but by business subject.
In OLTP systems, the stored data contains the current values.
For example:
The balance is the current outstanding balance in the customer's account.
The status of an order is the current status of the order.
Of course, in OLTP systems we do store some past transactions, but essentially OLTP systems reflect current information because these systems support day-to-day current operations.
The DW, in contrast, is a time-variant database: it supports the business community in comparing the business across different time periods.
For example, when an analyst in a grocery chain wants to promote two or more products together, that analyst wants sales of the selected products over a number of past quarters.
A data warehouse is a non-volatile database. Once data has entered the data warehouse, it should not change.
Data from the OLTP systems is moved into the DW at specific intervals.
Depending on the business requirements, these data movements take place twice a day, once a week, or once in two weeks.
In fact, in a typical DW, data movements to different data sets may take place at different frequencies.
The changes to the attributes of the product may be moved once a week.
Any revisions to the geographical setup may be moved once a month.
The units of sales may be moved once a day.
[Diagram: OLTP system applications, OLTP databases, loads into the DW; labels: Add, Read]
In an OLTP system, the data is captured at the lowest level of detail. For example:
In an order entry system, the quantity ordered is captured and stored at the unit level of a product per order received from the customer.
Whenever you need summary data, you add up the individual transactions.
If you are looking for units of a product ordered this month, you read all the orders entered for the entire month for that product and add them up.
Note:
We do not keep summary data in the OLTP/operational Systems.
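Adding up detail rows on demand, as described above, can be sketched as follows. The order lines, product names, and function name are made up for illustration.

```python
from datetime import date

# Hypothetical order lines as an order entry system might store them:
# (order_id, product, order_date, units)
order_lines = [
    (1001, "widget", date(2011, 8, 2), 5),
    (1002, "gadget", date(2011, 8, 3), 2),
    (1003, "widget", date(2011, 8, 17), 7),
    (1004, "widget", date(2011, 7, 30), 4),  # previous month, excluded below
]


def units_ordered(lines, product, year, month):
    """Read every order line for the product in the month and add up the units."""
    return sum(
        units
        for _order_id, prod, order_date, units in lines
        if prod == product and order_date.year == year and order_date.month == month
    )


print(units_ordered(order_lines, "widget", 2011, 8))  # 12
```

Every call rereads all the detail rows, which is acceptable for one-off questions but is the cost a DW avoids by storing summary data.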
When a user queries the DW for analysis, he/she usually starts by looking at summary data.
[Diagram: Data Warehouse holding aggregated/summary data and detailed data]
The user may start with the total sale units of a product in an entire region.
Then the user may want to look at the breakdown by states in the region.
The next step may be the examination of sale units by the next level of individual stores.
Frequently, the analysis starts at a high level and moves down to lower levels of detail.
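The region, state, store drill-down path can be sketched with a small roll-up helper. The sample rows, level names, and the rollup function are hypothetical; a real DW would answer these questions from stored summary tables or an OLAP engine.

```python
from collections import defaultdict

# Hypothetical detailed rows: (region, state, store, units_sold)
sales = [
    ("South", "Texas", "Store 1", 120),
    ("South", "Texas", "Store 2", 80),
    ("South", "Florida", "Store 3", 50),
    ("West", "Nevada", "Store 4", 70),
]

LEVELS = ("region", "state", "store")


def rollup(rows, level, **filters):
    """Total units grouped by `level`, restricted to rows matching `filters`."""
    idx = LEVELS.index(level)
    totals = defaultdict(int)
    for row in rows:
        if all(row[LEVELS.index(name)] == value for name, value in filters.items()):
            totals[row[idx]] += row[3]
    return dict(totals)


by_region = rollup(sales, "region")                # start at the summary level
by_state = rollup(sales, "state", region="South")  # break one region down by state
by_store = rollup(sales, "store", region="South", state="Texas")  # then by store
```

Each step keeps the filters chosen so far and moves one level lower, which is the drill-down pattern in miniature.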
There are basically two different approaches for building a DW:
Top-down approach
Bottom-up approach
Top-down approach:
An overall DW feeds dependent data marts.
Data is extracted from the OLTP systems.
Data is transformed, cleaned, integrated, and kept in the DW.
Bottom-up approach:
Departmental or local data marts are built first.
Several departmental or local data marts are combined into a DW.
[Diagram: Top-down approach: disparate source systems, DW, data marts DM1, DM2, DM3. Bottom-up approach: disparate source systems, data marts DM1, DM2, DM3, DW.]
A data warehouse is a relational/multidimensional database that is designed for query and analysis rather than transaction processing.
A data warehouse usually contains historical data that is derived from transaction data.
It separates analysis workload from transaction workload and enables a business to consolidate data from several sources.
In addition to a relational/multidimensional database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
There are three types of data warehouses.
1. Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.
2. ODS (Operational Data Store) - This has a broad enterprise-wide scope, but unlike the real enterprise data warehouse, data is refreshed in near real time and used for routine business activity.
3. Data Mart - A data mart is a subset of the data warehouse, and it supports a particular region, business unit, or business function.
Data warehouses and data marts are built on dimensional data modeling, where fact tables are connected with dimension tables.
This is most useful for users accessing data, since a database can be visualized as a cube of several dimensions.
A data warehouse provides an opportunity for slicing and dicing that cube along each of its dimensions.
A data mart is a subset of the data warehouse that is designed for a particular line of business, such as sales, marketing, or finance.
In a dependent data mart, data can be derived from an enterprise-wide data warehouse.
In an independent data mart, data can be collected directly from sources.
[Diagram: DW with data marts DM1, DM2, DM3, DM4, shown once for each case]
Building Blocks or Components of DW Architecture

[Diagram: components of the DW architecture]
Source Data: Production Data, Internal Data, Archived Data, External Data
Data Staging
Data Storage: DW DBMS, Multi-Dim DBs, Data Marts (DM1, DM2)
Metadata
Information Delivery: Report/Query, OLAP, Data Mining
Architecture is the proper arrangement of the components.
Source data coming to the DW may be grouped into four broad categories, as shown in the previous slide.
External Data
Production Data
Internal Data
Archived Data
Most Executives depend on data from external sources for a high percentage of the information they use.
They use statistics relating to their industry produced by external agencies.
They use market share data of competitors.
They use standard values of financial indicators for their business to check on their performance.
The DW of a car rental company contains data on the current production schedules of the leading automobile manufacturers. This external data in the DW helps the car rental company plan its fleet management.
The purpose served by such external data sources cannot be fulfilled by the data available within the organization.
Usually, data from outside sources does not conform to your formats.
We have to devise conversions of the data into our internal formats and data types.
Some sources may provide information at regular, stipulated intervals, and others may give you data on request.
We need to accommodate these variations.
This type of data comes from the various OLTP or operational systems of the enterprise.
While dealing with this data, you come across many variations in the data formats.
You also notice that the data resides on different hardware platforms.
The data is supported by different database systems and operating systems.
This is data from many vertical applications.
The significant and disturbing characteristic of production data is disparity.
We need to standardize and transform the disparate data from the various production systems, convert the data, and integrate the pieces into useful data for storage in the DW.
The following data is internal data, parts of which may be required in the DW:
Private spreadsheets
Documents
Customer profiles
Sometimes even departmental databases
OLTP or operational systems are primarily intended to run the current business.
In OLTP or operational systems, old data is periodically taken out and stored in archived files.
The DW keeps historical snapshots of data for analysis over time.
To get historical data, we need to connect to the archived data sets.
Depending on the business requirements, you have to include sufficient historical data in the DW.
This type of data is useful for discerning patterns and analyzing trends.
The data extracted from the various disparate source systems and from external sources needs to be changed, converted, combined, deduplicated, and made ready in a format that is suitable to be stored for querying and analysis.
Three major functions need to be performed to get the data ready in the Staging Area (SA):
Extract the data from the source systems
Transform the data
Load the data into the DW
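The three staging functions can be sketched end to end as below. Everything here is an assumption made for illustration: the customer rows, the rule that six-byte ids are zero-padded to eight bytes, the choice to keep the first occurrence of a duplicate, and the in-memory SQLite database standing in for the DW.

```python
import sqlite3


def extract():
    """Stand-in for reading from source systems; the rows are made up."""
    return [
        {"cust_id": "000101", "name": " alice ", "system": "checking"},
        {"cust_id": "00000101", "name": "Alice", "system": "savings"},
        {"cust_id": "000202", "name": "BOB", "system": "checking"},
    ]


def transform(rows):
    """Standardize formats and drop duplicates that describe the same customer."""
    seen = {}
    for row in rows:
        cust_id = row["cust_id"].zfill(8)           # assumed rule: pad ids to eight bytes
        name = row["name"].strip().title()          # one naming convention
        seen.setdefault(cust_id, (cust_id, name))   # keep the first occurrence
    return list(seen.values())


def load(conn, rows):
    """Write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer (cust_id TEXT PRIMARY KEY, name TEXT)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database standing in for the DW
load(conn, transform(extract()))
loaded = conn.execute("SELECT cust_id, name FROM customer ORDER BY cust_id").fetchall()
print(loaded)
```

Keeping the three functions separate mirrors the staging area: each step can be rerun or replaced without touching the other two.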
The data storage for the DW is a separate repository.
The data in the DW is kept in structures suitable for analysis.
In a DW, any of the following data models can be used:
Star schema
Snowflake schema
Starflake schema
DWs are read-only data repositories.
The IDS component includes different methods of information delivery.
Ad hoc reports are predefined reports primarily meant for novice and casual users.
Provision for complex queries, multidimensional (MD) analysis, and statistical analysis caters to the needs of the business analysts and power users.
Information fed into the Executive Information Systems (EIS) is meant for senior executives and high-level managers.
Some DWs also provide data to data-mining applications. These are knowledge discovery systems where the mining algorithms help you discover trends and patterns from the usage of your data.
[Diagram: Information Delivery System: ad hoc reports, complex queries, MD analysis, statistical analysis, EIS feed, data mining; delivered online, over the intranet, the Internet, and e-mail]
Metadata in a DW is similar to the data dictionary (DD) or the catalog in a database management system.
In the DD, we can keep information about:
Logical data structures
Files and addresses
Indexes, and so on
The DD contains data about the data in the database.
Contd
Types Of Metadata:
Operational Metadata
Extraction and Transformation Metadata
End-User Metadata
Example
In general, an organization is started to earn money by selling a product or by providing a service for the product. An organization may be in one place or may have several branches.
When we consider an example of an organization selling products throughout the world, the four major dimensions are product, location, time, and organization.
Dimension tables are explained in detail under the section Dimensions. With this example, we will try to provide a detailed explanation of the STAR SCHEMA.
A star schema is a relational database schema for representing multidimensional data.
It is the simplest form of data warehouse schema, and it contains one or more dimension tables and fact tables.
It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star, where one fact table is connected to multiple dimensions.
The center of the star schema consists of a large fact table, and it points towards the dimension tables.
The advantages of a star schema are slicing down, increased performance, and easy understanding of data.
1) Identify a business process for analysis (like sales).
2) Identify measures or facts (sales dollar).
3) Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
4) List the columns that describe each dimension (region name, branch name).
5) Determine the lowest level of summary in a fact table (sales dollar).
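The design steps above can be sketched as a tiny star schema using Python's built-in sqlite3 module. The table names, column names, and sample rows are hypothetical; they only follow the sales example with its product, location, time, and organization dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim      (product_key  INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);
CREATE TABLE location_dim     (location_key INTEGER PRIMARY KEY, state_name TEXT, region_name TEXT);
CREATE TABLE time_dim         (time_key     INTEGER PRIMARY KEY, month_name TEXT, year INTEGER);
CREATE TABLE organization_dim (org_key      INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES product_dim(product_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    time_key     INTEGER REFERENCES time_dim(time_key),
    org_key      INTEGER REFERENCES organization_dim(org_key),
    sales_dollar REAL,
    PRIMARY KEY (product_key, location_key, time_key, org_key)
);
INSERT INTO product_dim VALUES (1, 'Laptop', 'Computers'), (2, 'Phone', 'Mobile');
INSERT INTO location_dim VALUES (1, 'Texas', 'South'), (2, 'Nevada', 'West');
INSERT INTO time_dim VALUES (1, 'January', 2011), (2, 'February', 2011);
INSERT INTO organization_dim VALUES (1, 'Branch A');
INSERT INTO sales_fact VALUES (1, 1, 1, 1, 1000.0), (2, 1, 1, 1, 500.0), (1, 2, 2, 1, 750.0);
""")

# Slice the facts by one dimension attribute: total sales dollars per region.
sales_by_region = conn.execute("""
    SELECT l.region_name, SUM(f.sales_dollar)
    FROM sales_fact AS f
    JOIN location_dim AS l ON l.location_key = f.location_key
    GROUP BY l.region_name
    ORDER BY l.region_name
""").fetchall()
print(sales_by_region)
```

The fact table holds only the measure and the foreign keys, so any question about sales is one join away from the dimension that describes it.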
Important aspects of Star Schema & Snowflake Schema
1) In a star schema, every dimension will have a primary key.
2) In a star schema, a dimension table will not have any parent table.
3) Whereas in a snowflake schema, a dimension table will have one or more parent tables.
4) Hierarchies for the dimensions are stored in the dimension table itself in a star schema.
5) Whereas hierarchies are broken into separate tables in a snowflake schema. These hierarchies help to drill down the data from the topmost level to the lowermost level.
Hierarchy: a logical structure that uses ordered levels as a means of organizing data.
A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might be used to aggregate data from the Month level to the Quarter level, and from the Quarter level to the Year level.
A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the hierarchy represent aggregated totals or not.
Level: a position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.
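Aggregating along the Month, Quarter, Year hierarchy can be sketched as two successive roll-ups. The monthly figures and the to_quarter helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical monthly totals keyed by (year, month)
monthly_sales = {
    (2011, 1): 100, (2011, 2): 150, (2011, 3): 50,
    (2011, 4): 200, (2011, 7): 300,
}


def to_quarter(year, month):
    """Map a Month-level key to its parent at the Quarter level."""
    return (year, (month - 1) // 3 + 1)


# Month level -> Quarter level
quarterly_sales = defaultdict(int)
for (year, month), amount in monthly_sales.items():
    quarterly_sales[to_quarter(year, month)] += amount

# Quarter level -> Year level
yearly_sales = defaultdict(int)
for (year, _quarter), amount in quarterly_sales.items():
    yearly_sales[year] += amount

print(dict(quarterly_sales))
print(dict(yearly_sales))
```

Reading the same mapping in the opposite direction, from Year down to Quarter down to Month, gives the navigational drill path.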
Fact table: a table in a star schema that contains facts and is connected to dimensions.
A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables.
The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables).
A fact table usually contains facts with the same level of aggregation.
A snowflake schema is a star schema structure normalized through the use of outrigger tables, i.e. dimension table hierarchies are broken into simpler tables.
In the star schema example, we had 4 dimensions (location, product, time, organization) and a fact table (sales).
In the snowflake schema, the example diagram has 4 dimension tables, 4 lookup tables, and 1 fact table.
The reason is that the hierarchies (category, branch, state, and month) are broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively and shown separately.
In OLAP, this snowflake schema approach increases the number of joins and results in poorer performance in retrieval of data.
A few organizations try to normalize the dimension tables to save space. Since dimension tables take up relatively little space, the snowflake schema approach may be avoided.
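The extra join that a snowflake schema introduces can be seen by answering the same question against both layouts. This sketch uses Python's built-in sqlite3 module; the table names, columns, and rows are hypothetical and cover only the PRODUCT dimension with its category level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: the category level lives inside the dimension table.
CREATE TABLE product_star (product_key INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);
-- Snowflake: the category level is broken out into a lookup table.
CREATE TABLE category_lookup (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_snow (product_key INTEGER PRIMARY KEY, product_name TEXT,
                           category_key INTEGER REFERENCES category_lookup(category_key));
CREATE TABLE sales_fact (product_key INTEGER, sales_dollar REAL);

INSERT INTO category_lookup VALUES (10, 'Computers'), (20, 'Mobile');
INSERT INTO product_star VALUES (1, 'Laptop', 'Computers'), (2, 'Tablet', 'Computers'), (3, 'Phone', 'Mobile');
INSERT INTO product_snow VALUES (1, 'Laptop', 10), (2, 'Tablet', 10), (3, 'Phone', 20);
INSERT INTO sales_fact VALUES (1, 1000.0), (2, 400.0), (3, 500.0);
""")

# Star schema: one join from the fact table to the dimension.
star_query = """
    SELECT p.category_name, SUM(f.sales_dollar)
    FROM sales_fact f
    JOIN product_star p ON p.product_key = f.product_key
    GROUP BY p.category_name ORDER BY p.category_name
"""

# Snowflake schema: a second join is needed to reach the category name.
snowflake_query = """
    SELECT c.category_name, SUM(f.sales_dollar)
    FROM sales_fact f
    JOIN product_snow p ON p.product_key = f.product_key
    JOIN category_lookup c ON c.category_key = p.category_key
    GROUP BY c.category_name ORDER BY c.category_name
"""

star_result = conn.execute(star_query).fetchall()
snowflake_result = conn.execute(snowflake_query).fetchall()
print(star_result == snowflake_result)
```

Both queries return the same totals; the snowflake version stores each category name once but pays for it with one more join per hierarchy level.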
Additive - Measures that can be added across all dimensions.
Non-Additive - Measures that cannot be added across any dimension.
Semi-Additive - Measures that can be added across some dimensions but not across others.
In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called factless fact tables.
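The difference between additive and semi-additive measures can be sketched with daily account snapshots. The rows and figures are made up: deposits stand in for an additive measure, and the end-of-day balance for a semi-additive one that adds up across accounts but not across days.

```python
# Hypothetical daily snapshot rows: (day, account, deposits_that_day, end_of_day_balance)
rows = [
    ("2011-08-01", "A", 100, 500),
    ("2011-08-01", "B", 50, 300),
    ("2011-08-02", "A", 20, 520),
    ("2011-08-02", "B", 0, 300),
]

# Additive: deposits can be summed across accounts and across days.
total_deposits = sum(deposit for _day, _acct, deposit, _bal in rows)

# Semi-additive: balances can be summed across accounts for a single day...
balance_on_day_2 = sum(bal for day, _acct, _dep, bal in rows if day == "2011-08-02")

# ...but not across days; for account A the meaningful figure is the latest balance.
latest_day = max(day for day, acct, _dep, _bal in rows if acct == "A")
latest_balance_a = next(
    bal for day, acct, _dep, bal in rows if acct == "A" and day == latest_day
)

print(total_deposits, balance_on_day_2, latest_balance_a)
```

Summing account A's balance over both days would give 1020, a number that corresponds to nothing in the business, which is why the time dimension is excluded for this measure.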