
Ramesh Kutumbaka

8/24/2011

OLTP systems are meant for day-to-day business operations; they do not maintain historical data and are highly normalized. You can query an operational system for information about specific instances of business objects. For example, you may want just the name and address of a single customer, or you may need to look at a single invoice and the items billed on it. You do not expect a particular query to run across different databases, internal data, external data, and so on. The reasons are: a term like "account" may have different meanings in different systems, and the disparate data from the various production systems would first have to be standardized, transformed, converted, and integrated.

DW Provides Insight into all Components of Enterprise Business


This means that there is no conformance of data among the various operational or OLTP systems of an enterprise.

So, what do we need to do?

Decision Maker

Building DW/DSS/OLAP/IDS is necessary.

We don't need systems that are only pretty good at transaction processing and not pretty good at querying.

Ralph Kimball


A data warehouse is an Information Delivery System (IDS) for strategic decisions. Basically, it is a Decision Support System (DSS).

What do we need to do to build the IDS/DSS/DW?

Integrate all the historical data from the various operational systems, combine this internal data with any relevant data from outside sources, and pull it all together into the DW. Resolve any conflicts in the way data resides in the different source systems, and transform, derive, and integrate the data content into a format suitable for providing information to the various categories of users. Finally, implement the IDS.

[Figure: source systems SS1 through SS8 feeding the DW]


We need to have different components or building blocks.

These building blocks are arranged together in the most optimal way to serve the intended purpose.

Building blocks are arranged in a suitable Architecture.


Bill Inmon

Bill Inmon is universally recognized as the "father of the data warehouse."

Inmon defined "A DW is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".


Ralph Kimball
Ralph is a leading proponent of the dimensional approach to designing large data warehouses.

A Data Warehouse is "a copy of transaction data specifically structured for query and analysis".

This definition provides less insight and depth than Mr. Inmon's, but is no less accurate.


Bill Inmon's paradigm:

Data warehouse is one part of the overall business intelligence system.

An enterprise has one data warehouse, and data marts source their information from the data warehouse.


Ralph Kimball's paradigm:

The data warehouse is the conglomerate of all data marts within the enterprise.

Information is always stored in the dimensional model.


There is no right or wrong between these two ideas, as they represent different data warehousing philosophies.

In reality, the data warehouse in most enterprises is closer to Ralph Kimball's idea.

This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart.

Only when more data marts are built later do they evolve into a data warehouse.


Sean Kelly is another leading data warehousing practitioner.

The data in the Data warehouse is:

Separate

Available

Integrated

Time Stamped

Subject Oriented

Nonvolatile

Accessible


For proper decision making, we need to pull together all the relevant data from the various applications.

The data in the data warehouse comes from several operational systems. Source data are in different databases, files, and data segments.

These are disparate applications, so the operational platforms and operating systems could be different.

The file layouts, character code representations, and field naming conventions could all be different.

In addition to data from internal operational systems, for many enterprises, data from outside sources is likely to be very important, and this is one more variation in the mix of source data for a data warehouse.

[Figure: source systems SS1 through SS8 feeding the DW]


Consider three different source systems (Savings Account, Checking Account, and Loans Account) that feed the Account subject area:

Naming conventions would be different.

Attributes for data items could be different.

The account number in the savings account application could be eight bytes long, but only six bytes in the checking account application.

[Figure: Integration of different source systems. Savings Account, Checking Account, and Loans Account feed the Account subject area.]

Before moving the Data into the data warehouse, you have to go through a process of transformation, consolidation, and integration of the source data.


Example:
In order to store data, over the years, many application designers in each branch have made their individual decisions as to how an application and database should be built.

So source systems will be different in naming conventions, variable measurements, encoding structures, and physical attributes of data.

Consider a bank that has several branches in several countries and millions of customers, and whose lines of business are savings and loans.

The following example explains how the data is integrated from source systems to target systems.


System Name      Attribute Name             Column Name                Datatype       Values
Source System 1  Customer Application Date  CUSTOMER_APPLICATION_DATE  NUMERIC(8, 0)  11012005
Source System 2  Customer Application Date  CUST_APPLICATION_DATE      DATE           11012005
Source System 3  Application Date           APPLICATION_DATE           DATE           01NOV2005

In the aforementioned example, attribute name, column name, data type and values are entirely different from one source system to another.

This inconsistency in data can be avoided by integrating the data into a data warehouse with good standards.


Target System  Attribute Name             Column Name                Datatype  Values
Record #1      Customer Application Date  CUSTOMER_APPLICATION_DATE  DATE      01112005
Record #2      Customer Application Date  CUSTOMER_APPLICATION_DATE  DATE      01112005
Record #3      Customer Application Date  CUSTOMER_APPLICATION_DATE  DATE      01112005

In the above example of target data, attribute names, column names, and data types are consistent throughout the target system.

This is how data from various source systems is integrated and accurately stored into the data warehouse.
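The conversion from the three source layouts to the single target layout can be sketched in Python. This is only an illustration under stated assumptions: the slides do not say how the source values are encoded, so the code assumes the eight-digit values are MMDDYYYY and the third source uses DDMONYYYY, which is what the target value 01112005 (read as DDMMYYYY) suggests. The function name and the system labels are hypothetical.

```python
from datetime import date, datetime

# Month abbreviations used by Source System 3 (e.g. 01NOV2005).
MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}


def to_application_date(system: str, raw) -> date:
    """Convert one source value to a Python date (hypothetical helper)."""
    text = str(raw).strip()
    if system in ("source_system_1", "source_system_2"):
        # Assumption: the eight digits are MMDDYYYY. A numeric column drops
        # a leading zero, so pad back to eight digits before parsing.
        return datetime.strptime(text.zfill(8), "%m%d%Y").date()
    if system == "source_system_3":
        # Assumption: DDMONYYYY, such as 01NOV2005.
        day, month, year = int(text[:2]), MONTHS[text[2:5].upper()], int(text[5:])
        return date(year, month, day)
    raise ValueError(f"unknown source system: {system}")


rows = [
    ("source_system_1", 11012005),
    ("source_system_2", "11012005"),
    ("source_system_3", "01NOV2005"),
]
# Render every value in the target layout, DDMMYYYY.
target_values = [to_application_date(s, v).strftime("%d%m%Y") for s, v in rows]
print(target_values)
```

Keeping one small converter per source layout makes the assumption about each source explicit, so a wrong guess about a format is easy to find and correct.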


In Online Transaction Processing (OLTP) systems:

We capture and store the data by individual application.

Example: Order Processing

The application users enter orders, check stock, verify customers' credit, and assign orders for shipment.

We capture and store the data related to this particular application. Here, we will have data about individual orders, customers, stock status, and detailed transactions, but all of these are structured around the processing of orders.


Operational Applications:
Order Processing
Consumer Loans
Accounts Receivable
Claims Processing
Savings Accounts

Data Warehouse Subjects:
Sales
Product
Customer
Account
Claims
Policy

In a data warehouse, data is stored not by operational application but by business subject.


In OLTP systems, the stored data contains the current values.

For example:
The balance is the current outstanding balance in the customer's account.
The status of an order is the current status of the order.

Of course, in OLTP systems we do store some past transactions, but essentially OLTP systems reflect current information because these systems support day-to-day current operations.

By contrast, when an analyst in a grocery chain wants to promote two or more products together, that analyst wants sales of the selected products over a number of past quarters.

A DW is a time-variant database: it supports the business community in comparing the business across different time periods.


A data warehouse is a non-volatile database. Once data is entered into the data warehouse, it should not change.

Data from the OLTP systems is moved into the DW at specific intervals.

Depending on the business requirements, these data movements take place twice a day, once a week, or once in two weeks.

In fact, in a typical DW, data movements to different data sets may take place at different frequencies.

The changes to the attributes of the product may be moved once a week.

Any revisions to the geographical setup may be moved once a month.

The units of sales may be moved once a day.

[Figure: OLTP system applications, OLTP databases, and loads into the DW; the surviving labels include Add and Read]


In an OLTP system, the data is captured at the lowest level of detail. For example:

In an order entry system, the quantity ordered is captured and stored at the unit level of a product, per order received from the customer.

Whenever you need summary data, you add up the individual transactions.

If you are looking for the units of a product ordered this month, you read all the orders entered for the entire month for that product and add them up.

Note:
We do not keep summary data in the OLTP/operational systems.
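As a rough sketch of what "add up the individual transactions" means in practice, the Python below computes a monthly total directly from detail rows. The order lines, product names, and the `units_ordered` function are made up for illustration.

```python
from datetime import date

# Hypothetical order lines: (order date, product, units ordered).
order_lines = [
    (date(2011, 8, 2), "widget", 5),
    (date(2011, 8, 15), "widget", 3),
    (date(2011, 8, 20), "gadget", 7),
    (date(2011, 7, 30), "widget", 4),
]


def units_ordered(lines, product, year, month):
    """Add up the units of one product over every order in one month."""
    return sum(
        units
        for ordered_on, name, units in lines
        if name == product and (ordered_on.year, ordered_on.month) == (year, month)
    )


august_widgets = units_ordered(order_lines, "widget", 2011, 8)
print(august_widgets)
```

Every request for a summary rescans the detail rows, which is acceptable for a single lookup but is one reason analytical queries are better served by a warehouse that stores summaries.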


When a user queries the DW for analysis, he or she usually starts by looking at summary data.

The user may start with the total sale units of a product in an entire region.

Then the user may want to look at the breakdown by states in the region.

The next step may be the examination of sale units at the next level, that of individual stores.

Frequently, the analysis starts at a high level and moves down to lower levels of detail.

[Figure: Data warehouse holding aggregated/summary data and detailed data]
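The drill-down path described above (region, then state, then store) can be illustrated with a small Python sketch. The rows and the `summarize` helper are hypothetical; a real DW would answer these questions from stored summary and detail tables rather than from an in-memory list.

```python
from collections import defaultdict

# Hypothetical detailed rows: (region, state, store, sale units).
sales = [
    ("East", "New York", "Store 1", 120),
    ("East", "New York", "Store 2", 80),
    ("East", "New Jersey", "Store 3", 50),
    ("West", "California", "Store 4", 200),
]


def summarize(rows, depth):
    """Total sale units grouped by the first `depth` levels of the path."""
    totals = defaultdict(int)
    for *path, units in rows:
        totals[tuple(path[:depth])] += units
    return dict(totals)


by_region = summarize(sales, 1)  # highest level: region
by_state = summarize(sales, 2)   # drill down: region, state
by_store = summarize(sales, 3)   # lowest level: region, state, store
print(by_region)
```

Each deeper level splits the totals of the level above it, so the state figures for a region always add back up to that region's total.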


There are basically two different approaches to building a DW:

Top-down approach

Bottom-up approach

Top-down approach:
An overall DW feeds dependent data marts.

Data is extracted from the OLTP systems.

Data is transformed, cleaned, integrated, and kept in the DW.

Bottom-up approach:
Departmental or local data marts are built first.

Several departmental or local data marts are then combined into a DW.


[Figure: Top-down approach: disparate source systems feed the DW, which feeds data marts DM1, DM2, and DM3. Bottom-up approach: disparate source systems feed data marts DM1, DM2, and DM3, which combine into the DW.]

A data warehouse is a relational/multidimensional database that is designed for query and analysis rather than transaction processing.

A data warehouse usually contains historical data that is derived from transaction data.

It separates analysis workload from transaction workload and enables a business to consolidate data from several sources.

In addition to a relational/multidimensional database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

There are three types of data warehouses.


1. Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.

2. ODS (Operational Data Store) - This has a broad, enterprise-wide scope, but unlike the real enterprise data warehouse, data is refreshed in near real time and used for routine business activity.

3. Data Mart - A data mart is a subset of a data warehouse, and it supports a particular region, business unit, or business function.

Data warehouses and data marts are built on dimensional data modeling, where fact tables are connected with dimension tables.

This is most useful for users accessing data, since a database can be visualized as a cube of several dimensions.

A data warehouse provides an opportunity for slicing and dicing that cube along each of its dimensions.


A data mart is a subset of a data warehouse that is designed for a particular line of business, such as sales, marketing, or finance.

In a dependent data mart, data can be derived from an enterprise-wide data warehouse.

In an independent data mart, data can be collected directly from sources.

[Figure: DW and data marts DM1 through DM4, drawn once for the dependent case and once for the independent case]


Building Blocks or Components of DW Architecture

[Figure: DW architecture. Source data (production data, internal data, archived data, external data) flows through data staging into data storage (DW DBMS, multidimensional DBs, data marts DM1 and DM2) and on to information delivery (report/query, OLAP, data mining), with metadata alongside.]

Architecture is the proper arrangement of the components.

Source data coming to the DW may be grouped into four broad categories, as shown in the previous slide.

External Data

Production Data

Internal Data

Archived Data


Most Executives depend on data from external sources for a high percentage of the information they use.

They use statistics relating to their industry produced by external agencies.

They use market share data of competitors.

They use standard values of financial indicators for their business to check on their performance.


The DW of a car rental company contains data on the current production schedules of the leading automobile manufacturers. This external data in the DW helps the car rental company plan its fleet management.

The purpose served by such external data sources cannot be fulfilled by the data available within the organization.

Usually, data from outside sources does not conform to your formats.

You have to devise conversions of the data into your internal formats and data types.

Some sources may provide information at regular, stipulated intervals; others may give you data on request.

You need to accommodate these variations.


Production data comes from the various OLTP or operational systems of the enterprise.

While dealing with this data, you come across many variations in the data formats.

You also notice that the data resides on different hardware platforms.

The data is supported by different database systems and operating systems.

This is data from many vertical applications.

The significant and disturbing characteristic of production data is disparity.

You need to standardize and transform the disparate data from the various production systems, convert the data, and integrate the pieces into useful data for storage in the DW.


The following are internal data, parts of which may be required in the DW:

Private spreadsheets

Documents

Customer profiles

Sometimes even departmental databases


OLTP or operational systems are primarily intended to run the current business.

In OLTP or operational systems, old data is periodically taken out and stored in archive files.

The DW keeps historical snapshots of data for analysis over time.

To get historical data, you need to connect to the archived data sets.

Depending on the business requirements, you have to include sufficient historical data in the DW.

This type of data is useful for discerning patterns and analyzing trends.


The data extracted from the various disparate source systems and from external sources needs to be changed, converted, combined, de-duplicated, and made ready in a format that is suitable to be stored for querying and analysis.

Three major functions need to be performed to get the data ready in the Staging Area (SA):

Extract the data from the source systems.
Transform the data.
Load the data into the DW.
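The three staging functions above can be sketched as three small Python functions. This is a minimal illustration, not a description of any particular ETL tool: the two source layouts, the customer table, and the cleaning rules are all assumptions, and an in-memory SQLite database stands in for the DW.

```python
import sqlite3

# Hypothetical extracts from two source systems with different layouts.
SOURCE_A = [{"cust_no": "101", "cust_name": " alice "}, {"cust_no": "102", "cust_name": "Bob"}]
SOURCE_B = [{"id": 102, "name": "BOB"}, {"id": 103, "name": "carol"}]


def extract():
    """Extract: pull raw rows from each source as they are."""
    return SOURCE_A, SOURCE_B


def transform(rows_a, rows_b):
    """Transform: convert to one layout, clean names, remove duplicates."""
    unified = [(int(r["cust_no"]), r["cust_name"]) for r in rows_a]
    unified += [(int(r["id"]), r["name"]) for r in rows_b]
    cleaned = {}
    for customer_id, name in unified:
        # Keep the first occurrence of each customer id.
        cleaned.setdefault(customer_id, name.strip().title())
    return sorted(cleaned.items())


def load(conn, rows):
    """Load: write the prepared rows into the warehouse table."""
    conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(*extract()))
loaded = conn.execute("SELECT customer_id, name FROM customer ORDER BY customer_id").fetchall()
print(loaded)
```

Customer 102 appears in both sources but is loaded once, which is the de-duplication step the staging area is responsible for.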


The data storage for the DW is a separate repository.

The data in the DW is kept in structures suitable for analysis.

In a DW, any of the following data models can be used:

Star schema
Snowflake schema
Starflake schema

DWs are read-only data repositories.


The IDS component includes different methods of information delivery.

Ad hoc reports are predefined reports primarily meant for novice and casual users.

Provision is made for complex queries and multidimensional (MD) analysis.

Statistical analysis caters to the needs of business analysts and power users.

Information fed into the Executive Information Systems (EIS) is meant for senior executives and high-level managers.

Some DWs also provide data to data-mining applications. These are knowledge discovery systems where the mining algorithms help you discover trends and patterns from the usage of your data.

[Figure: Information Delivery System. Delivery methods: ad hoc reports, complex queries, MD analysis, statistical analysis, EIS feed, data mining. Delivery channels: online, intranet, Internet, e-mail.]


Metadata in a DW is similar to the Data Dictionary (DD) or the Catalog in a Database Management System

In DD, we can keep the Information about

Logical Data Structures

Information about the Files and Addresses

Information about the Indexes and so on

The DD contains Data about the Data in the Database

Types Of Metadata:

Operational Metadata

Extraction and Transformation Metadata

End-User Metadata


Example

In general, an organization is started to earn money by selling a product or by providing a service for the product. An organization may be in one place or may have several branches.

When we consider the example of an organization selling products throughout the world, the four major dimensions are product, location, time, and organization.

Dimension tables are explained in detail under the section Dimensions. With this example, we will try to provide a detailed explanation of the star schema.


Star Schema is a relational database schema for representing multi dimensional data.

It is the simplest form of data warehouse schema that contains one or more dimensions and fact tables.

It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star where one fact table is connected to multiple dimensions.

The center of the star schema consists of a large fact table and it points towards the dimension tables.

The advantages of a star schema are slicing down, increased performance, and easy understanding of data.
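To make the shape of a star schema concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (product_dim, location_dim, time_dim, sales_fact, sales_dollar) and the sample rows are illustrative assumptions, and the organization dimension from the example is left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim  (product_id  INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE location_dim (location_id INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE time_dim     (time_id     INTEGER PRIMARY KEY, month TEXT, year INTEGER);
-- The fact table sits at the center and points at each dimension.
CREATE TABLE sales_fact (
    product_id   INTEGER REFERENCES product_dim(product_id),
    location_id  INTEGER REFERENCES location_dim(location_id),
    time_id      INTEGER REFERENCES time_dim(time_id),
    sales_dollar REAL
);
INSERT INTO product_dim  VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
INSERT INTO location_dim VALUES (1, 'Austin', 'Texas'), (2, 'Dallas', 'Texas');
INSERT INTO time_dim     VALUES (1, 'Jul', 2011), (2, 'Aug', 2011);
INSERT INTO sales_fact   VALUES (1, 1, 1, 10.0), (1, 2, 2, 15.0), (2, 1, 2, 40.0);
""")

# Slice the sales by product category: one join per dimension used.
by_category = conn.execute("""
    SELECT p.category, SUM(f.sales_dollar)
    FROM sales_fact f JOIN product_dim p ON p.product_id = f.product_id
    GROUP BY p.category ORDER BY p.category
""").fetchall()
print(by_category)
```

Because the category is stored in the product dimension itself, a query that slices by category needs only a single join from the fact table.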


Steps in designing a star schema:

1) Identify a business process for analysis (like sales).

2) Identify measures or facts (sales dollar).

3) Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).

4) List the columns that describe each dimension (region name, branch name).

5) Determine the lowest level of summary in a fact table (sales dollar).


Important aspects of Star Schema & Snowflake Schema

1) In a star schema, every dimension will have a primary key.

2) In a star schema, a dimension table will not have any parent table.

3) In a snowflake schema, by contrast, a dimension table will have one or more parent tables.

4) In a star schema, hierarchies for the dimensions are stored in the dimension table itself.

5) In a snowflake schema, hierarchies are broken into separate tables. These hierarchies help to drill down the data from the topmost level to the lowermost level.


Hierarchy: A logical structure that uses ordered levels as a means of organizing data.

A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might be used to aggregate data from the Month level to the Quarter level, and from the Quarter level to the Year level.

A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the hierarchy represent aggregated totals or not.

Level: A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.
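The Month, Quarter, Year aggregation described above can be sketched as two roll-up steps in Python. The monthly figures and function names are hypothetical, and calendar quarters (January to March as quarter 1, and so on) are assumed.

```python
from collections import defaultdict

# Hypothetical month-level totals keyed by (year, month).
monthly = {(2011, 1): 10, (2011, 2): 20, (2011, 3): 30, (2011, 4): 40, (2010, 12): 5}


def to_quarters(month_totals):
    """Aggregate Month-level totals to the Quarter level."""
    quarters = defaultdict(int)
    for (year, month), value in month_totals.items():
        # Assumption: calendar quarters, so months 1-3 map to quarter 1.
        quarters[(year, (month - 1) // 3 + 1)] += value
    return dict(quarters)


def to_years(quarter_totals):
    """Aggregate Quarter-level totals to the Year level."""
    years = defaultdict(int)
    for (year, _quarter), value in quarter_totals.items():
        years[year] += value
    return dict(years)


quarterly = to_quarters(monthly)
yearly = to_years(quarterly)
print(quarterly, yearly)
```

Reading the same functions in the opposite order gives the drill path: from a year total down to its quarters, and from a quarter down to its months.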


Fact table: A table in a star schema that contains facts and is connected to dimensions.

A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables.

The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.

A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables).

A fact table usually contains facts with the same level of aggregation.
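A composite primary key made up of the foreign keys can be illustrated with Python's built-in sqlite3 module. The table and column names are assumptions made for this sketch; the point is only that two fact rows cannot share the same combination of dimension keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_fact (
        product_id   INTEGER NOT NULL,   -- foreign key to a product dimension
        location_id  INTEGER NOT NULL,   -- foreign key to a location dimension
        time_id      INTEGER NOT NULL,   -- foreign key to a time dimension
        sales_dollar REAL,               -- the fact (measure)
        PRIMARY KEY (product_id, location_id, time_id)
    )
""")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 10.0)")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 2, 12.5)")  # differs in time_id

try:
    # Same combination of foreign keys as the first row.
    conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 99.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(duplicate_rejected, row_count)
```

The key also fixes the grain of the table: here, one row per product, location, and time period.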


A snowflake schema is a term that describes a star schema structure normalized through the use of outrigger tables; that is, dimension table hierarchies are broken into simpler tables.

In the star schema example, we had 4 dimensions (location, product, time, organization) and a fact table (sales).

In the snowflake schema, the example has 4 dimension tables, 4 lookup tables, and 1 fact table.

The reason is that the hierarchies (category, branch, state, and month) are broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively and shown separately.

In OLAP, this snowflake schema approach increases the number of joins and leads to poorer performance in retrieval of data.

A few organizations try to normalize the dimension tables to save space. Since dimension tables take up relatively little space, the snowflake schema approach may be avoided.
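The extra join that a snowflake schema introduces can be seen in a small sqlite3 sketch in Python. Only the product branch is modeled; the table name category_lookup and the sample rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Lookup table holding the hierarchy level broken out of PRODUCT.
CREATE TABLE category_lookup (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_dim (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id  INTEGER REFERENCES category_lookup(category_id)
);
CREATE TABLE sales_fact (product_id INTEGER REFERENCES product_dim(product_id), sales_dollar REAL);
INSERT INTO category_lookup VALUES (1, 'Stationery'), (2, 'Home');
INSERT INTO product_dim VALUES (1, 'Pen', 1), (2, 'Pencil', 1), (3, 'Lamp', 2);
INSERT INTO sales_fact VALUES (1, 10.0), (2, 5.0), (3, 40.0);
""")

# Two joins (fact -> dimension -> lookup) to total sales by category.
by_category = conn.execute("""
    SELECT c.category_name, SUM(f.sales_dollar)
    FROM sales_fact f
    JOIN product_dim p     ON p.product_id = f.product_id
    JOIN category_lookup c ON c.category_id = p.category_id
    GROUP BY c.category_name ORDER BY c.category_name
""").fetchall()
print(by_category)
```

Had the category name been kept as a column of the product dimension, the second join would not be needed; that is the trade-off between saving space and retrieval performance.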


Additive - Measures that can be added across all dimensions.

Non-Additive - Measures that cannot be added across any dimension.

Semi-Additive - Measures that can be added across some dimensions but not across others.
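A commonly used illustration of a semi-additive measure, which is not taken from these slides, is an account balance: it can be added across accounts for a given day, but adding it across days is meaningless. The Python sketch below uses made-up balances and hypothetical function names to show the difference.

```python
# Hypothetical daily account balances: (account, day) -> balance.
balances = {
    ("A", 1): 100, ("A", 2): 120,
    ("B", 1): 50,  ("B", 2): 30,
}


def total_on_day(day):
    """Adding across the account dimension is meaningful."""
    return sum(value for (_account, d), value in balances.items() if d == day)


def closing_balance(account):
    """Across time, take the latest value instead of summing."""
    last_day = max(d for (a, d) in balances if a == account)
    return balances[(account, last_day)]


# Summing one account over days gives a number with no business meaning.
misleading_sum = sum(value for (a, _d), value in balances.items() if a == "A")
print(total_on_day(2), closing_balance("A"), misleading_sum)
```

A fully additive measure such as sales dollars would not need this special handling, since it can be summed along every dimension, including time.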


In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called factless fact tables.
