
IBM Software Group

2007 IBM Corporation

IBM Software Group | WebSphere software

4/28/2012

TCS Confidential


Course Roadmap

- Why we use Data Warehousing
- Difference between Operational System and Data Warehouse
- Introduction to Data Warehousing
- Data Warehousing Approaches
- Data Warehouse Technical Architecture
- Data Modelling Concepts
- Operational Data Store
- Schema Design of Data Warehouse
- Data Acquisition
- ETL Products
- Project Life Cycle

Why We Need Data Warehousing?

- Better business intelligence for end-users
- Reduction in time to locate, access, and analyze information
- Consolidation of disparate information sources
- To store large volumes of historical detail data from mission critical applications
- Strategic advantage over competitors
- Faster time-to-market for products and services
- Replacement of older, less-responsive decision support systems
- Reduction in demand on IS to generate reports

What is an Operational System?

Operational systems are just what their name implies: they are the systems that help us run the day-to-day enterprise operations.

These are the backbone systems of any enterprise, such as order entry and inventory.

The classic examples are airline reservations, credit-card authorizations, and ATM withdrawals.

Characteristics of Operational Systems

- Continuous availability
- Predefined access paths
- Transaction integrity
- Volume of transactions: high
- Data volume per query: low
- Used by operational staff
- Supports day to day control operations
- Large number of users

OLTP Vs Data Warehouse

Operational System                  | Data Warehouse
------------------------------------|----------------------------------------
Transaction Processing              | Query Processing
Predictable CPU Usage               | Random CPU Usage
Time Sensitive                      | History Oriented
Operator View                       | Managerial View
Normalized Efficient Design for TP  | Denormalized Design for Query Processing

OLTP Vs Data Warehouse

Operational System                                           | Data Warehouse
-------------------------------------------------------------|------------------------------------------
Designed for Atomicity, Consistency, Isolation and Durability | Designed for a quiet or static database
Organized by transactions (Order, Input, Inventory)          | Organized by subject (Customer, Product)
Relatively smaller database                                  | Large database size
Many concurrent users                                        | Relatively few concurrent users
Volatile Data                                                | Non Volatile Data

Operational System     | Data Warehouse
-----------------------|------------------------------
Stores all data        | Stores relevant data
Performance Sensitive  | Less Sensitive to performance
Not Flexible           | Flexible
Efficiency             | Effectiveness

What is a Data Warehouse?

A Data Warehouse is a collection of data that is:

- Subject-Oriented
- Integrated
- Time-Variant
- Non-volatile

W. H. Inmon is regarded as the father of data warehousing.

Subject Oriented Analysis

[Figure: process-oriented transactional storage (an order Entry record holding Sales Rep, Quantity Sold, Part Number, Date, Customer Name, Product Description, Unit Price, Mail Address) versus subject-oriented data warehouse storage (Sales, Customers, Products).]

Integration of Data

Aspect              | Appl. A            | Appl. B              | Appl. C          | Data warehouse
--------------------|--------------------|----------------------|------------------|-------------------
Encoding            | M, F               | 1, 0                 | X, Y             | M, F
Unit of Attributes  | pipeline cm        | pipeline inches      | pipeline mcf     | pipeline cm
Physical Attributes | balance dec(13,2)  | balance PIC 9(9)V99  | balance float    | balance dec(13,2)
Naming Conventions  | bal-on-hand        | current_balance      | balance          | balance
Data Consistency    | date (Julian)      | date (yymmdd)        | date (absolute)  | date (Julian)

The application columns are transactional storage; the last column is data warehouse storage after integration.
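The integration rules on this slide can be sketched in code. This is a minimal illustration, not the deck's own method: the slide lists the code pairs (M/F, 1/0, X/Y) but does not say which source code corresponds to which warehouse value, so the mapping below is an assumption, and the function names are hypothetical.

```python
# Hypothetical per-application gender encodings mapped to the warehouse
# standard (M, F). Which source code means "M" is an assumption; the slide
# only lists the code pairs.
GENDER_MAP = {
    "A": {"M": "M", "F": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"X": "M", "Y": "F"},
}

CM_PER_INCH = 2.54  # exact by definition


def to_warehouse_gender(app: str, code) -> str:
    """Translate an application-specific gender code to the warehouse encoding."""
    return GENDER_MAP[app][str(code)]


def pipeline_to_cm(value: float, unit: str) -> float:
    """Convert a pipeline length to centimetres, the warehouse unit."""
    if unit == "cm":
        return value
    if unit == "inches":
        return value * CM_PER_INCH
    # Other units (such as the slide's "mcf") need a business rule that the
    # slide does not give, so they are rejected here.
    raise ValueError(f"unsupported unit: {unit}")
```

Keeping the mappings in one table per attribute means a new source application only adds an entry, rather than new conversion logic.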

Volatility of Data

- Transactional storage (volatile): record-by-record data manipulation (Insert, Change, Delete, Access)
- Data warehouse storage (non-volatile): mass load / access of data (Load, Access)

Time Variant Data Analysis

- Transactional storage: current data
- Data warehouse storage: historical data

[Figure: bar chart of Sales (in lakhs) by Region (East, West, North) for January, February and March, 1st Qtr of Year 97.]

Differences: Data Warehouse from Operational Systems

Operational systems database                          | Data warehouse
------------------------------------------------------|-------------------------------------------------------------
Update, Insert, Delete                                | Initial Load, then Incremental Loads (Load / Update)
Constant change                                       | Consistent points in time
Updated constantly                                    | Added to regularly, but loaded data is rarely directly changed
Data changes according to need, not a fixed schedule  |

This does NOT mean the data warehouse is never updated or never changes!


Difference B/W OLTP AND OLAP


DW Implementation Approaches

- Top Down
- Bottom Up
- Combination of both

Choices depend on:

- current infrastructure
- resources
- architecture
- ROI
- implementation speed

EDW - Top Down Approach

[Figure: heterogeneous source systems (Source 1, Source 2, Source 3) feed a common staging interface layer (Staging). Staging loads the Enterprise Data Warehouse, and incremental architected data marts (DM 1, DM 2, DM 3) follow it in the data mart bus architecture layer.]

EDW - Bottom Up Approach

[Figure: heterogeneous source systems (Source 1, Source 2, Source 3) feed a common staging interface layer (Staging). Staging loads incremental architected data marts (DM 1, DM 2, DM 3) in the data mart bus architecture layer, and the Enterprise Data Warehouse follows them.]

Ralph Kimball Approach

Flow: Source System -> (Extract) -> Data Staging Area -> (Load) -> Presentation Area -> (Access) -> Data Access Tools

Data Staging Area
- Services: transform from source to target, maintain conformed dimensions
- No user query support
- Data store: flat files or relational tables
- Design goals: staging throughput, integrity / consistency

Presentation Area (Data Mart #1, Data Mart #2, Data Mart #...)
- Dimensional
- Atomic AND summary data
- Business process centric
- Design goals: ease of use, query performance
- Data Mart Bus: conformed facts and dimensions

Data Access Tools
- Ad hoc query tools
- Report writers
- Analytic applications
- Modeling: forecasting, scoring, data mining

Independent Data Marts: Ralph Kimball's Ideology

Bottom Up Approach

Staging Data Store
- E/R design or flat file
- Retain history needed for regular processing
- No end user access

Data Warehouse
- Integrated data
- Timely user access
- Conformed dimensions
- Single process to build dimension

Data Mart
- Dimensional
- Transaction and summary data
- Single subject area (i.e. fact table)
- Multiple marts may exist in a single database instance

Bill Inmon Approach

Flow: Source System -> (Extract) -> Data Staging Area -> (ETL, Load) -> DWH -> Presentation Area -> (Access) -> Data Access Tools

Enterprise Data Warehouse
- Normalized tables
- Atomic data
- User query support to atomic data

Data Marts (Data Mart #1, Data Mart #2, Data Mart #...)
- Dimensional
- Summary data
- Departmental centric

Dependent Data Marts: Bill Inmon's Ideology

Top Down Approach

Staging Data Store
- Raw input data
- Flat file

Data Warehouse
- Integrated data
- Timely user access
- Single process to build dimension
- E/R model
- Subject areas
- Transaction level detail
- Historical persistency as justified: archive for retrieval if needed

Data Mart
- Most are dimensional
- Design by business function
- Summary level data

DW Implementation Approaches

Top Down
- More planning and design initially
- Involve people from different workgroups, departments
- Data marts may be built later from Global DW
- Overall data model to be decided upfront

Bottom Up
- Can plan initially without waiting for global infrastructure
- Built incrementally
- Can be built before or in parallel with Global DW
- Less complexity in design

DW Implementation Approaches

Top Down
- Consistent data definition and enforcement of business rules across enterprise
- High cost, lengthy process, time consuming
- Works well when there is a centralized IS department responsible for all H/W and resources

Bottom Up
- Data redundancy and inconsistency between data marts may occur
- Integration requires great planning
- Less cost of H/W and other resources
- Faster pay-back

DW Architectures

[Figure: data warehouse architecture in five layers.
- Data Sources: transaction data (Prod, Mkt, HR, Fin, Acctg), other internal data (ERP, web data, clickstream), external data (demographic)
- ETL: Extract, Clean/Scrub, Transform, Load
- Data Stores: staging area, operational data store, data warehouse, data marts (Finance, Marketing, Sales), meta data
- Data Analysis Tools and Applications: queries, reporting, DSS/EIS, data mining, web browser
- Users: analysts, managers, executives, operational personnel, customers/suppliers
Products named in the figure: IBM IMS, VSAM, Oracle, Sybase, Informix, SAP, Harte-Hanks, Ascential, IBM DataStage, Sagent, SAS, Firstlogic, SQL, Teradata, IBM, Essbase, Microsoft, Cognos, MicroStrategy, Siebel, Business Objects.]

Benefits of DWH

- To formulate effective business, marketing and sales strategies.
- To precisely target promotional activity.
- To discover and penetrate new markets.
- To successfully compete in the marketplace from a position of informed strength.
- To build predictive rather than retrospective models.


Data Modeling


WHAT IS A DATA MODEL?

A data model is an abstraction of some aspect of the real world (system).


WHY A DATA MODEL?

- Helps to visualize the business
- A model is a means of communication.
- Models help elicit and document requirements.
- Models reduce the cost of change.
- The model is the essence of the DW architecture, based on which the DW will be implemented

What do we want to do with the data?

The model depends on what kind of data analysis we want to do. Different data analysis techniques:

- Query and reporting: display query results
- Multidimensional analysis: analyse data content by looking at it from different perspectives
- Data mining: discover patterns and clustering attributes in data

Impact of Data Analysis Techniques on DM

Query and reporting
- Normalized data model
- Select associated data elements
- Summarize and group by category
- Present results
- Direct table scan
- ER with normalized / denormalized appropriate

Multidimensional analysis
- Fast and easy access to data
- Any number of analysis dimensions in any combinations
- ER will mean many joins
- Dimensional model appropriate

STEPS in DATA MODELING

1. Problem & scope definition
2. Requirement gathering
3. Analysis
4. Logical database design
5. Deciding database
6. Physical database design
7. Schema generation

What needs to be modeled during a data warehouse project

- STAGING AREA: YES! (maybe multiple data models are required)
- ODS: YES!
- DATAWAREHOUSE / DATAMART: YES!

Levels of modeling

Conceptual modeling
- Describe data requirements from a business point of view without technical details

Logical modeling
- Refine conceptual models
- Data structure oriented, platform independent

Physical modeling
- Detailed specification of what is physically implemented using specific technology

Modeling Techniques

Entity-Relationship Modeling
- Traditional modeling technique
- Technique of choice for OLTP
- Suited for corporate data warehouse

Dimensional Modeling
- Analyzing business measures in the specific business context
- Helps visualize very abstract business questions
- End users can easily understand and navigate the data structure

Entity-Relationship Modeling - Basic Concepts

Relationship
- Relationship between entities: structural interaction and association
- Described by a verb
- Cardinality: 1-1, 1-M, M-M
- Example: Books belong to Printed Media

Entity-Relationship Modeling - Basic Concepts

Attributes
- Characteristics and properties of entities
- Example: Book Id, Description, book category are attributes of entity Book
- Attribute name should be unique and self-explanatory
- Primary Key, Foreign Key, Constraints are defined on Attributes

Review of Logical Modeling Terms & Symbols

- Entities define specific groups of information. Example entity: Sales Organization (Sales Org ID, Distribution Channel).
- One or more attributes uniquely identify an instance of an entity. Example identifier: Sales Org ID in Sales Organization.
- The logical model identifies relationships between entities. Example relationship: Sales Detail (Sales Record ID) and Sales Rep (Sales Rep ID).

Logical Data Model

[Figure: logical data model with the following entities and identifiers.
- Suppliers (Supplier ID)
- Customer (Customer ID): Retail Market, Wholesale, Industry
- Manufacturing Group (Manufacturing Org ID)
- Sales Detail (Sales Record ID)
- Sales Rep (Sales Rep ID)
- Sales Organization (Sales Org ID, Distribution Channel)
- Factory (Factory ID)
- Product (Product SKU)
- Product Sales Plan (Plan ID)]


Examples: ER Model


Limitations of E-R Modeling

- Poor performance
- Models tend to be very complex and difficult to navigate.

Dimensional Modeling


- Dimensional modeling uses three basic concepts: measures, facts, dimensions.
- It is powerful in representing the requirements of the business user in the context of database tables.
- It focuses on numeric data, such as values, counts, weights, balances and occurrences.

What is a Fact?

A fact is a collection of related data items, consisting of measures and context data. Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process.

Facts are measured, continuously valued, rapidly changing information. They can be calculated and/or derived.

Granularity: the level of detail of data contained in the data warehouse, e.g. daily item totals by product, by store.

Types of Facts

Additive
- Able to add the facts along all the dimensions
- Discrete numerical measures, e.g. retail sales in $

Semi Additive
- Snapshot, taken at a point in time
- Measures of intensity
- Not additive along the time dimension, e.g. account balance, inventory balance
- Added and divided by the number of time periods to get a time-average

Non Additive
- Numeric measures that cannot be added across any dimensions
- Intensity measure averaged across all dimensions, e.g. room temperature
- Textual facts: AVOID THEM
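The difference between additive and semi-additive facts can be shown with a few lines of Python. This is a small sketch with made-up daily rows (the figures and the tuple layout are assumptions for illustration): sales are summed over time, while balances are added and divided by the number of periods, as the slide describes.

```python
from statistics import mean

# Hypothetical daily rows for one account: (day, sales_amount, account_balance)
rows = [
    ("2000-01-01", 100.0, 500.0),
    ("2000-01-02", 150.0, 650.0),
    ("2000-01-03", 50.0, 700.0),
]

# Additive fact: sales can be summed along the time dimension.
total_sales = sum(sales for _, sales, _ in rows)

# Semi-additive fact: balances are point-in-time snapshots, so a sum over
# time is meaningless. Add them and divide by the number of periods instead.
average_balance = mean(balance for _, _, balance in rows)

print(total_sales, average_balance)
```

A non-additive fact such as room temperature would be averaged along every dimension, never summed.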


Dimensions
A dimension is a collection of members or units of the same type of views. Dimensions determine the contextual background for the facts.

Dimensions represent the way business people talk about the data resulting from a business process, e.g., who, what, when, where, why, how


Dimensional Hierarchy

Geography Dimension

Level            | Example members
-----------------|-------------------------------
World level      | World
Continent level  | America, Europe, Asia
Country level    | USA, Canada, Argentina
State level      | FL, GA, VA, CA, WA
City level       | Miami, Tampa, Naples, Orlando

Each node is a dimension member / business entity. Attributes: Population, Tourists Place.

Dimension Types

- Conformed Dimension
- Junk Dimension
- Fast Changing Dimension
- Role Playing Dimension
- Garbage Dimension
- Slowly Changing Dimension
- Degenerate Dimension


What is a Slowly Changing Dimension?


Although dimension tables are typically static lists, most dimension tables do change over time.

Since these changes are smaller in magnitude compared to changes in fact tables, these dimensions are known as slowly growing or slowly changing dimensions.


Slowly Changing Dimension -Classification

Slowly changing dimensions are classified into three different types:

- TYPE I
- TYPE II
- TYPE III

Slowly Changing Dimensions Type I

Before the change

Source:
Emp id | Name  | Email
1001   | Shane | Shane@xyz.com

Target:
Emp id | Name  | Email
1001   | Shane | Shane@xyz.com

After the change

Source:
Emp id | Name  | Email
1001   | Shane | Shane@abc.co.in

Target:
Emp id | Name  | Email
1001   | Shane | Shane@abc.co.in

The previous value (Shane@xyz.com) is overwritten in the target.
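The Type I behaviour shown above amounts to an overwrite keyed on the business key. A minimal Python sketch, assuming the target is held as a dictionary keyed by employee id (the function and field names are hypothetical):

```python
def apply_scd_type1(target: dict, source_row: dict, key: str = "emp_id") -> None:
    """Insert the source row, or overwrite the existing row with the same key.

    No history is kept: the old attribute values are lost.
    """
    target[source_row[key]] = dict(source_row)


target = {}
apply_scd_type1(target, {"emp_id": 1001, "name": "Shane", "email": "Shane@xyz.com"})
apply_scd_type1(target, {"emp_id": 1001, "name": "Shane", "email": "Shane@abc.co.in"})
print(target)
```

After the second call the target still holds a single row for employee 1001, carrying only the new email address.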

Slowly Changing Dimensions Type II - Versioning

Initial load

Source:
Emp id | Name  | Email
10     | Shane | Shane@xyz.com

Target:
PM_PRIMARYKEY | Emp id | Name  | Email         | PM_VERSION_NUMBER
1000          | 10     | Shane | Shane@xyz.com | 0

After the first change

Source:
Emp id | Name  | Email
10     | Shane | Shane@abc.co.in

Target:
PM_PRIMARYKEY | Emp id | Name  | Email           | PM_VERSION_NUMBER
1000          | 10     | Shane | Shane@xyz.com   |
1001          | 10     | Shane | Shane@abc.co.in |

After the second change

Source:
Emp id | Name  | Email
10     | Shane | Shane@abc.com

Target:
PM_PRIMARYKEY | Emp id | Name  | Email           | PM_VERSION_NUMBER
1000          | 10     | Shane | Shane@xyz.com   |
1001          | 10     | Shane | Shane@abc.co.in |
1003          | 10     | Shane | Shane@abc.com   |
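The versioning approach can be sketched as "never update, always append a new version". The sketch below is an illustration under stated assumptions, not the tool's actual logic: the slides show only the first version number (0), so incrementing by one per change is an assumption, and the sequential key generator is hypothetical (it yields 1000, 1001, 1002, which differs from the 1003 shown on the last slide).

```python
from itertools import count

_next_key = count(1000)  # hypothetical surrogate-key generator


def apply_scd_type2_version(target: list, source_row: dict) -> None:
    """Append a new version row when the tracked attribute (email) changes."""
    versions = [r for r in target if r["emp_id"] == source_row["emp_id"]]
    latest = max(versions, key=lambda r: r["version"], default=None)
    if latest is not None and latest["email"] == source_row["email"]:
        return  # nothing changed, keep the current version
    target.append({
        "pk": next(_next_key),
        **source_row,
        # Assumption: versions start at 0 and grow by 1 per change.
        "version": 0 if latest is None else latest["version"] + 1,
    })


target = []
for email in ("Shane@xyz.com", "Shane@abc.co.in", "Shane@abc.com"):
    apply_scd_type2_version(target, {"emp_id": 10, "name": "Shane", "email": email})
```

Every historical email stays queryable; the row with the highest version number is the current one.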

Slowly Changing Dimensions Type II - Flag Current

Initial load

Source:
Emp id | Name  | Email
10     | Shane | Shane@xyz.com

Target:
PM_PRIMARYKEY | Emp id | Name  | Email         | PM_CURRENT_FLAG
1000          | 10     | Shane | Shane@xyz.com |

After the first change

Source:
Emp id | Name  | Email
10     | Shane | Shane@abc.co.in

Target:
PM_PRIMARYKEY | Emp id | Name  | Email           | PM_CURRENT_FLAG
1000          | 10     | Shane | Shane@xyz.com   |
1001          | 10     | Shane | Shane@abc.co.in |

After the second change

Source:
Emp id | Name  | Email
10     | Shane | Shane@abc.com

Target:
PM_PRIMARYKEY | Emp id | Name  | Email           | PM_CURRENT_FLAG
1000          | 10     | Shane | Shane@xyz.com   |
1001          | 10     | Shane | Shane@abc.co.in |
1003          | 10     | Shane | Shane@abc.com   |
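A possible reading of the flag approach, sketched in Python. The slides do not show the flag values, so the convention used here (1 for the current row, 0 for superseded rows) is an assumption, as are the function and field names. The surrogate keys are passed in explicitly to mirror the slide.

```python
def apply_scd_type2_flag(target: list, source_row: dict, pk: int) -> None:
    """Keep history and mark exactly one row per business key as current."""
    current = next(
        (r for r in target
         if r["emp_id"] == source_row["emp_id"] and r["current_flag"] == 1),
        None,
    )
    if current is not None:
        if current["email"] == source_row["email"]:
            return  # no change
        current["current_flag"] = 0  # assumption: 0 marks a superseded row
    target.append({"pk": pk, **source_row, "current_flag": 1})


target = []
changes = [(1000, "Shane@xyz.com"), (1001, "Shane@abc.co.in"), (1003, "Shane@abc.com")]
for pk, email in changes:
    apply_scd_type2_flag(target, {"emp_id": 10, "name": "Shane", "email": email}, pk)
```

Queries for the present state then filter on the flag, while history queries ignore it.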

Slowly Changing Dimensions Type II - Effective Date

Initial load

Source:
Emp id | Name  | Email
10     | Shane | Shane@xyz.com

Target:
PM_PRIMARYKEY | Emp id | Name  | Email         | PM_BEGIN_DATE | PM_END_DATE
1000          | 10     | Shane | Shane@xyz.com | 01/01/00      |

After the first change

Source:
Emp id | Name  | Email
10     | Shane | Shane@abc.co.in

Target:
PM_PRIMARYKEY | Emp id | Name  | Email           | PM_BEGIN_DATE | PM_END_DATE
1000          | 10     | Shane | Shane@xyz.com   | 01/01/00      | 03/01/00
1001          | 10     | Shane | Shane@abc.co.in | 03/01/00      |

After the second change

Source:
Emp id | Name  | Email
10     | Shane | Shane@abc.com

Target:
PM_PRIMARYKEY | Emp id | Name  | Email           | PM_BEGIN_DATE | PM_END_DATE
1000          | 10     | Shane | Shane@xyz.com   | 01/01/00      | 03/01/00
1001          | 10     | Shane | Shane@abc.co.in | 03/01/00      | 05/02/00
1003          | 10     | Shane | Shane@abc.com   | 05/02/00      |
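The effective-date variant closes the open row and opens a new one on each change. A minimal sketch, assuming an open row is represented by an empty end date (`None`) and that the slide's dates are in mm/dd/yy form; both points are assumptions, and the function name is hypothetical.

```python
from datetime import date


def apply_scd_type2_dates(target: list, source_row: dict, pk: int, change_date: date) -> None:
    """Close the open row for this key and append a new open row."""
    current = next(
        (r for r in target
         if r["emp_id"] == source_row["emp_id"] and r["end_date"] is None),
        None,
    )
    if current is not None:
        if current["email"] == source_row["email"]:
            return  # no change
        current["end_date"] = change_date
    target.append({"pk": pk, **source_row, "begin_date": change_date, "end_date": None})


target = []
changes = [
    (1000, "Shane@xyz.com", date(2000, 1, 1)),
    (1001, "Shane@abc.co.in", date(2000, 3, 1)),
    (1003, "Shane@abc.com", date(2000, 5, 2)),
]
for pk, email, day in changes:
    apply_scd_type2_dates(target, {"emp_id": 10, "name": "Shane", "email": email}, pk, day)
```

With begin and end dates on every row, a fact can be joined to the dimension version that was valid on the fact's own date.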

Slowly Changing Dimensions Type III

Initial load

Source:
Emp id | Name  | Email
10     | Shane | Shane@xyz.com

Target:
PM_PRIMARYKEY | Emp id | Name  | Email         | PM_Prev_ColumnName | PM_EFFECT_DATE
              | 10     | Shane | Shane@xyz.com |                    | 01/01/00

After the first change

Source:
Emp id | Name  | Email
10     | Shane | Shane@abc.co.in

Target:
PM_PRIMARYKEY | Emp id | Name  | Email           | PM_Prev_ColumnName | PM_EFFECT_DATE
              | 10     | Shane | Shane@abc.co.in | Shane@xyz.com      | 01/02/00

After the second change

Source:
Emp id | Name  | Email
10     | Shane | Shane@abc.com

Target:
PM_PRIMARYKEY | Emp id | Name  | Email         | PM_Prev_ColumnName | PM_EFFECT_DATE
              | 10     | Shane | Shane@abc.com | Shane@abc.co.in    | 01/03/00
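Type III keeps a single row and one "previous value" column, so only the most recent change is remembered. A short Python sketch of that update (field and function names are hypothetical; dates are kept as the strings shown on the slides):

```python
def apply_scd_type3(row: dict, new_email: str, effect_date: str) -> None:
    """Shift the current value into the 'previous' column, then overwrite it."""
    if row["email"] == new_email:
        return  # no change
    row["prev_email"] = row["email"]
    row["email"] = new_email
    row["effect_date"] = effect_date


row = {"emp_id": 10, "name": "Shane", "email": "Shane@xyz.com",
       "prev_email": None, "effect_date": "01/01/00"}
apply_scd_type3(row, "Shane@abc.co.in", "01/02/00")
apply_scd_type3(row, "Shane@abc.com", "01/03/00")
print(row)
```

After the second change the original address (Shane@xyz.com) is no longer stored anywhere, which is the trade-off of Type III against Type II.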

Degenerate Dimension

Dimension keys in a fact table without corresponding dimension tables are called degenerate dimensions.

Purpose of degenerate dimensions:
1. Generally used when each record in the fact table represents a transaction line item
2. Useful for grouping transaction line items belonging to a single transaction

Fast Changing Dimension

A fast changing dimension is a dimension whose attribute or attributes for a record (row) change rapidly over time.

1. Example: age of associates, income, daily balance etc.
2. Technique to handle fast changing dimension: create band tables
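A band table replaces a rapidly changing value with the range it falls into, so the dimension row changes only when the value crosses a boundary. The sketch below is illustrative: the band boundaries and labels are invented, and the lookup uses `bisect_right` from the standard library.

```python
from bisect import bisect_right

# Hypothetical age band table: (lower bound, band label)
AGE_BANDS = [(0, "0-17"), (18, "18-29"), (30, "30-44"), (45, "45-59"), (60, "60+")]
_LOWER_BOUNDS = [low for low, _ in AGE_BANDS]


def age_band(age: int) -> str:
    """Return the band label whose range contains the given age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right finds the first lower bound greater than age;
    # the band just before it is the one that contains age.
    return AGE_BANDS[bisect_right(_LOWER_BOUNDS, age) - 1][1]


print(age_band(17), age_band(18), age_band(72))
```

The fact table then references the band rather than the exact age, which removes most of the churn from the dimension.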


Role Playing Dimension


A single dimension which is expressed differently in a fact table using views is called a role-playing dimension. This can be achieved by creating views on dimension table.


Conformed Dimension
A conformed dimension means the same thing to each fact table to which it can be joined. Typically, dimension tables that are referenced or are likely to be referenced by multiple fact tables (multiple dimensional models) are called conformed dimensions.

Conformed Dimension Option #1

Identical dimensions with the same keys, labels, definitions and values.

Sales Schema
- Product dimension: PRODUCT KEY, Product Desc, Brand Desc, Category Desc
- SALES Facts: DATE KEY, PRODUCT KEY, STORE KEY, PROMO KEY

Inventory Schema
- Product dimension: PRODUCT KEY, Product Desc, Brand Desc, Category Desc
- INVENTORY Facts: DATE KEY, PRODUCT KEY, STORE KEY

Conformed Dimension Option #2

Subset of base dimension with common labels, definitions and values.

Sales Schema
- Product dimension: PRODUCT KEY, Product Desc, Brand Desc, Category Desc
  (example row: PROD KEY 12345, Prod Desc "Cherriors 10", Brand Desc "Cherriors", Category Desc "Cereal")
- Date dimension: DATE KEY, Day-of-week, Week Desc, Month Desc
- Sales facts: DATE KEY, PRODUCT KEY, STORE KEY, PROMO KEY, SALES $

Forecast Schema
- Brand dimension: BRAND KEY, Brand Desc, Category Desc
  (example row: BRAND KEY 12345, Brand Desc "Cherriors", Category Desc "Cereal")
- Month dimension: MONTH KEY, Month Desc
- Forecast facts: MONTH KEY, BRAND KEY, SALES $

Garbage Dimension

A garbage dimension is a dimension that consists of low-cardinality columns such as codes, indicators, and status flags.

Approaches to handle these attributes:

- Put the new attributes into existing dimension tables.
- Put the new attributes into the fact table.
- Create new separate dimension tables.
- Create a separate Garbage Dimension table.

Junk Dimensions

Whether to use a junk dimension:

- 5 indicators, each with 3 values -> 243 (3^5) rows
- 5 indicators, each with 100 values -> 10 billion (100^5) rows

When to insert rows in the dimension
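The row counts above are the size of the Cartesian product of the indicator values. A small sketch that computes the count and, for small cases, enumerates the junk dimension rows (the indicator names and values are made up for illustration):

```python
from itertools import product
from math import prod


def junk_dimension_size(cardinalities) -> int:
    """Number of rows if every combination of indicator values is pre-built."""
    return prod(cardinalities)


small = junk_dimension_size([3] * 5)    # 3**5
large = junk_dimension_size([100] * 5)  # 100**5

# Enumerating the rows is only practical when the product is small.
payment_types = ["cash", "card"]
gift_wrap = ["Y", "N"]
junk_rows = [
    {"junk_key": key, "payment_type": p, "gift_wrap": g}
    for key, (p, g) in enumerate(product(payment_types, gift_wrap), start=1)
]
print(small, large, len(junk_rows))
```

When the product is very large, rows are usually inserted only for the combinations that actually occur in the source data rather than pre-built.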

Type of Fact Tables - Comparison

Feature: Grain
- Transaction: one row per transaction.
- Periodic: one row per time period.
- Cumulative: one row for the entire lifetime of an event.

Feature: Dimension
- Transaction: date dimension at the lowest level of granularity.
- Periodic: date dimension at the end-of-period granularity.
- Cumulative: multiple date dimensions.

Feature: Facts
- Transaction: related to transaction activities.
- Periodic: related to periodic activities.
- Cumulative: related to activities which have a definite lifetime.

Feature: Database size
- Transaction: largest size. At the most detailed grain level, tends to grow very fast.
- Periodic: smaller than the Transaction fact table because the grain of the date & time dimension is significantly higher.
- Cumulative: smallest in size when compared to Transaction and Periodic fact tables.

Feature: Need for aggregate tables
- Transaction: high, primarily because the data is stored at a very detailed level.
- Periodic: none or very low, primarily because data is already stored at a highly aggregated level.
- Cumulative: medium, because the data is primarily stored at the day level.

Feature: Performance
- Transaction: performs well and can be improved by choosing a grain above the most detailed.
- Periodic: performs better than other fact table types because data is stored at a less detailed grain.
- Cumulative: performs well.

Factless Fact Tables

The two types of factless fact tables are:

- Coverage tables
- Event tracking tables

Factless Fact Tables - Coverage Tables

Coverage tables are required when a primary fact table is sparse.

Example: tracking products in a store that did not sell
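The "did not sell" question is a set difference between the coverage table and the sales fact table. A minimal sketch with invented store and product keys, using Python sets in place of tables:

```python
# Hypothetical coverage table: every (store, product) pair that was on offer.
coverage = {("store1", "soap"), ("store1", "milk"), ("store1", "tea")}

# Hypothetical sales fact keys: pairs for which at least one sale was recorded.
sales = {("store1", "milk")}

# Products that were offered in a store but did not sell.
did_not_sell = coverage - sales
print(sorted(did_not_sell))
```

The sales fact table alone cannot answer this, because it holds no row for an event that never happened.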

Factless Fact Tables - Event Tracking

These tables are used for tracking an event.

Example: tracking student attendance

Fact Constellation
Fact constellation: multiple fact tables share dimension tables. Viewed as a collection of stars, it is therefore called a galaxy schema or fact constellation.

What is a Data mart?


A data mart is a decentralized subset of data, found either in a data warehouse or as a standalone subset, designed to support the unique business unit requirements of a specific decision-support system. Data marts have specific business-related purposes such as measuring the impact of marketing promotions, or measuring and forecasting sales performance.

[Figure: data marts alongside the Enterprise Data Warehouse.]

Data marts - Main Features

- Low cost
- Controlled locally rather than centrally, conferring power on the user group
- Contain less information than the warehouse
- Rapid response
- More easily understood and navigated than an enterprise data warehouse
- Within the range of divisional or departmental budgets

Advantages of Datamart over Datawarehouse

- Typically single subject area and fewer dimensions
- Limited feeds
- Very quick time to market (30-120 days to pilot)
- Quick impact on bottom line problems
- Focused user needs
- Limited scope
- Optimum model for DW construction
- Demonstrates ROI
- Allows prototyping

Disadvantages of Data Mart

- Does not provide an integrated view of business information.
- Uncontrolled proliferation of data marts results in redundancy.
- A larger number of data marts is complex to maintain.
- Scalability issues for large numbers of users and increased data volume.

DM - Types

- Embedded data marts are marts that are stored within the central DW. They can be stored relationally as files or cubes.
- Dependent data marts are marts that are fed directly by the DW, sometimes supplemented with other feeds, such as external data.
- Independent data marts are marts that are fed directly by external sources and do not use the DW.


The Operational Data Store


Why We Need Operational Data Store?

Need
- To obtain a "system of record" that contains the best data that exists in a legacy environment as a source of information
- "Best" here implies data to be:
  - Complete
  - Up to date
  - Accurate
  - In conformance with the organization's information model

Operational Data Store - Insulated from OLTP

- ODS data resolves data integration issues.
- Data is physically separated from the production environment to insulate it from the processing demands of reporting and analysis.
- Access to current data is facilitated.

[Figure: OLTP Server, ODS, Tactical Analysis.]

Operational Data Store - Data


- Detailed data: records of business events (e.g. orders capture)
- Data from heterogeneous sources
- Does not store summary data
- Contains current data

ODS - Benefits

- Integrates the data
- Synchronizes the structural differences in data
- High transaction performance
- Serves the operational and DSS environment
- Transaction level reporting on current data

[Figure: flat files (e.g. 60,5.2,JOHN and 72,6.2,DAVID), a relational database and Excel files feed the Operational Data Store.]

Operational Data Store - Update schedule

Aspect           | ODS Data                          | Data warehouse Data
-----------------|-----------------------------------|----------------------------------
Update schedule  | Daily or less time frequency      | Weekly or greater time frequency
Detail of data   | Mostly between 30 and 90 days     | Potentially infinite history
Needs addressed  | Operational needs                 | Strategic needs

OLTP Vs ODS Vs DWH

Characteristic: Data redundancy
- OLTP: non-redundant within system; unmanaged redundancy among systems
- ODS: somewhat redundant with operational databases
- Data Warehouse: managed redundancy

Characteristic: Data stability
- OLTP: dynamic
- ODS: somewhat dynamic
- Data Warehouse: static

Characteristic: Data update
- OLTP: field by field
- ODS: field by field
- Data Warehouse: controlled batch

Characteristic: Data usage
- OLTP: highly structured, repetitive
- ODS: somewhat structured, some analytical
- Data Warehouse: highly unstructured, heuristic or analytical

Characteristic: Database size
- OLTP: moderate
- ODS: moderate
- Data Warehouse: large to very large

Characteristic: Database structure stability
- OLTP: stable
- ODS: somewhat stable
- Data Warehouse: dynamic

Star Schema Design

- Single fact table surrounded by denormalized dimension tables
- The fact table primary key is the composite of the foreign keys (primary keys of dimension tables)
- Fact table contains transaction type information
- Many star schemas in a data mart
- Easily understood by end users, more disk storage required
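The points above can be made concrete with a tiny star schema. This is a sketch only, using Python's built-in `sqlite3` module with an in-memory database; the table names, columns and rows are invented for illustration and are not taken from the deck. The fact table's primary key is the composite of its dimension foreign keys, and a typical query joins the fact table to a dimension and aggregates a measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_desc  TEXT,
    category_desc TEXT            -- denormalized: category kept on the product row
);
CREATE TABLE dim_store (
    store_key  INTEGER PRIMARY KEY,
    store_name TEXT
);
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    sales_amount REAL,
    PRIMARY KEY (product_key, store_key)  -- composite of the foreign keys
);
INSERT INTO dim_product VALUES (1, 'Corn flakes', 'Cereal'), (2, 'Green tea', 'Beverage');
INSERT INTO dim_store VALUES (10, 'East'), (20, 'West');
INSERT INTO fact_sales VALUES (1, 10, 100.0), (1, 20, 50.0), (2, 10, 30.0);
""")

# One join per dimension used, then group by a dimension attribute.
rows = conn.execute("""
SELECT p.category_desc, SUM(f.sales_amount)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_key = f.product_key
GROUP BY p.category_desc
ORDER BY p.category_desc
""").fetchall()
print(rows)
```

Because the category sits directly on the product dimension, the query needs a single join; a snowflaked design would add a second join to a separate category table.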


EXAMPLE OF STAR SCHEMA


Snowflake Schema

- Single fact table surrounded by normalized dimension tables
- Normalizes dimension tables to save data storage space
- Used when dimensions become very, very large
- Less intuitive, slower performance due to joins
- May want to use both approaches, especially if supporting multiple end-user tools


Example of Snow flake schema


Snowflake - Disadvantages
- Normalization of dimensions makes it difficult for users to understand
- Decreases query performance because it involves more joins
- Dimension tables are normally smaller than fact tables, so space may not be a major issue to warrant snowflaking

Data Acquisition

- Data Extraction
- Data Transformation
- Data Loading

Representative DW Tools

Tool Category: ETL Tools
Products: ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder

Tool Category: OLAP Server
Products: Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB

Tool Category: OLAP Tools
Products: Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos Powerplay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube

Tool Category: Data Warehouse
Products: Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, RedBricks

Tool Category: Data Mining & Analysis
Products: SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine, TCS Tools

ETL PRODUCTS

- CODE BASED ETL TOOLS
- GUI BASED ETL TOOLS

CODE BASED ETL TOOLS

- SAS ACCESS
- SAS BASE
- TERADATA ETL TOOLS
  1. BTEQ
  2. TPUMP
  3. FAST LOAD
  4. MULTI LOAD

GUI BASED ETL TOOLS


Informatica DT/Studio

Data Stage
Business Objects Data Integrator (BODI) AbInitio Data Junction

Oracle Warehouse Builder


Microsoft SQL Server Integration Services IBM DB2 Ware house Center

Extraction Types

Extraction
    Full Extract
    Periodic/Incremental Extract

Full Extract

[Figure: Source System -> Full Extract -> New data -> Data Mart]

Incremental Extract

[Figure: Source System -> Incremental Extract -> Incremental Data -> Data Mart. The incremental data consists of new data and changed data. New data is an incremental addition to the data mart; existing data is updated using the changed data.]
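
The difference between the two extraction types can be sketched in a few lines. The slides do not say how new and changed rows are detected; the sketch below assumes each source row carries an updated_at timestamp and that the time of the previous extract is remembered. The row layout and the function names are hypothetical.

```python
from datetime import datetime

def full_extract(source_rows):
    """Full extract: return every source row, changed or not."""
    return [dict(row) for row in source_rows]

def incremental_extract(source_rows, last_extract_time):
    """Incremental extract: return only rows added or changed since last time."""
    return [dict(row) for row in source_rows if row["updated_at"] > last_extract_time]

def apply_to_mart(mart, extracted_rows):
    """Add new rows to the mart and overwrite existing rows with changed data."""
    inserted = updated = 0
    for row in extracted_rows:
        if row["id"] in mart:
            updated += 1
        else:
            inserted += 1
        mart[row["id"]] = row
    return inserted, updated

source = [
    {"id": 1, "name": "Alice", "updated_at": datetime(2012, 4, 1)},
    {"id": 2, "name": "Bob", "updated_at": datetime(2012, 4, 20)},    # changed
    {"id": 3, "name": "Carol", "updated_at": datetime(2012, 4, 27)},  # new
]
# State of the data mart after the previous load.
mart = {
    1: {"id": 1, "name": "Alice", "updated_at": datetime(2012, 4, 1)},
    2: {"id": 2, "name": "Bobby", "updated_at": datetime(2012, 4, 1)},
}

delta = incremental_extract(source, last_extract_time=datetime(2012, 4, 15))
counts = apply_to_mart(mart, delta)
print(len(full_extract(source)), len(delta), counts)
```

Here the full extract moves all three rows, while the incremental extract moves only the changed row and the new row.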

DATA WAREHOUSE LOADING

Types of Data Warehouse Loading

Target update types
    Insert
    Update

Types of Data Warehouse Updates

[Figure: Source data -> Data Staging -> Data Warehouse. Update types shown: Insert (New Data, Changed Data, Point in Time Snapshots), Full Replace, Selective Replace, Update, Update plus Retain History]
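
A rough sketch of how several of these update types differ, using plain Python lists as a stand-in for warehouse tables. The function names and the valid_from / valid_to columns are assumptions made for the illustration and are not taken from the slides; selective replace is omitted because the deck does not describe its rules.

```python
from datetime import date

def full_replace(target, source):
    """Full replace: discard the target and reload it from the source."""
    return [dict(row) for row in source]

def insert_rows(target, new_rows):
    """Insert: append new rows; existing rows are left untouched."""
    return target + [dict(row) for row in new_rows]

def update_in_place(target, changed, key="id"):
    """Update: overwrite matching rows, so the previous values are lost."""
    changed_by_key = {row[key]: row for row in changed}
    return [{**row, **changed_by_key.get(row[key], {})} for row in target]

def update_retain_history(target, changed, as_of, key="id"):
    """Update plus retain history: close the current version, add a new one."""
    changed_keys = {row[key] for row in changed}
    result = []
    for row in target:
        row = dict(row)  # copy so the caller's rows are not modified
        if row[key] in changed_keys and row["valid_to"] is None:
            row["valid_to"] = as_of
        result.append(row)
    for row in changed:
        result.append({**row, "valid_from": as_of, "valid_to": None})
    return result

current = [{"id": 1, "city": "Pune", "valid_from": date(2011, 1, 1), "valid_to": None}]
changed = [{"id": 1, "city": "Mumbai"}]

overwritten = update_in_place(current, changed)
history = update_retain_history(current, changed, as_of=date(2012, 4, 28))
print([row["city"] for row in overwritten])
print([row["city"] for row in history])
```

With a plain update only the latest city survives; with update plus retain history both versions are kept, and the older one is closed with an end date.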

New Data and Point-In-Time Data Insert

[Figure: the source data supplies new data or a point-in-time snapshot (e.g., monthly); the new data is added to the existing data]
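
A point-in-time insert can be sketched as stamping each source row with the snapshot date and appending it, so earlier snapshots are never overwritten. The account/balance rows, the snapshot_date column, and the append_snapshot function are hypothetical names used only for this example.

```python
from datetime import date

def append_snapshot(warehouse, source_rows, snapshot_date):
    """Stamp each source row with the snapshot date and add it to the warehouse."""
    stamped = [{**row, "snapshot_date": snapshot_date} for row in source_rows]
    return warehouse + stamped

warehouse = []
# Month-end state of the same account, captured one month apart.
warehouse = append_snapshot(warehouse, [{"account": "A1", "balance": 100}], date(2012, 3, 31))
warehouse = append_snapshot(warehouse, [{"account": "A1", "balance": 140}], date(2012, 4, 30))

balances = [(row["snapshot_date"], row["balance"]) for row in warehouse]
print(balances)
```

Because every snapshot is kept, the warehouse can answer what the balance was at the end of any loaded month.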

Changed Data Insert

[Figure: changed data from the source data is added to the existing data]

Data Warehouse Life Cycle

[Figure: four stages, Business Requirement -> ETL -> Data Warehouse -> Info Access. Other labels in the diagram: Map Req. to OLTP, Reverse Engg., Map Data Sources, OLTP System, External Data Storage, Enterprise Data Warehouse, Reporting Tools, OLAP, Web Browsers, Mining]

Project Life Cycle

Software Requirement Specification
High Level Design (HLD)
Low Level Design (LLD)
Development
Unit Testing
System Integration Testing
Peer Review
User Acceptance Testing
Production
Maintenance

Meta Data in a Data Warehouse

What is Metadata?

Data about data and the processes
Metadata is stored in a data dictionary and repository
Insulates the data warehouse from changes in the schema of operational systems
Serves to identify the contents and location of data in the data warehouse

Why Do You Need Meta Data?

Share resources
    Users
    Tools
Document system
Without meta data
    Not sustainable
    Not able to fully utilize resources

The Role of Meta Data in the Data Warehouse

Meta data enables data to become information, because with it you
    Know what data you have
    Can trust it

Meta Data Answers

How have business definitions and terms changed over time?
How do product lines vary across organizations?
What business assumptions have been made?
How do I find the data I need?
What is the original source of the data?
How was this summarization created?
What queries are available to access the data?

Meta Data Process

Integrated with the entire process and data flow
Populated from beginning to end
Begin population at the design phase of the project
Dedicated resources throughout
    Build
    Maintain

[Figure: meta data spans every stage of the flow: Design, Mapping; Extract, Scrub, Transform; Load, Index, Aggregation; Replication, Data Set Distribution; Access & Analysis, Resource Scheduling & Distribution; System Monitoring]


Types of ETL Meta Data

ETL Meta Data
    Technical Meta Data
    Operational Meta Data

Classification of ETL Meta Data

Data Warehouse Meta Data
This meta data stores descriptive information about the physical implementation details of the data warehouse.

Source Meta Data
This meta data stores information about the source data and the mapping of source data to data warehouse data.
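
One way to picture source meta data is as a source-to-target mapping held as data rather than buried in code. The sketch below is an assumption about how such a mapping might look; the column names (CUST_NM, ORD_AMT, customer_name, order_amount) and the helper functions are hypothetical. The same mapping drives the load and answers the lineage question of where a warehouse column came from.

```python
# Hypothetical source-to-target mapping: one entry per warehouse column.
SOURCE_TO_TARGET = [
    {"source": "CUST_NM", "target": "customer_name", "transform": str.strip},
    {"source": "ORD_AMT", "target": "order_amount", "transform": float},
]

def apply_mapping(source_row, mapping):
    """Build a warehouse row by applying each mapped transformation."""
    return {m["target"]: m["transform"](source_row[m["source"]]) for m in mapping}

def original_source(target_column, mapping):
    """Look up which source column(s) feed a given warehouse column."""
    return [m["source"] for m in mapping if m["target"] == target_column]

warehouse_row = apply_mapping({"CUST_NM": " Alice ", "ORD_AMT": "12.5"}, SOURCE_TO_TARGET)
print(warehouse_row)
print(original_source("order_amount", SOURCE_TO_TARGET))
```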

ETL Meta Data

Transformations & Integrations
This meta data describes the transformation and loading steps in detail.

Processing Information
This meta data stores information about the activities involved in processing the data, such as scheduling and archiving.

End User Information
This meta data records information about user profiles and security.
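
Processing information is often captured as one record per job run. The sketch below is a minimal, assumed shape for such a record; the JobRun fields and the run_job helper are hypothetical and not taken from any particular ETL product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class JobRun:
    """Operational meta data for a single execution of an ETL job."""
    job_name: str
    started_at: datetime
    finished_at: Optional[datetime] = None
    rows_read: int = 0
    rows_loaded: int = 0
    status: str = "RUNNING"

def run_job(job_name, rows, load_row):
    """Load rows one by one and record what happened during the run."""
    run = JobRun(job_name=job_name, started_at=datetime.now(timezone.utc))
    try:
        for row in rows:
            run.rows_read += 1
            load_row(row)
            run.rows_loaded += 1
        run.status = "SUCCEEDED"
    except Exception as exc:
        # Keep the failure reason so the run can be diagnosed later.
        run.status = f"FAILED: {exc}"
    finally:
        run.finished_at = datetime.now(timezone.utc)
    return run

def reject(row):
    raise ValueError("bad row")

target = []
ok_run = run_job("load_orders", [1, 2, 3], target.append)
failed_run = run_job("load_orders", [1, 2, 3], reject)
print(ok_run.status, ok_run.rows_loaded, failed_run.status)
```

Storing these records over time gives the scheduling and monitoring history that the processing information category describes.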

ETL - Planning for the Movement

The following may be helpful when planning the movement:
Develop an ETL plan
Specifications
Implementation
