4/28/2012
TCS Confidential
These are the backbone systems of any enterprise, such as order entry and inventory.
Classic examples are airline reservations, credit-card authorizations, and ATM withdrawals.
Data Warehouse
Query processing; random CPU usage; history oriented; managerial view; denormalized design for query processing
Designed for a quiet or static database; organized by subject (Customer, Product)
Operational System: relatively smaller database; many concurrent users; volatile data; stores all data; performance sensitive; not flexible; efficiency
Data Warehouse: large database size; relatively few concurrent users; non-volatile data; stores relevant data; less sensitive to performance; flexible; effectiveness
Entry
Sales Rep, Quantity Sold, Part Number, Date, Customer Name, Product Description, Unit Price, Mail Address
Transactional Storage
Integration of Data
Encoding: Appl. A - M, F; Appl. B - 1, 0; Appl. C - X, Y. Integrated as M, F.
Unit of attributes: Appl. A - pipeline cm; Appl. B - pipeline inches; Appl. C - pipeline mcf. Integrated as pipeline cm.
Physical format: Appl. A - balance dec(13,2); Appl. B - balance PIC 9(9)V99; Appl. C - balance float. Integrated as balance dec(13,2).
Naming: Appl. A - bal-on-hand; Appl. B - current_balance; Appl. C - balance. Integrated as balance.
Date format: Appl. A - date (Julian); Appl. B - date (yymmdd); Appl. C - date (absolute). Integrated as date (Julian).
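The encoding and unit integration above can be sketched in a few lines of Python. This is only an illustration: the slide does not say which code means male or female in Appl. B and Appl. C, so that assignment is an assumption, and the function names are hypothetical.

```python
# Hypothetical per-application decoders. The codes mirror the slide, but which
# code stands for M or F in Appl. B and Appl. C is an assumption.
GENDER_MAP = {
    "A": {"M": "M", "F": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"X": "M", "Y": "F"},
}

CM_PER_INCH = 2.54


def to_warehouse_gender(app, code):
    """Translate an application-specific gender code to the warehouse's M/F."""
    return GENDER_MAP[app][code]


def to_warehouse_cm(app, length):
    """Convert a pipeline length to centimetres (Appl. A is already in cm)."""
    if app == "A":
        return length
    if app == "B":
        return length * CM_PER_INCH
    # Appl. C reports mcf, which is not a length, so no conversion is defined.
    raise ValueError(f"no length conversion defined for application {app!r}")


print(to_warehouse_gender("B", "1"))
print(to_warehouse_cm("B", 10))
```

Keeping the decoders in one table per attribute means a new source application only adds an entry rather than new logic.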
Volatility of Data
Transactional Storage (volatile): Insert, Change, Delete, Access. Record-by-record data manipulation; constant change, updated constantly.
Data Warehouse (non-volatile): Load, Access. Mass load / access of data; an initial load, then load/update.
Non-volatile does NOT mean the data warehouse is never updated or never changes!
DW Implementation Approaches
Top Down
Bottom Up
Combination of both
Choices depend on: current infrastructure, resources, architecture, ROI, implementation speed
[Figure: Staging, Enterprise Data Warehouse]
[Figure: Source System, Extract, Load, Presentation Area]
Between Extract and Load: dimensions; no user query support; data store: flat files or relational tables
Data Mart Bus: conformed facts and dims
Data Mart #1: dimensional; atomic AND summary data; business process centric
Data Mart #2
Design goals: ease of use, query performance
Access: ad hoc query tools, report writers
Bottom Up Approach
Integrated data; timely user access; conformed dimensions; single process to build dimension
E/R design or flat file; retain history needed for regular processing; no end user access
Data Warehouse
Data Mart: dimensional; transaction and summary data; single subject area (i.e. fact table); multiple marts may exist in a single database instance
[Figure: Source System, Extract, DWH, Extract, Presentation Area, Access; Flat File]
Data Warehouse: E/R model; subject areas; transaction level detail; historical persistency as justified; archive for retrieval if needed
Data Mart: most are dimensional; design by business function; summary level data
DW Implementation Approaches
Top Down
More planning and design initially
Involves people from different workgroups and departments
Data marts may be built later from the global DW
Consistent data definition and enforcement of business rules across the enterprise
High cost, lengthy process, time consuming
Works well when there is a centralized IS department responsible for all H/W and resources
Bottom Up
Data redundancy and inconsistency between data marts may occur
Integration requires great planning
Less cost of H/W and other resources
Faster pay-back
DW Architectures
[Figure: DW architecture]
Data sources: transaction data (Prod, Mkt, HR, Fin, Acctg); external data (Demographic, HarteHanks)
ETL: extract, load
Data stores: staging area, operational data store, data warehouse, data marts (Finance, Marketing, Sales), meta data
Access: queries, reporting, DSS/EIS, data mining
Users: analysts, executives, managers, operational personnel, customers/suppliers
Products shown: IBM IMS, VSAM, Oracle, Sybase, SAP, Informix, SQL, Ascential, DataStage, Sagent, SAS, Teradata, Cognos, Essbase, MicroStrategy, Microsoft, Siebel, Business Objects, web browser
Benefits of DWH
To formulate effective business, marketing
Data Modeling
Helps to visualize the business.
A model is a means of communication.
Models help elicit and document requirements.
Models reduce the cost of change.
The model is the essence of the DW architecture, based on which the DW will be implemented.
What do we want to do with the data?
The model depends on what kind of data analysis we want to do. Different data analysis techniques:
Query and reporting: display query results
Multidimensional analysis: analyse data content by looking at it from different perspectives
Data mining: discover patterns and clustering attributes in data
Analysis needs fast and easy access to data, with any number of analysis dimensions in any combination. An ER model will mean many joins, so a dimensional model is appropriate.
Data warehouse / data mart: yes!
Levels of modeling
Conceptual modeling: describes data requirements from a business point of view, without technical details
Logical modeling: refines conceptual models; data structure oriented, platform independent
Physical modeling: detailed specification of what is physically implemented using a specific technology
Modeling Techniques
Entity-Relationship Modeling
Traditional modeling technique
Technique of choice for OLTP
Suited for corporate data warehouse
Dimensional Modeling
Analyzes business measures in the specific business context
Helps visualize very abstract business questions
End users can easily understand and navigate the data structure
and association
described by a verb
Cardinality: 1-1, 1-M, M-M
Example: Books belong to Printed Media
Entity-Relationship Modeling - Basic Concepts
Attributes
Characteristics and properties of entities
Example: Book Id, Description, and book category are attributes of the entity Book
An attribute name should be unique and self-explanatory
Primary Key, Foreign Key, and Constraints are defined on attributes
Review: Terms & Symbols of Logical Modeling
[Figure: symbols for Entity and Relationship; example entity Factory with attribute Factory ID]
Examples: ER Model
Dimensional Modeling
Dimensional modeling uses three basic concepts: measures, facts, and dimensions.
It is powerful in representing the requirements of the business user in the context of database tables.
It focuses on numeric data, such as values, counts, weights, balances, and occurrences.
What Is a Fact
A fact is a collection of related data items, consisting of measures and context data. Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process.
Facts are measured, continuously valued, rapidly changing information. They can be calculated and/or derived.
Granularity: the level of detail of data contained in the data warehouse, e.g. daily item totals by product, by store.
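As a rough illustration of granularity, the sketch below rolls individual sales transactions up to the "daily item totals by product, by store" grain mentioned above. The transaction rows and field order are made up for the example.

```python
from collections import defaultdict

# Hypothetical transaction-level rows: (day, store, product, quantity).
transactions = [
    ("2000-01-01", "store-1", "cereal", 2),
    ("2000-01-01", "store-1", "cereal", 3),
    ("2000-01-01", "store-2", "cereal", 1),
    ("2000-01-02", "store-1", "milk", 4),
]

# Coarser grain: one row per (day, store, product) instead of one per sale.
daily_totals = defaultdict(int)
for day, store, product, quantity in transactions:
    daily_totals[(day, store, product)] += quantity

for key in sorted(daily_totals):
    print(key, daily_totals[key])
```

Four transaction rows collapse into three daily rows; the detail of the two separate cereal sales at store-1 is no longer recoverable at this grain.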
Types of Facts
Additive: able to add the facts along all the dimensions; discrete numerical measures, e.g. retail sales in $.
Semi-additive: snapshot, taken at a point in time; measures of intensity; not additive along the time dimension, e.g. account balance, inventory balance; added and divided by the number of time periods to get a time-average.
Non-additive: numeric measures that cannot be added across any dimension; intensity measures averaged across all dimensions, e.g. room temperature.
Textual facts: AVOID THEM.
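The difference between additive and semi-additive facts can be shown with a small sketch. The snapshot rows below are invented; sales stands in for an additive fact and balance for a semi-additive one.

```python
from collections import defaultdict

# Hypothetical daily snapshot rows: (day, account, sales, balance).
rows = [
    ("2000-01-01", "acct-1", 100.0, 500.0),
    ("2000-01-02", "acct-1", 50.0, 700.0),
    ("2000-01-01", "acct-2", 30.0, 200.0),
    ("2000-01-02", "acct-2", 20.0, 400.0),
]

# Additive: sales can be summed across every dimension, including time.
total_sales = sum(sales for _day, _account, sales, _balance in rows)

# Semi-additive: balances may be summed across accounts within one day...
balance_by_day = defaultdict(float)
for day, _account, _sales, balance in rows:
    balance_by_day[day] += balance

# ...but across time they are added and divided by the number of periods.
average_balance = sum(balance_by_day.values()) / len(balance_by_day)

print(total_sales, dict(balance_by_day), average_balance)
```

Summing the balance column over both days would give 1800.0, a figure that matches no real balance; the time-average of 900.0 is the meaningful number.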
Dimensions
A dimension is a collection of members or units of the same type of views. Dimensions determine the contextual background for the facts.
Dimensions represent the way business people talk about the data resulting from a business process, e.g., who, what, when, where, why, how
Dimensional Hierarchy
Geography Dimension
World Level: World
Continent Level: America, Europe, Asia
Country Level: USA, Canada, Argentina
State Level: FL, GA, VA, CA, WA
City Level: Miami, Tampa, Orlando, Naples
Each value at a level is a dimension member / business entity.
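A hierarchy like the geography dimension above is what makes roll-up possible. The sketch below stores part of it as child-to-parent links and aggregates a city-level measure to every higher level; the sales figures are invented and only the Florida branch is modelled.

```python
# Child -> parent links for one branch of the geography hierarchy.
PARENT = {
    "Miami": "FL", "Tampa": "FL", "Orlando": "FL", "Naples": "FL",
    "FL": "USA", "USA": "America", "America": "World",
}


def ancestors(member):
    """Return the members above `member`, from nearest level to World."""
    path = []
    while member in PARENT:
        member = PARENT[member]
        path.append(member)
    return path


def roll_up(city_sales):
    """Aggregate a city-level measure to every level of the hierarchy."""
    totals = {}
    for city, amount in city_sales.items():
        for member in [city] + ancestors(city):
            totals[member] = totals.get(member, 0) + amount
    return totals


totals = roll_up({"Miami": 10, "Tampa": 5})
print(totals)
```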
Dimension Types
Conformed Dimension
Junk Dimension (Garbage Dimension)
Fast Changing Dimension
Role Playing Dimension
Slowly Changing Dimension
Degenerate Dimension
Since these changes are smaller in magnitude compared to changes in fact tables, these dimensions are known as slowly growing or slowly changing dimensions.
Slowly changing dimensions are classified into three different types: TYPE I, TYPE II, TYPE III
Source columns: Emp id, Name, Email
Source row: Emp id 1001; Name Shane; Email Shane@abc.co.in
Target before the load: Name Shane; Email Shane@xyz.com
Target after the load: Emp id 1001; Name Shane; Email Shane@abc.co.in (previously Shane@xyz.com)
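Assuming the example above illustrates TYPE I (the new value simply overwrites the old one and no history is kept), the behaviour can be sketched as follows. The dictionary layout and the function name are illustrative only.

```python
def apply_type1(target, source_row):
    """Overwrite the target row for this employee; no history is kept."""
    target[source_row["emp_id"]] = dict(source_row)
    return target


# Target before the load, keyed by employee id.
target = {1001: {"emp_id": 1001, "name": "Shane", "email": "Shane@xyz.com"}}

apply_type1(target, {"emp_id": 1001, "name": "Shane", "email": "Shane@abc.co.in"})
print(target)
```

After the load the old address is gone, which is the trade-off of this type: simple to maintain, but no way to report on past values.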
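Source columns: Emp id, Name, Email
Source row: Emp id 10; Name Shane
Target columns: PM_PRIMARYKEY, Emp id, Name, Email, PM_VERSION_NUMBER

Source row: Emp id 10; Name Shane; Email Shane@abc.co.in
Target columns: PM_PRIMARYKEY, Emp id, Name, PM_VERSION_NUMBER
Target row: PM_PRIMARYKEY 1000; Emp id 10; Name Shane
Target row: PM_PRIMARYKEY 1001; Emp id 10; Name Shane

Source columns: Emp id, Name, Email
Source row: Emp id 10; Name Shane; Email Shane@abc.com
Target columns: PM_PRIMARYKEY, Emp id, Name, Email, PM_VERSION_NUMBER
Target row: PM_PRIMARYKEY 1000; Emp id 10; Name Shane; Email Shane@xyz.com
Target row: PM_PRIMARYKEY 1001; Emp id 10; Name Shane; Email Shane@abc.co.in
Target row: PM_PRIMARYKEY 1003; Emp id 10; Name Shane; Email Shane@abc.com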
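Source columns: Emp id, Name, Email
Source row: Emp id 10; Name Shane; Email Shane@xyz.com
Target columns: PM_PRIMARYKEY, Emp id, Name, PM_CURRENT_FLAG
Target row: PM_PRIMARYKEY 1000; Emp id 10; Name Shane

Source row: Emp id 10; Name Shane; Email Shane@abc.co.in
Target columns: PM_PRIMARYKEY, Emp id, Name, PM_CURRENT_FLAG
Target row: PM_PRIMARYKEY 1000; Emp id 10; Name Shane
Target row: PM_PRIMARYKEY 1001; Emp id 10; Name Shane

Source row: Emp id 10; Name Shane; Email Shane@abc.com
Target columns: PM_PRIMARYKEY, Emp id, Name, PM_CURRENT_FLAG
Target row: PM_PRIMARYKEY 1000; Emp id 10; Name Shane
Target row: PM_PRIMARYKEY 1001; Emp id 10; Name Shane
Target row: PM_PRIMARYKEY 1003; Emp id 10; Name Shane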
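Source columns: Emp id, Name, Email
Source row: Emp id 10; Name Shane; Email Shane@xyz.com
Target columns: PM_PRIMARYKEY, Emp id, Name, Email, PM_BEGIN_DATE, PM_END_DATE
Target row: PM_PRIMARYKEY 1000; Emp id 10; Name Shane; Email Shane@xyz.com; PM_BEGIN_DATE 01/01/00

Source row: Emp id 10; Name Shane; Email Shane@abc.co.in
Target columns: PM_PRIMARYKEY, Emp id, Name, Email, PM_BEGIN_DATE, PM_END_DATE
Target row: PM_PRIMARYKEY 1000; Emp id 10; Name Shane; Email Shane@xyz.com; PM_BEGIN_DATE 01/01/00; PM_END_DATE 03/01/00
Target row: PM_PRIMARYKEY 1001; Emp id 10; Name Shane; Email Shane@abc.co.in; PM_BEGIN_DATE 03/01/00

Source row: Emp id 10; Name Shane; Email Shane@abc.com
Target columns: PM_PRIMARYKEY, Emp id, Name, PM_BEGIN_DATE, PM_END_DATE
Target row: PM_PRIMARYKEY 1000; Emp id 10; Name Shane; PM_BEGIN_DATE 01/01/00; PM_END_DATE 03/01/00
Target row: PM_PRIMARYKEY 1001; Emp id 10; Name Shane; PM_BEGIN_DATE 03/01/00; PM_END_DATE 05/02/00
Target row: PM_PRIMARYKEY 1003; Emp id 10; Name Shane; PM_BEGIN_DATE 05/02/00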
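Assuming the PM_BEGIN_DATE / PM_END_DATE example above illustrates TYPE II with an effective date range (every change adds a row and closes the previous one), a minimal sketch could look like this. The field names, the function, and the way surrogate keys are passed in are assumptions made for the example.

```python
def apply_type2(target, source_row, load_date, next_key):
    """Close the open row for this employee and append a new version."""
    current = next(
        (row for row in target
         if row["emp_id"] == source_row["emp_id"] and row["end_date"] is None),
        None,
    )
    if current is not None:
        if current["email"] == source_row["email"]:
            return target  # nothing changed, keep the open row
        current["end_date"] = load_date
    target.append({
        "key": next_key,
        "emp_id": source_row["emp_id"],
        "name": source_row["name"],
        "email": source_row["email"],
        "begin_date": load_date,
        "end_date": None,  # an empty end date marks the current version
    })
    return target


target = []
apply_type2(target, {"emp_id": 10, "name": "Shane", "email": "Shane@xyz.com"}, "01/01/00", 1000)
apply_type2(target, {"emp_id": 10, "name": "Shane", "email": "Shane@abc.co.in"}, "03/01/00", 1001)
apply_type2(target, {"emp_id": 10, "name": "Shane", "email": "Shane@abc.com"}, "05/02/00", 1003)
for row in target:
    print(row)
```

The version-number and current-flag variants shown earlier differ only in what marks the latest row: a counter or a flag instead of an open-ended date range.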
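Source columns: Emp id, Name, Email
Source row: Emp id 10; Name Shane; Email Shane@xyz.com
Target columns: Emp id, Name, Email, PM_EFFECT_DATE
Target row: Emp id 10; Name Shane; Email Shane@xyz.com; PM_EFFECT_DATE 01/01/00

Source row: Emp id 10; Name Shane; Email Shane@abc.co.in
Target columns: PM_PRIMARYKEY, Emp id, Name, Email, PM_Prev_ColumnName, PM_EFFECT_DATE
Target row: Emp id 10; Name Shane; Email Shane@abc.co.in; PM_Prev_ColumnName Shane@xyz.com; PM_EFFECT_DATE 01/02/00

Source columns: Emp id, Name, Email
Source row: Emp id 10; Name Shane; Email Shane@abc.com
Target columns: PM_PRIMARYKEY, Emp id, Name, Email, PM_Prev_ColumnName, PM_EFFECT_DATE
Target row: Emp id 10; Name Shane; Email Shane@abc.com; PM_Prev_ColumnName Shane@abc.co.in; PM_EFFECT_DATE 01/03/00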
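Assuming the PM_Prev_ColumnName example above illustrates TYPE III (one row per employee, holding the current value and only the immediately previous one), the update rule can be sketched as below. Field and function names are illustrative.

```python
def apply_type3(row, new_email, effect_date):
    """Shift the current value into the 'previous' column, then overwrite it."""
    if new_email != row["email"]:
        row["prev_email"] = row["email"]
        row["email"] = new_email
        row["effect_date"] = effect_date
    return row


row = {"emp_id": 10, "name": "Shane", "email": "Shane@xyz.com",
       "prev_email": None, "effect_date": "01/01/00"}

apply_type3(row, "Shane@abc.co.in", "01/02/00")
apply_type3(row, "Shane@abc.com", "01/03/00")
print(row)
```

After the second change the original Shane@xyz.com value is no longer stored anywhere, which is the limitation of keeping a single previous-value column.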
Degenerate Dimension
Dimension keys in the fact table without corresponding dimension tables are called degenerate dimensions.
Purpose of degenerate dimensions:
1. Generally used when each record in the fact table represents a transaction line item
2. Useful for grouping transaction line items belonging to a single transaction
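The grouping use described above can be sketched with an order number acting as the degenerate dimension. The fact rows and column order are invented for the example.

```python
from collections import defaultdict

# Hypothetical line-item fact rows: (order_number, product_key, quantity, amount).
# order_number has no dimension table of its own: it is the degenerate dimension.
fact_rows = [
    ("ORD-1", 11, 2, 20.0),
    ("ORD-1", 12, 1, 5.0),
    ("ORD-2", 11, 1, 10.0),
]

# Group line items that belong to the same transaction.
order_totals = defaultdict(float)
for order_number, _product_key, _quantity, amount in fact_rows:
    order_totals[order_number] += amount

print(dict(order_totals))
```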
Conformed Dimension
A conformed dimension means the same thing to each fact table to which it can be joined. Typically, dimension tables that are referenced, or are likely to be referenced, by multiple fact tables (multiple dimensional models) are called conformed dimensions.
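Because a conformed dimension means the same thing to every fact table, measures from different fact tables can be lined up on it. The sketch below combines sales and inventory facts through one shared product dimension; all rows, field names, and the function name are assumptions for illustration.

```python
# One product dimension shared by both fact tables (values are made up).
product_dim = {12345: {"category": "Cereal"}}

sales_facts = [(12345, 100.0), (12345, 50.0)]   # (product_key, sales amount)
inventory_facts = [(12345, 40)]                 # (product_key, quantity on hand)


def drill_across(product_dim, sales_facts, inventory_facts):
    """Line up measures from two fact tables on the shared product key."""
    report = {
        key: {"category": attrs["category"], "sales": 0.0, "on_hand": 0}
        for key, attrs in product_dim.items()
    }
    for key, amount in sales_facts:
        report[key]["sales"] += amount
    for key, quantity in inventory_facts:
        report[key]["on_hand"] += quantity
    return report


report = drill_across(product_dim, sales_facts, inventory_facts)
print(report)
```

If the two schemas used different product keys or different category definitions, this combination would not be possible without a mapping step.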
[Figure: PRODUCT KEY shared by the Sales Schema (SALES Facts) and the Inventory Schema (INVENTORY Facts)]
[Figure: Sales Schema with PRODUCT KEY (PROD KEY 12345, Category Desc Cereal, SALES $) and Forecast Schema with BRAND KEY (BRAND KEY 12345, MONTH KEY, Month Desc, SALES $)]
Garbage Dimension
A garbage dimension is a dimension that consists of low-cardinality columns such as codes, indicators, and status flags.
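One common way to build such a dimension is to enumerate every combination of the low-cardinality values once and give each combination a single key. The flag names and values below are hypothetical.

```python
from itertools import product

# Hypothetical low-cardinality codes, indicators, and status flags.
FLAGS = {
    "payment_type": ["cash", "card"],
    "is_gift": ["Y", "N"],
    "status": ["open", "closed"],
}


def build_junk_dimension(flags):
    """Assign one surrogate key to each combination of flag values."""
    return {
        combo: key
        for key, combo in enumerate(product(*flags.values()), start=1)
    }


junk_dim = build_junk_dimension(FLAGS)

# A fact row then stores one junk key instead of three separate flag columns.
print(len(junk_dim), junk_dim[("card", "N", "closed")])
```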
Junk Dimensions
Transaction
One row per transaction.
Date dimension at the lowest level of granularity. Related to transaction activities. Largest size. At the most detailed grain level, tends to grow very fast.
Performance: performs well and can be improved by choosing a grain above the most detailed.
Periodic
One row per time period.
Date dimension at the end-of-period granularity. Related to periodic activities. Smaller than the Transaction fact table because the grain of the date & time dimension is significantly higher. No or very low, primarily because data is already stored at a highly aggregated level.
Performance: performs better than other fact table types because data is stored at a less detailed grain.
Cumulative
One row for the entire lifetime of an event.
Multiple date dimensions. Related to activities which have a definite lifetime. Smallest in size when compared to Transaction and Periodic fact tables. Medium, because the data is primarily stored at the day level.
Performance: performs well.
Coverage tables are required when a primary fact table is sparse. Example: tracking products in a store that did not sell.
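The example above reduces to a set difference: the coverage table records what was on offer, the sales fact table records what sold, and whatever remains did not sell. The rows below are invented.

```python
# Hypothetical coverage rows: every (store, day, product) that was on the shelf.
coverage = {
    ("store-1", "2000-01-01", "cereal"),
    ("store-1", "2000-01-01", "milk"),
    ("store-1", "2000-01-01", "bread"),
}

# The sparse sales fact table only has rows for products that actually sold.
sales = {("store-1", "2000-01-01", "cereal")}

did_not_sell = sorted(coverage - sales)
print(did_not_sell)
```

Without the coverage table the sales fact table alone cannot answer the question, because a missing row could mean either "not sold" or "not stocked".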
These tables are used for tracking an event. Example: tracking student attendance.
Fact Constellation
Fact constellations: multiple fact tables share dimension tables. Viewed as a collection of stars, this is therefore called a galaxy schema or fact constellation.
Data Mart
Data Mart advantages:
Typically single subject area and fewer dimensions
Limited feeds
Very quick time to market (30-120 days to pilot)
Quick impact on bottom line problems
Focused user needs
Limited scope
Optimum model for DW construction
Demonstrates ROI
Allows prototyping
Data Mart disadvantages:
Does not provide an integrated view of business information.
DM - Types
Embedded data marts are marts that are stored within the central DW. They can be stored relationally as files or cubes.
Dependent data marts are marts that are fed directly by the DW, sometimes supplemented with other feeds, such as external data.
Independent data marts are marts that are fed directly by external sources and do not use the DW.
Why We Need Operational Data Store?
Need: to obtain a system of record that contains the best data that exists in a legacy environment as a source of information.
Best here implies data to be:
Complete
Up to date
Accurate
In conformance with the organization's information model
Operational Data Store - Insulated from OLTP
[Figure: OLTP Server, ODS, Tactical Analysis]
Data is physically separated from the production environment to insulate it from the processing demands of reporting and analysis
Data from heterogeneous sources
Does not store summary data
Contains current data
ODS - Benefits
Integrates the data
Synchronizes the structural differences in data
High transaction performance
Serves the operational and DSS environment
Transaction level reporting on current data
[Figure: flat files, relational database, Excel files]
ODS Data
Update schedule: daily or less time frequency
Detail of data is mostly between 30 and 90 days
Addresses operational needs
[Comparison table, row and column labels not recoverable: Somewhat dynamic; Static; Field by field; Somewhat structured, some analytical; Moderate; Somewhat stable; Controlled batch; Highly unstructured, heuristic or analytical; Large to very large; Dynamic; Database size: Moderate]
Single fact table surrounded by denormalized dimension tables
The fact table primary key is the composite of the foreign keys (primary keys of dimension tables)
Fact table contains transaction type information
Many star schemas in a data mart
Easily understood by end users; more disk storage required
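The structure described above can be tried out with Python's built-in sqlite3 module. The table names, columns, and rows below are invented for the sketch; the points to notice are the composite primary key on the fact table and the one-join-per-dimension query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY, description TEXT, category TEXT);
CREATE TABLE date_dim (
    date_key INTEGER PRIMARY KEY, calendar_date TEXT, month TEXT);
-- The fact table key is the composite of the dimension foreign keys.
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    date_key INTEGER REFERENCES date_dim(date_key),
    quantity_sold INTEGER,
    sales_amount REAL,
    PRIMARY KEY (product_key, date_key));
INSERT INTO product_dim VALUES
    (1, 'Corn flakes', 'Cereal'), (2, 'Whole milk', 'Dairy');
INSERT INTO date_dim VALUES
    (20000101, '2000-01-01', '2000-01'), (20000102, '2000-01-02', '2000-01');
INSERT INTO sales_fact VALUES
    (1, 20000101, 3, 9.0), (1, 20000102, 2, 6.0), (2, 20000101, 1, 2.0);
""")

# A typical star query: join the fact table to each dimension and aggregate.
rows = conn.execute("""
SELECT p.category, d.month, SUM(f.sales_amount)
FROM sales_fact AS f
JOIN product_dim AS p ON p.product_key = f.product_key
JOIN date_dim AS d ON d.date_key = f.date_key
GROUP BY p.category, d.month
ORDER BY p.category
""").fetchall()
print(rows)
```

Category is stored directly on the product dimension rather than in its own table, which is the denormalization that costs extra storage but keeps the query to one join per dimension.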
Snowflake Schema
Snowflake - Disadvantages
Normalization of dimensions makes it difficult for users to understand
Decreases query performance because it involves more joins
Dimension tables are normally smaller than fact tables, so space may not be a major enough issue to warrant snowflaking
Data Acquisition
Data Extraction
Data Transformation
Data Loading
Products
ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder
Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB
Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos PowerPlay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube
Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, RedBricks
SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine, TCS Tools
OLAP Tools
ETL PRODUCTS
SAS BASE
TERADATA ETL TOOLS
1. BTEQ
2. TPUMP
3. FAST LOAD
4. MULTI LOAD
Data Stage
Business Objects Data Integrator (BODI)
AbInitio
Data Junction
Extraction Types
Full Extract
[Figure: Full Extract, new data, Data Mart]
Incremental Extract
[Figure: Incremental Extract from the Source System to the Data Mart, showing existing data, new data (incremental data), and changed data]
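One common way to pick up only new and changed data is to compare a last-modified timestamp against the time of the previous extract. The sketch below assumes the source rows carry such a timestamp (the `modified_at` field is hypothetical); sources without one need another change-detection method.

```python
from datetime import datetime


def incremental_extract(source_rows, last_extract_time):
    """Return only rows created or changed since the previous extract."""
    return [row for row in source_rows if row["modified_at"] > last_extract_time]


source_rows = [
    {"id": 1, "modified_at": datetime(2000, 1, 1)},   # existing, unchanged
    {"id": 2, "modified_at": datetime(2000, 3, 1)},   # changed since last run
    {"id": 3, "modified_at": datetime(2000, 5, 2)},   # new since last run
]

delta = incremental_extract(source_rows, datetime(2000, 2, 1))
print([row["id"] for row in delta])
```

A full extract would return all three rows on every run; the incremental extract moves only the two that changed after the previous run.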
DATA WAREHOUSE LOADING
Data Staging to Data Warehouse: Full Replace, Selective Replace, Update plus Retain History, Update
New data OR point-in-time snapshot (e.g. monthly)
New data added to existing data
Changed data
[Figure: OLTP System, External Data Storage, Map Req. to OLTP, Reverse Engg., ETL, Enterprise Data Warehouse, Info Access (reporting tools, OLAP, mining, web browsers)]
Peer Review
User Acceptance Testing
Production
Maintenance
What is Metadata?
Data about data and the processes
Metadata is stored in a data dictionary and repository.
It insulates the data warehouse from changes in the schema of operational systems.
It serves to identify the contents and location of data in the data warehouse.
Without meta data: not sustainable; not able to fully utilize resources
The Role of Meta Data in the Data Warehouse
Know what data you have, and you can trust it!
Design Mapping
Source Meta Data: stores information about the source data and the mapping of source data to data warehouse data.
Processing Information: stores information about the activities involved in the processing of data, such as scheduling and archives.
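Source meta data of this kind is often kept as explicit source-to-target mapping records. The sketch below is one possible shape, not a prescribed format: the class, the `account_fact` table name, and the "copy" transformation label are assumptions, while the source column names reuse the balance example from the integration slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMapping:
    """One source-to-target mapping record (hypothetical layout)."""
    source_system: str
    source_column: str
    target_table: str
    target_column: str
    transformation: str


MAPPINGS = [
    ColumnMapping("Appl. A", "bal-on-hand", "account_fact", "balance", "copy"),
    ColumnMapping("Appl. B", "current_balance", "account_fact", "balance", "copy"),
]


def sources_of(target_table, target_column):
    """Answer the question: where does this warehouse column come from?"""
    return [
        (m.source_system, m.source_column)
        for m in MAPPINGS
        if m.target_table == target_table and m.target_column == target_column
    ]


print(sources_of("account_fact", "balance"))
```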
The following may be helpful for planning the movement:
Develop an ETL plan
Specifications
Implementation