Contents
- Data Warehouse Concepts
- Data Modeling
- Dimensional Modeling
- Implementation and Maintenance
- Data Management
- Data Quality Analysis
- Metadata Management
- Data Governance
- Master Data Management
- Data Storage, Movement and Access
5 November 2012
A Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data that enables management decision making.
Subject Orientation
Figure: a process-oriented order entry record (Sales Rep, Quantity Sold, Part Number, Date, Customer Name, Product Description, Unit Price, Mail Address) in transactional storage, reorganized into subject-oriented stores: Sales, Customers, Products.
Data Volatility
Figure: volatile transactional storage supports insert and change; non-volatile warehouse data is access-only.
Time Variance
Figure: transactional storage holds current data; the warehouse holds historical data.
Figure: data warehouse loading — feeder/source systems (… FSn, including legacy systems) transmit data over the network into a staging area; cleansing, transformation, aggregation and summarization populate the ODS, the data warehouse (DW) and the data marts (DM1, DM2, … DMn).
Figure: centralized data warehouse architecture — source systems (legacy, client/server, OLTP, external) feed extract, transform, integrate and maintain processes into the DATA WAREHOUSE; a metadata repository describes its contents; users select data through an API and reporting tools.
Data Marts
Figure: the same architecture with a DATA MART in place of the enterprise warehouse — sources are extracted, transformed, integrated and maintained directly into the data mart.
Figure: multiple independent data marts, each extracted, transformed and integrated separately from the source systems.
Figure: hub-and-spoke architecture — the DATA WAREHOUSE feeds dependent data marts; a metadata repository describes the contents; users query through the API and reporting tools.
Figure: data marts consolidated into the DATA WAREHOUSE — extract, transform and integrate processes and a metadata repository tie the marts and the warehouse together for users.
Metadata
Representative DW Tools
Tool Category | Representative Tools
ETL Tools | ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder
OLAP Server Products | Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB
OLAP Tools | Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos PowerPlay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube
RDBMS | Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, RedBrick
Data Mining Tools | SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine, TCS Tools
Data Modeling
ER Model

Definition: a logical and graphical representation of the information needs of an organization.

Process: classifying entities, characterizing their attributes, and inter-relating them through relationships.
Features:
- Represents business requirements completely, correctly and consistently
- Removes redundancy
- Does not presuppose data granularity
- Is not directly implemented
Features:
- Optimized
- Efficient
- Buildable
- Robust
Dimensional Modeling
A form of analytical design (or physical model) in which data is pre-classified as a fact or a dimension.

Example request: "Give this period's total sales volume and revenue by product, business unit and package."
Figure: end-to-end flow — source systems (RDBMS, ERP, CRM, mainframe DBs, PC DBs) are extracted over the network into the STAGING AREA, which feeds the ODS, the DW and the data marts (aggregation, summarization, data mart population, dimension loading, fact loading); client browsers consume reports, cubes, analysis, data mining, dashboards, MIS reports, company quarterly reports, etc.
Standard approaches exist for common modeling situations:
- Slowly Changing Dimensions (SCDs)
- Heterogeneous products (e.g. savings account, current account)
- Pay-in-advance databases
- Event-handling databases (factless facts)
What is a Fact?
Facts are measures, e.g. Sales Volume and Revenue.
Example fact tables:

Sales — Date Key (int), Store Key (int), Product Key (int), Sales (float), Qty Sold (int), Price (float), Discount (float)

Billing — Date Key (int), Customer Key (int), Service Line Key (int), Rate Plan Key (int), Number of Total Minutes, Number of Calls (int), Service Charge (float), Taxes (float)

Other example measures: Revenue, Cost, No. of Accounts
The term FACT represents a single business measure, e.g. Sales or Qty Sold. Each fact has a GRAIN: the set of perspectives or attributes that define/qualify the fact completely. E.g. the grain of Sales could be per PRODUCT, at each STORE, on each DAY. A FACT TABLE is the primary table in a dimensional model where the business measures (FACTS) are stored; each row in a fact table is one business measure. All facts in a fact table must be at the SAME GRAIN.
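The same-grain rule can be sketched in a few lines of Python (the fact layout and data here are illustrative, not from the deck):

```python
from collections import namedtuple

# Hypothetical Sales fact at the grain: per product, per store, per day.
SalesFact = namedtuple("SalesFact",
                       ["date_key", "store_key", "product_key",
                        "qty_sold", "sales_amount"])

sales_facts = [
    SalesFact(20121105, store_key=1, product_key=101, qty_sold=3, sales_amount=29.97),
    SalesFact(20121105, store_key=1, product_key=102, qty_sold=1, sales_amount=4.50),
]

# The grain is the set of dimension keys that completely qualifies each fact.
GRAIN = ("date_key", "store_key", "product_key")

def check_same_grain(rows, grain):
    """Every fact row must sit at a unique point of the declared grain."""
    seen = set()
    for row in rows:
        key = tuple(getattr(row, g) for g in grain)
        if key in seen:          # two measurements at the same grain point
            return False
        seen.add(key)
    return True

print(check_same_grain(sales_facts, GRAIN))  # True
```

A second row for the same product, store and day would fail this check and signal a grain violation.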
What is a Dimension?
A data warehouse is subject-oriented: each analysis subject (e.g. Product, Business Unit, Package) becomes a dimension.
Example dimension tables:

Store — Store Key (int), Store name (char), Street Address (char), City (char), State (char), Region (char), Country (char)

Product — Product Key (int), Product id (char/int), Product name (char), Product Group, Brand, Department

Other common dimensions: Customer, Geography, Time
The term DIMENSION represents a single category or perspective by which associated FACTS are interpreted and understood. E.g. Store is a perspective by which sales are understood: it answers the question "Where did the sales occur?"

A DIMENSION TABLE holds a list of attributes or qualities of the dimension most often used in queries and reports. E.g. the Store dimension can have attributes such as the street and block number, the city, the region and the country where it is located, in addition to its name. Every row in the dimension table represents a unique instance of that dimension and has a unique identifier called the DIMENSION KEY.
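A dimension row answering "where did the sale occur?" can be sketched as a surrogate-key lookup (the table contents below are illustrative):

```python
# Hypothetical Store dimension: one row per store, keyed by a surrogate key.
store_dim = {
    1: {"store_name": "Downtown", "city": "Tampa",   "state": "FL", "country": "USA"},
    2: {"store_name": "Airport",  "city": "Orlando", "state": "FL", "country": "USA"},
}

# A fact row carries only the dimension key; the attributes live in the dimension.
fact_row = {"date_key": 20121105, "store_key": 2, "sales": 120.0}

store = store_dim[fact_row["store_key"]]
print(store["city"])  # Orlando -- answers "where did the sale occur?"
```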
Figure: dimension members at each attribute grain — continent (America, Europe, Asia), country (Canada, Argentina), state (FL, GA, VA, CA, WA), city (Tampa, Orlando), with descriptive attributes such as economic status (developing, third world), income band (middle, lower) and settlement type (town, village).
Figure: Store Sales star schema — the central Sales fact table (Date Key (int), Store Key (int), Product Key (int), Customer Key (int), Payment Type Key (int), Sales (float), Qty Sold (int), Price (float), Discount (float)) joined to the Store, Product, Time, Customer and Payment Type dimension tables. Store — Store Key (int), Store name (char), Street Address (char), City (char), State (char), Region (char), Country (char). Product — Product Key (int), Product id (char/int), Product name (char), Product Group, Brand, Department.
The Star Join Schema (STAR SCHEMA) is a single FACT TABLE joined to a set of DIMENSION TABLES. It is simple, symmetric, extensible and optimized. The grain of a star schema is the grain of its central fact table.
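A star join and roll-up can be sketched as follows (tables and data are illustrative); this is the kind of query the star schema is optimized for:

```python
# Dimension tables, keyed by surrogate keys.
store_dim = {1: {"name": "Tampa #1",   "region": "South"},
             2: {"name": "Seattle #1", "region": "West"}}
product_dim = {101: {"name": "Widget", "brand": "Acme"},
               102: {"name": "Gadget", "brand": "Acme"}}

# Central fact table: one row per store/product combination here.
sales_fact = [
    {"store_key": 1, "product_key": 101, "sales": 100.0},
    {"store_key": 1, "product_key": 102, "sales": 50.0},
    {"store_key": 2, "product_key": 101, "sales": 75.0},
]

# Equivalent of: SELECT region, SUM(sales) FROM fact JOIN store ... GROUP BY region
totals = {}
for row in sales_fact:
    region = store_dim[row["store_key"]]["region"]   # the star join
    totals[region] = totals.get(region, 0.0) + row["sales"]

print(totals)  # {'South': 150.0, 'West': 75.0}
```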
Star Schema
Fact Table
This table is the core of the Star Schema Structure and contains the Facts or Measures available through the Data Warehouse.
These Facts answer the questions of What, How Much, or How Many.
Some Examples:
Sales Dollars, Units Sold, Gross Profit, Expense Amount, Net Income, Unit Cost, Number of Employees, Turnover, Salary, Tenure, etc.
Star Schema
Dimension Tables
These tables describe the facts or measures; they contain the attributes and may also be hierarchical. Dimensions answer the questions of Who, What, When, or Where. Some examples:
- Day, Week, Month, Quarter, Year
- Sales Person, Sales Manager, VP of Sales
- Product, Product Category, Product Line
- Cost Center, Unit, Segment, Business, Company
Star Schema
Figure: Sales_Fact (TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, plus the required business metrics or measures) joined to dimension tables such as Employee_Dim (EmployeeKey, EmployeeID, . . .).
Financial Transactions: transaction amount, no. of bonds, no. of transactions, service cost, ...

Non-Financial Transactions: no. of cheques cleared, no. of visits to a branch, no. of DEMAT transactions, ...
Figure: Financial Transactions fact surrounded by the Customer, Product, Time, Organization and Channel dimensions.
Figure: the same star with dimension grains — Customer (customer), Product (scheme), Time (day-hour), Organization (branch), Channel (channel).
Types of Dimensions
Primary Dimension
- Contributes to the fact grain; a set of these uniquely defines the associated fact
- E.g. a SALES fact is typically completely defined by store, product and time

Secondary Dimension
- Does NOT contribute to the fact grain
- Non-primary dimensions such as payment type, customer and manufacturer are still important for analysis of the SALES fact
- Useful for rich analytic slicing and dicing, e.g. top 10 customers

Degenerate Dimension
- A dimension without any attributes, but still useful for analysis
- Generally included in the associated fact table before the facts
- E.g. an invoice number, by itself, in a shipping fact
Conformed Dimension
- A dimension used across the enterprise; requires a standardized structure and definition
- Must be designed up front, before individual schemas are designed
- Plugs into multiple stars as either a primary or a secondary dimension
- E.g. Customer, Product, Store, Time, Employee
- Customer could be captured at the store card-swiping machine (sales fact), be part of a marketing promotion strategy (campaign fact), and be serviced by a call center for warranty replacements (warranty fact)
- Employee may be a sales rep claiming credit for sales (sales fact), a finance manager authorizing vendor payments (vendor payment fact), or a call center person taking customer calls (service call fact)
Slowly Changing Dimension
- Dimensional attributes change over time; these changing realities must be captured as history
- Requires special design techniques to keep the dimension single-valued for each fact row while still retaining history
- E.g. customer city, marital status, a sales rep's department
- Type 1: overwrite previous values
- Type 2: create an additional time-stamped dimension record; this automatically partitions history and requires the dimension key to be generalized
- Type 3: create an additional attribute column to retain one previous value (e.g. first value, previous value)
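A minimal sketch of the Type 2 technique, assuming a hypothetical customer dimension with validity dates (the row layout and names are not from the deck):

```python
from datetime import date

# Each Type 2 row is time-stamped; the surrogate key is generalized so the
# same customer can have several rows, one per historical version.
dim = [{"surrogate_key": 1, "customer_id": "C42", "city": "Tampa",
        "valid_from": date(2010, 1, 1), "valid_to": None}]

def scd_type2_update(dim, customer_id, attr, new_value, change_date):
    """Close the current row and insert a new time-stamped version (Type 2)."""
    current = next(r for r in dim
                   if r["customer_id"] == customer_id and r["valid_to"] is None)
    if current[attr] == new_value:
        return                                   # nothing changed
    current["valid_to"] = change_date            # expire the old version
    new_row = dict(current, valid_from=change_date, valid_to=None)
    new_row[attr] = new_value
    new_row["surrogate_key"] = max(r["surrogate_key"] for r in dim) + 1
    dim.append(new_row)

scd_type2_update(dim, "C42", "city", "Orlando", date(2012, 11, 5))
# The dimension now holds two rows for C42, partitioning its history:
# facts before the change join to key 1, facts after it to key 2.
```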
Rapidly Changing Small Dimension
- Same as an SCD, except the frequency of change is higher
- Need to track changes to attributes
- E.g. employee attributes such as appraisal rating; telecom products whose rate plans keep changing

Large Dimension
- Size increases with decreasing level of granularity
- Typical of public utility companies, government agencies, and human records kept by supermarkets (e.g. Shoppers Stop)
- Do NOT create SCDs to address slow changes/history; see the Monster Dimension for an SCD strategy
- Choose indexing strategies to reduce query run times
- Choose the RDBMS wisely, e.g. SQL Server vs. Oracle vs. Teradata
Rapidly Changing Monster Dimension
- Similar to a large dimension, but typical of, say, a large insurance customer dimension
- Customers and claims are rapidly created and changed; history must be tracked for credit and legal reasons
- Remove the continuously changing attributes to another dimension table, e.g. demographics
- Reduce the cardinality of these attributes by banding them, e.g. income_band, credit_band, etc.
- Create all possible combinations of these banded attributes, then assign a dimension key to each unique combination; this is the demographics table
- For the combination that represents the customer's status in a particular period, plug the demographic key into the fact as an additional key
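The banding strategy above can be sketched as follows (the band names and fact layout are illustrative):

```python
from itertools import product

# Band the volatile attributes, pre-build every combination as a small
# "demographics" dimension, and plug the matching key into each fact row.
income_bands = ["low", "middle", "high"]
credit_bands = ["poor", "fair", "good"]

demographics_dim = {
    key: {"income_band": inc, "credit_band": cred}
    for key, (inc, cred) in enumerate(product(income_bands, credit_bands), start=1)
}
# 3 x 3 = 9 rows cover every possible banded customer status.

def demographic_key(income_band, credit_band):
    """Look up the pre-assigned key for a customer's current banded status."""
    for key, row in demographics_dim.items():
        if row == {"income_band": income_band, "credit_band": credit_band}:
            return key

fact_row = {"customer_key": 42, "claim_amount": 1200.0,
            "demographic_key": demographic_key("middle", "good")}
```

When a customer's status changes, only the key stored in new fact rows changes; the monster customer dimension itself stays stable.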
Junk Dimension
- A convenient grouping of random flags and attributes, to get them out of the fact table
- Retain only useful fields; remove fields that make no sense at all, are inconsistently filled, or are of operational interest only
- Design like the demographics table: enumerate the unique combinations, assign an integer key, plug it into the fact
- Create new combinations (insert new dimension records) at ETL run time
- E.g. Yes/No flags in old retail transaction data
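The run-time variant can be sketched like this (the flag names are hypothetical): unlike the pre-built demographics table, combinations are inserted the first time the ETL sees them.

```python
# Junk dimension populated at ETL run time: each unique combination of the
# leftover flags gets one integer key, so the flags leave the fact table.
junk_dim = {}            # maps (flag combo) -> junk key

def junk_key(promo_flag, gift_wrap_flag):
    combo = (promo_flag, gift_wrap_flag)
    if combo not in junk_dim:
        junk_dim[combo] = len(junk_dim) + 1   # insert new dimension record
    return junk_dim[combo]

fact_rows = [{"sales": 10.0, "junk_key": junk_key("Y", "N")},
             {"sales": 20.0, "junk_key": junk_key("Y", "N")},
             {"sales": 5.0,  "junk_key": junk_key("N", "N")}]
# Only two junk rows exist, however many fact rows reference them.
```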
Role-playing Dimension
- The same dimension appears several times in one fact table; typically the Date/Time dimension plays many roles
- E.g. Order Fulfillment is a typical retail fact table with the dimensions Order Date, Packaging Date, Shipping Date and Delivery Date
- Create one fact table key for each role and one SQL view of the dimension for each role; use the view names in SQL queries
- In Business Objects, this scenario is designed using aliases and contexts
- E.g. (2) the Employee dimension plays Salesrep/Manager roles in a Sales Compensation fact and Appraiser/Appraisee roles in an Employee Appraisal fact
Dimensional Hierarchy
Geography Dimension
Figure: the Geography dimension hierarchy descends from the world level through states (FL, GA, VA, CA, WA) to cities (Miami, Tampa, Orlando, Naples).
Types of Facts
Value-based classification:
- Numeric facts
- Count/occurrence-based facts (e.g. employees assigned to a project)
- Non-numeric facts (e.g. comments in fact tables)
Figure 2
The measurements group nicely together into a single fact table with the same grain. Periodic snapshot: a snapshot is a measurement of status at a specific point in time. E.g. in Figure 2, earned premium is the fraction of the total policy premium that the insurance company can book as revenue during the particular reporting period. A periodic-snapshot-grained fact table represents a predefined time span.
Figure 1
The location dimension and product dimension each have two levels of hierarchy; for example, the location dimension has the region level and the plant level. Each dimension has members, such as the east and west regions of the location dimension. Although not shown in the figure, the time dimension has members such as 1996 and 1997. Each sub-cube holds a number representing the volume of production as a measurement. For example, in a specific time period (not expressed in the figure), the Armonk plant in the East region produced 11,000 cell phones of model number 1001.
This matrix forces us to list all the data marts we could possibly build and to name all the dimensions present in those data marts (at a high level).

A dimensional model is made up of one or more star schemas. Some of these star schemas may be snowflaked for better organization and storage.
Figure: bus matrix — candidate data marts (e.g. Outage, Vendor) against the dimensions Organization, Equipment, Employee, Customer, Accounts and Calendar.
Figure: Customer dimension hierarchies (Industry: industry segment, industry type; Geography: country, state, city; plus customer financial class) attached to the Sales fact (Date Key (int), Store Key (int), Product Key (int), Customer Key (int), Sales (float), Qty Sold (int), Price (float), Discount (float)).
Choose each fact for the fact table, making sure that the fact is relevant and has the same grain as the fact table. E.g. for a SALES fact table, typical facts would be price, quantity sold and sales amount, as these are all dimensioned by product, by store and by day.
Important Notes:
1. Each dimension table has a meaningless, single-part integer primary key called a surrogate key. This key also occurs as part of the central fact table's primary key.
2. All components of a fact table's primary key should ideally be foreign keys to the corresponding dimension tables.
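Surrogate key assignment during a dimension load can be sketched as follows (the natural key format is illustrative):

```python
# Map each natural (operational) key to a meaningless integer surrogate key.
surrogate_map = {}   # natural key -> surrogate key

def assign_surrogate(natural_key):
    """Return the existing surrogate key, or mint the next integer."""
    if natural_key not in surrogate_map:
        surrogate_map[natural_key] = len(surrogate_map) + 1
    return surrogate_map[natural_key]

# The dimension row keeps the natural key as an ordinary attribute; the fact
# stores only the surrogate, which joins back to the dimension.
product_key = assign_surrogate("SKU-0042")
print(product_key)  # 1
```

Using a surrogate rather than the operational key insulates the warehouse from key reuse and format changes in the source systems.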
A. Physical Design Steps
B. Physical Design Considerations
C. Physical Storage
D. Indexing
E. Performance Enhancement Techniques
F. Deployment Activities
G. Security
H. Backup and Recovery
I. Monitoring the Data Warehouse
J. User Training and Support
K. Managing the Data Warehouse
Naming Standards
- Database objects
- Word separators
- Names in the logical and physical model
- Physical file naming standards
Create an aggregates plan: identify the granularity level.
Establish clustering options: place and manage related units of data together.
- Ensure scalability
- Manage storage
- Provide ease of administration
- Design for flexibility
- Assign storage structures
Physical Storage
Types of storage structures: files (facts, dimensions) and indexed data structures.
Storage Considerations
Set correct units of database space allocation (the data block) and watch for row migration.

Resolve dynamic extension, which occurs when inserting a new record or updating an existing record.
Types of Indexes
- B-tree: the default and most common.
- B-tree cluster: defined specifically for a cluster.
- Hash cluster: defined specifically for a hash cluster.
- Bitmap indexes: compact; work best for columns with a small set of values.
- Bitmap join indexes: an index on one table that involves columns of one or more other tables through a join.
- Function-based: contain the pre-computed value of a function/expression.
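How a bitmap index answers a multi-value predicate can be sketched with Python integers as bitmaps (the column data is illustrative):

```python
# One bitmap per distinct value, one bit per row of the table.
rows = ["FL", "GA", "FL", "WA", "GA", "FL"]   # the indexed State column

bitmaps = {}
for pos, value in enumerate(rows):
    bitmaps.setdefault(value, 0)
    bitmaps[value] |= 1 << pos                # set this row's bit

# WHERE state = 'FL' OR state = 'GA' becomes a cheap bitwise OR:
matches = bitmaps["FL"] | bitmaps["GA"]
matching_rows = [pos for pos in range(len(rows)) if matches >> pos & 1]
print(matching_rows)  # [0, 1, 2, 4, 5]
```

With few distinct values the bitmaps stay small, which is why bitmap indexes suit low-cardinality warehouse columns.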
B-tree Index
Bitmapped Index
Clustered Index
Performance enhancement techniques include tuning initialization parameters and using data arrays.
Deployment Approaches
Top-Down Approach: deploy the overall enterprise DWH first, followed by the dependent data marts, one by one.

Bottom-Up Approach: gather departmental requirements and deploy the independent data marts, one by one.

Practical/General Approach: deploy the subject data marts one by one, with a flexible approach and fully conformed dimensions, following a waterfall model.
Security
Prepare a Security Policy
It should cover the scope of information, physical security, network and connections, DB access privileges, and the access matrix.
DWA - Roles
- Building the data warehouse
- Ongoing monitoring and maintenance of the data warehouse
Backup Strategy
- Should the data actually be discarded, or should it be moved to lower-cost bulk storage?
- What criteria should be applied to data to determine whether it is a candidate for removal?
- Should the data be condensed (profile records, rolling summarization, etc.)? If so, what condensation technique should be used?
- How should the data be indexed once it is removed (if there is ever to be any attempt to retrieve it)?
- Where and how should metadata be stored once the data is removed?
- Should metadata be allowed to be stored in the data warehouse for data that is not actively stored in the warehouse?
- What release information should be stored for the base technology (i.e., the DBMS) so that the data as stored does not become stale and unreadable?
- How reliable, physically, is the media the data will be stored on?
- What is the seek time to the first record upon retrieval?
Collection of Statistics
- Using statistics for growth planning
- Using statistics for fine-tuning
- Publishing trends for users
Support
- Help desk support
- Hotline support
- Technical support
- User representative
User Training
User Training Contents
- Should provide enough data content
- Should cover all the applications involved
- Should cover the features and usage of the tools used
- Platform upgrades
- Managing data growth
- Storage management
- ETL management
- Data model revisions
- Information delivery enhancements
- Ongoing fine-tuning
Data Management
Data Governance
Data Architecture
Metadata Management
Metadata is data about data: it describes how, when and by whom a particular set of data was collected, and how the data is formatted. Metadata management is becoming very important because, as systems become more interdependent, it is vital to know the impact of altering data.
Data Quality Management: data quality assurance (DQA) is the process of verifying the reliability and effectiveness of data. Maintaining data quality requires going through the data periodically and scrubbing it. Typically this involves updating it, standardizing it, and de-duplicating records to create a single view of the data, even if it is stored in multiple disparate systems.
Why DQM?
Typical user complaints:
- "Why is this NULL?"
- "Where can I get just one view of all the data?"
- "Is empid the same as emp_id?"
- "There are so many duplicate products on this list."
- "I am still not able to see the latest data."
- "Returns on investment are below expectations."
Definition
This indicates how entities are referenced throughout the enterprise. Definition problems can be further categorized as:
- Synonyms: the fields EMP_ID, EMPID and EM01 may or may not all refer to the same type of data
- Homonyms: fields that are spelled the same but really aren't the same (id or ID)
- Relationships: just because a field is named FK_INVOICE doesn't mean it is really a foreign key to the invoice file
Domains
Domains describe the range and types of values that can be present in a data set. Some examples of domain errors:
- Unexpected values, e.g. Home State = one of {Kan, Mic, Min, ...}
- Cardinality: a Yes/No field can have only two credible values
- Uniqueness, e.g. for a field, 98% of the data is NULL
- Constants and outliers
- Length of field, precision, scale
- Internationalization: date formats, postal codes, time zones, etc.
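A domain check of this kind can be sketched as a small profiling pass (the expected value set and column data are illustrative):

```python
# Declared domain for a State column; anything else is a domain error.
VALID_STATES = {"FL", "GA", "VA", "CA", "WA"}

column = ["FL", "Kan", None, "GA", None, "Mic", "FL", None]

# Flag values outside the declared domain, and report NULL density.
unexpected = [v for v in column if v is not None and v not in VALID_STATES]
null_ratio = column.count(None) / len(column)

print(unexpected)            # ['Kan', 'Mic']
print(round(null_ratio, 3))  # 0.375
```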
Completeness
This indicates whether or not all of the data is actually present. Completeness of a dataset can be gauged by its:
- Integrity: does the actual data map to our definition of the data?
- Accuracy: name and address matching, demographics checks
- Reliability: a zip code should match its city and state
- Redundancy: data duplication
- Consistency: is the same invoice number referenced with different amounts?
Validity
Validity indicates whether or not the data is valid. Validity checks used to spot data problems:
- Acceptability: e.g. a product part number should consist of a 7-character alphanumeric string with two characters and 5 digits
- Anomalies
- Timeliness
Data Flows
These checks relate to the aggregate results of moving data from source to target. Many data quality problems can be traced back to incorrect data loads, missed loads, or system failures that go unnoticed. Data flow checks that ensure data quality:
- Record counts: reconciliation of source and target record counts
- Checksums
- Timestamps
- Process time
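Two of these checks, record-count reconciliation and a checksum over the loaded rows, can be sketched as follows (the rows are illustrative):

```python
import hashlib

source_rows = [("INV-1", 100.0), ("INV-2", 250.0), ("INV-3", 75.5)]
target_rows = [("INV-1", 100.0), ("INV-2", 250.0), ("INV-3", 75.5)]

def checksum(rows):
    """Order-independent digest: sort first, then hash each row."""
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode())
    return h.hexdigest()

count_ok = len(source_rows) == len(target_rows)
checksum_ok = checksum(source_rows) == checksum(target_rows)
print(count_ok, checksum_ok)  # True True
```

A count mismatch catches missed or duplicated loads; a checksum mismatch catches rows that arrived but were altered in flight.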
Structural Integrity
These checks ensure that, when the data is taken as a whole, you are getting correct results. Structural integrity checks include:
- Cardinality checks between tables
- Primary keys: are they unique?
- Referential integrity: e.g. a product appears on an invoice but is missing from the product catalog
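The referential-integrity case in the list can be sketched as an orphan scan (the keys and rows are illustrative):

```python
# Every product referenced by an invoice line must exist in the catalog.
product_catalog = {101, 102, 103}
invoice_lines = [{"invoice": "INV-1", "product_key": 101},
                 {"invoice": "INV-2", "product_key": 999}]  # orphan reference

orphans = [line for line in invoice_lines
           if line["product_key"] not in product_catalog]
print(orphans)  # [{'invoice': 'INV-2', 'product_key': 999}]
```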
Business Rules
Business rule checks measure the degree of compliance between actual data and expected data. These checks consist of:
- Constraints: does the data comply with a known set of validations?
- Computational rules: is the formula for deriving an amount correct?
- Comparisons
- Functional dependencies
- Conditions
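A computational-rule check can be sketched like this (the rule and the tolerance are illustrative): line amount should equal quantity times unit price, less the discount.

```python
rows = [
    {"qty": 3, "unit_price": 9.99, "discount": 0.0, "amount": 29.97},
    {"qty": 2, "unit_price": 5.00, "discount": 1.0, "amount": 10.00},  # violates rule
]

def violates_rule(row, tolerance=0.005):
    """Flag rows whose stored amount disagrees with the derivation formula."""
    expected = row["qty"] * row["unit_price"] - row["discount"]
    return abs(expected - row["amount"]) > tolerance

bad = [r for r in rows if violates_rule(r)]
print(len(bad))  # 1
```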
Transformations
Transformation checks examine the impact of data transformations as data moves from system to system. The quality of data can be affected by incorrect transformation logic. The only way to identify such problems is to compare the source data set with the target data set and verify the transformations for:
- Computations
- Merging
- Filtering
- Relationships
Figure: a data profile surfaces physical and logical issues; business rules (for cleansing) and data parsing (using rules, text mining, etc.) address them.
Data quality goals:
- Accuracy, as defined by business rules
- Consistency: no redundancy across the warehouse
- Latency: no major change of data between the instant of data capture and when it is processed
DQM Architecture
DQM Methodology
- Modular approach to building solutions
- Clear and well-defined guidelines, checklists and standards
- Supports the onsite-offshore delivery model
- Flexibility to adapt to other methodologies
- E-T-V-X criteria reinforced by best practices and TCS quality initiatives
Ab Initio: provides a suite of software packages used for ETL in data warehouses. Its features include parallel data transformation, validation and filtering; real-time data capture; integration with relational DBMS systems; and data profiling capability.

Unitech and Actuate: a set of reporting tools used for trend analysis, point-to-point reconciliation, and detecting data inconsistency.
Business Objectives
- Reduction in marketing operational costs
- Reduction of marketing cycle times
- 360-degree view of customers
- Delivery of consistent messages across all customer channels
- Increased customer focus; better understanding and segmentation of customers
- Event-driven campaigns targeted at focused customer groups
- Improvement in campaign effectiveness
- Maintaining a large data volume: one of the largest data warehouses in Europe, 3.36 TB of data with a growth projection of 1% per week
Challenges
- Data quality issues in vital data attributes in BT's operational data store
- Cleansing a backlog of erroneous information stored in the database
- Decommissioning and migration of data from legacy systems; decommissioning of 30 TB of RDBMS
- Maintaining data integrity
Profile
Analyze
Hardware
IBM Sequent NUMA-Q server: a 16-quad machine with 2 GB RAM.
Application Architecture
Figure: application architecture — full-volume source data from the legacy system, sales system, ERP, CRM and billing system is profiled; the profiled information feeds data analysts, business owners, design documents and DQ reports. Profiler output reduces data assumptions, supports the design phases (requirement analysis, solution design, high/low-level design, system test data), and is used to build the BRR; the rules defined in the BRR are embedded into the data profiler for data audits and Data Quality Monitors (DQM). At deployment, live data is profiled for data audit against the live target system, and target test data is profiled so the profiler outputs can be compared to validate the transformation process.
Benefits to Client
- A huge backlog of data quality issues resolved, leading to millions of pounds' worth of savings for BT
- A generic name-and-address data cleansing methodology designed that can be reused as a prototype for similar requirements
- Profiling of live data on an ongoing basis to check compliance over time
- The BRR was used for developing Data Quality Monitors (DQM)
- Development of a uniform data dictionary for all the disparate source systems
- Reduced risks and more accurate planning
Client Speak
...We have made fantastic progress in managing to roll out some really big, complex deliveries... all thanks to your commitment, and your ability to work as a team in order to resolve issues quickly whilst under a lot of pressure. Well done everyone. - Simon Riley
Metadata Management
Metadata
Data about Data
Metadata Products
Candidate Product | Vendor
SuperGlue | Informatica
System Architect | Popkin
MetaStage | Ascential
Platinum Repository | Platinum Technologies
Advantage Repository | Computer Associates
Rochade | Allen Systems Group
Microsoft Repository | Platinum/Microsoft partnership
MetaCentre Repository | Data Advantage Group
Data Governance
Data Governance
Data governance (DG) refers to the overall management of the availability, usability, integrity and security of the data employed in an enterprise.

Why go in for it?
Master Data
Metadata