
ETL Testing in Less Time, With Greater Coverage, to Deliver Trusted Data
Much ETL testing today is done by SQL scripting or eyeballing of data on spreadsheets. These
approaches to ETL testing are very time-consuming, error-prone, and seldom provide complete test
coverage. Informatica Data Validation Option provides an ETL testing tool that can accelerate and
automate ETL testing in both development & test and production environments. This means that you
can deliver complete, repeatable and auditable test coverage in less time with no programming skills
required.
ETL Testing Use Cases
Production Validation Testing (testing data before moving into production). Sometimes called
table balancing or production reconciliation, this type of ETL testing is done on data as it is
being moved into production systems. The data in your production systems has to be right in
order to support your business decision making. Informatica Data Validation Option provides the
ETL testing automation and management capabilities to ensure that your production systems are
not compromised by the data update process.
Source to Target Testing (data is transformed). This type of ETL testing validates that the data
values after a transformation are the expected data values. The Informatica Data Validation
Option has a large set of pre-built operators to build this type of ETL testing with no programming
skills required.
Application Upgrades (same-to-same ETL testing). This type of ETL testing validates that the
data coming from an older application or repository is exactly the same as the data in the new
application or repository. Much of this type of ETL testing can be automatically generated, saving
substantial test development time.
Benefits of ETL Testing with Data Validation Option
Production Reconciliation. Informatica Data Validation Option provides automation and
visibility for ETL testing, to ensure that you deliver trusted data in your production system
updates.
IT Developer Productivity. 50% to 90% less time and fewer resources are required to do ETL testing.
Data Integrity. Comprehensive ETL testing coverage means lower business risk and greater
confidence in the data.

ETL Testing Fundamentals
by HariprasadT on March 29, 2012 in Are You Being Served
Introduction:
Comprehensive testing of a data warehouse at every point throughout the ETL (extract, transform, and load)
process is becoming increasingly important as more data is being collected and used for strategic decision-
making. Data warehouse or ETL testing is often initiated as a result of mergers and acquisitions, compliance
and regulations, data consolidation, and the increased reliance on data-driven decision making (use of
Business Intelligence tools, etc.). ETL testing is commonly implemented either manually or with the help of a
tool (functional testing tool, ETL tool, proprietary utilities). Let us understand some of the basic ETL concepts.
BI / Data Warehousing testing projects can broadly be divided into ETL (Extract, Transform, Load) testing and the subsequent report testing.
Extract, Transform, Load is the process that enables businesses to consolidate their data while moving it from place to place, i.e. moving data from source systems into the data warehouse. The data can arrive from any source:
Extract - Extracting the data from numerous heterogeneous systems.
Transform - Applying the business logic, as specified by the business, to the data derived from the sources.
Load - Loading the data into the final warehouse after completing the above two processes.
The ETL part of the testing mainly deals with how, when, from where and what data we carry in our data warehouse, from which the final reports are supposed to be generated. Thus, ETL testing spans each and every stage of data flow in the warehouse, starting from the source databases to the final target warehouse.
Star Schema
The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the
entity-relationship diagram of this schema resembles a star, with points radiating from a central table. The
center of the star consists of a large fact table and the points of the star are the dimension tables.
A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.
A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to
the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. The
cost-based optimizer recognizes star queries and generates efficient execution plans for them. A typical fact table contains keys and measures. For example, in the sample schema, the fact table sales contains the measures quantity sold, amount and average, and the keys time key, item key, branch key and location key.
The dimension tables are time, branch, item and location.
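To make the star query concrete, here is a minimal sketch in Python using the standard sqlite3 module, assuming a cut-down version of the sample schema above (a sales fact plus time and item dimensions); the exact column names are illustrative, not taken from any particular product.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two small dimension tables and one fact table keyed by surrogate keys.
cur.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE sales (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    quantity_sold INTEGER,
    amount REAL
);
INSERT INTO time_dim VALUES (1, 2012, 1), (2, 2012, 2);
INSERT INTO item_dim VALUES (10, 'Widget', 'Acme'), (11, 'Gadget', 'Acme');
INSERT INTO sales VALUES (1, 10, 5, 50.0), (1, 11, 2, 40.0), (2, 10, 3, 30.0);
""")

# A star query: the fact table joined to each dimension on a PK -> FK key,
# with the dimensions never joined to each other.
star_query = """
SELECT t.year, t.month, i.item_name,
       SUM(s.quantity_sold) AS qty, SUM(s.amount) AS amount
FROM sales s
JOIN time_dim t ON s.time_key = t.time_key
JOIN item_dim i ON s.item_key = i.item_key
GROUP BY t.year, t.month, i.item_name
ORDER BY t.year, t.month, i.item_name;
"""
for row in cur.execute(star_query):
    print(row)

In a real warehouse the same shape of query would run against the actual fact and dimension tables; the point is that every join goes from the fact table out to a dimension, never between dimensions.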
Snow-Flake Schema
The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star
schema. It is called a snowflake schema because the diagram of the schema resembles
a snowflake. Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data
has been grouped into multiple tables instead of one large table.
For example, a location dimension table in a star schema might be normalized into a location table and city
table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires
more foreign key joins. The result is more complex queries and reduced query performance.
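As a small illustration of the normalization just described, the sketch below (Python with sqlite3) contrasts a single wide location dimension with its snowflaked location and city tables; the column names are assumptions made for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star form: one wide dimension, city attributes repeated for every location.
CREATE TABLE location_dim_star (
    location_key INTEGER PRIMARY KEY,
    street TEXT, city_name TEXT, state TEXT, country TEXT
);

-- Snowflake form: city attributes split out and joined back by a foreign key.
CREATE TABLE city_dim (
    city_key INTEGER PRIMARY KEY,
    city_name TEXT, state TEXT, country TEXT
);
CREATE TABLE location_dim_snowflake (
    location_key INTEGER PRIMARY KEY,
    street TEXT,
    city_key INTEGER REFERENCES city_dim(city_key)
);
""")
print("Snowflaking removes the repeated city attributes at the cost of one extra join per query.")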
When to use star schema and snowflake schema?
When we refer to Star and Snowflake Schemas, we are talking about a dimensional model for a
Data Warehouse or a Datamart. The Star schema model gets its name from the design appearance because
there is one central fact table surrounded by many dimension tables. The relationship between the fact and
dimension tables is created by PK -> FK relationship and the keys are generally surrogate to the natural or
business key of the dimension tables. All data for any given dimension is stored in the one dimension table.
Thus, the design of the model could potentially look like a STAR. On the other hand, the Snowflake schema
model breaks the dimension data into multiple tables for the purpose of making the data more easily
understood or for reducing the width of the dimension table. An example of this type of schema might be a
dimension with Product data of multiple levels. Each level in the Product Hierarchy might have multiple
attributes that are meaningful only to that level. Thus, one would break the single dimension table into multiple
tables in a hierarchical fashion with the highest level tied to the fact table. Each table in the dimension hierarchy
would be tied to the level above by natural or business key where the highest level would be tied to the fact
table by a surrogate key. As you can imagine the appearance of this schema design could resemble
the appearance of a snowflake.
Types of Dimensions Tables
Type 1: This is a straightforward refresh. The fields are constantly overwritten and history is not kept for the column. For example, should a description change for a product number, the old value will be overwritten by the new value.
Type 2: This is known as a slowly changing dimension, as history can be kept. The column(s) where the history is captured has to be defined. In our example of the product description changing for a product number, if the slowly changing attribute captured is the product description, a new row of data will be created showing the new product description. The old description will still be contained in the old row.
Type 3: This is also a slowly changing dimension. However, instead of a new row, in the example, the old
product description will be moved to an old value column in the dimension, while the new description will
overwrite the existing column. In addition, a date stamp column exists to say when the value was updated.
Although there will be no full history here, the previous value prior to the update is captured. No new rows will
be created for history as the attribute is measured for the slowly changing value.
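The following pandas sketch shows the Type 1 and Type 2 behaviour for the product-description example above, with Type 3 noted in a comment; the frame layout and column names are assumptions rather than a prescribed design.

import pandas as pd

dim = pd.DataFrame({
    "product_no":   ["P100"],
    "description":  ["Old description"],
    "current_flag": ["Y"],
})
change = {"product_no": "P100", "description": "New description"}

# Type 1: overwrite in place, no history kept.
type1 = dim.copy()
type1.loc[type1["product_no"] == change["product_no"], "description"] = change["description"]

# Type 2: expire the current row and append a new row carrying the new value.
type2 = dim.copy()
mask = (type2["product_no"] == change["product_no"]) & (type2["current_flag"] == "Y")
type2.loc[mask, "current_flag"] = "N"
type2 = pd.concat(
    [type2, pd.DataFrame([{**change, "current_flag": "Y"}])],
    ignore_index=True,
)

# Type 3 would instead keep a single previous-value column (plus an update date)
# that is overwritten on each change, so only the prior value is retained.
print("Type 1:\n", type1, "\nType 2:\n", type2)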
Types of fact tables:
Transactional: Most facts will fall into this category. The transactional fact will capture transactional data such
as sales lines or stock movement lines. The measures for these facts can be summed together.
Snapshot: A snapshot fact will capture the current data at a point in time, for example at the end of a day. For instance, all the current stock positions, where items are and in which branch, at the end of a working day can be captured. Snapshot fact measures can be summed for that day, but cannot be summed across snapshot days, as the result would be incorrect.
Accumulative: An accumulative snapshot will sum data up for an attribute, and is not based on time. For example, to get the accumulative sales quantity for a sale of a particular product, the value for the row is recalculated each night, giving an accumulating total.
Key hit-points in ETL testing: There are several levels of testing that can be performed during data warehouse testing, and they should be defined as part of the testing strategy in the different phases (Component, Assembly, Product) of testing. Some examples include:
1. Constraint Testing: During constraint testing, the objective is to validate unique constraints, primary keys,
foreign keys, indexes, and relationships. The test script should include these validation points. Some ETL
processes can be developed to validate constraints during the loading of the warehouse. If the decision is
made to add constraint validation to the ETL process, the ETL code must validate all business rules and
relational data requirements. In Automation, it should be ensured that the setup is done correctly and
maintained throughout the ever-changing requirements process for effective testing. An alternative
to automation is to use manual queries. Queries are written to cover all test scenarios and executed manually.
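As an example of the manual-query alternative, the hedged sketch below (Python with sqlite3) runs two typical constraint checks, orphaned foreign keys and duplicated primary keys, against small in-memory tables whose names are placeholders only.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER, name TEXT);
CREATE TABLE fact_sales (sale_id INTEGER, customer_key INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'Ann'), (1, 'Ann again'), (2, 'Bob');
INSERT INTO fact_sales VALUES (10, 1, 99.0), (11, 9, 50.0);   -- key 9 has no parent row
""")

# Referential integrity: fact rows whose foreign key has no matching dimension row.
orphans = conn.execute("""
    SELECT f.sale_id, f.customer_key
    FROM fact_sales f LEFT JOIN dim_customer d ON f.customer_key = d.customer_key
    WHERE d.customer_key IS NULL
""").fetchall()

# Uniqueness: primary-key values that occur more than once in the dimension.
dup_keys = conn.execute("""
    SELECT customer_key, COUNT(*) FROM dim_customer
    GROUP BY customer_key HAVING COUNT(*) > 1
""").fetchall()

print("Orphaned foreign keys:", orphans)
print("Duplicate primary keys:", dup_keys)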
2. Source to Target Counts: The objective of the count test scripts is to determine if the record counts in the
source match the record counts in the target. Some ETL processes are capable of capturing record count
information such as records read, records written, records in error, etc. If the ETL process used can capture
that level of detail and create a list of the counts, allow it to do so. This will save time during the validation
process. It is always a good practice to use queries to double check the source to target counts.
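The sketch below shows what such a double-check could look like in Python with sqlite3; the table names (src_orders, tgt_orders, tgt_orders_rejects) are placeholders standing in for the real source, target and reject tables.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_orders (order_id INTEGER);
CREATE TABLE tgt_orders (order_id INTEGER);
CREATE TABLE tgt_orders_rejects (order_id INTEGER, reason TEXT);
INSERT INTO src_orders VALUES (1), (2), (3);
INSERT INTO tgt_orders VALUES (1), (2);
INSERT INTO tgt_orders_rejects VALUES (3, 'invalid date');
""")

def row_count(connection, table):
    """Plain COUNT(*) used to double-check the counts reported by the ETL tool."""
    return connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

source_count = row_count(conn, "src_orders")
target_count = row_count(conn, "tgt_orders")
rejected     = row_count(conn, "tgt_orders_rejects")

# Balancing rule: every source record is either loaded or explicitly rejected.
assert source_count == target_count + rejected, (
    f"Count mismatch: source={source_count}, target={target_count}, rejected={rejected}")
print("Source-to-target counts reconcile:", source_count, "=", target_count, "+", rejected)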
3. Source to Target Data Validation: No ETL process is smart enough to perform source to target field-to-field
validation. This piece of the testing cycle is the most labor intensive and requires the most thorough analysis of
the data. There are a variety of tests that can be performed during source to target validation. Below is a list of
tests that are best practices:
4. Transformation and Business Rules: Tests to verify all possible outcomes of the transformation rules, default values and straight moves, as specified in the Business Specification document. As a special mention, boundary conditions must be tested for the business rules.
5. Batch Sequence & Dependency Testing: ETLs in a DW are essentially a set of processes that execute in a particular sequence. Dependencies exist among the various processes, and respecting them is critical to maintaining the integrity of the data. Executing the sequences in the wrong order might result in inaccurate data in the warehouse. The testing process must include at least two iterations of the end-to-end execution of the whole batch sequence. Data must be checked for its integrity during this testing. The most common types of errors caused by an incorrect sequence are referential integrity failures, incorrect end-dating (if applicable) and rejected records.
6. Job Restart Testing: In a real production environment, ETL jobs/processes fail for a number of reasons (for example, database-related failures, connectivity failures, etc.). Jobs can fail when only partly executed. A good design always allows for restartability of the jobs from the point of failure. Although this is more of a design suggestion/approach, it is recommended that every ETL job is built and tested for restart capability.
7. Error Handling: Understanding that a script might fail during data validation may still confirm, through process validation, that the ETL process is working. During process validation the testing team will work to identify additional data cleansing needs, as well as identify consistent error patterns that could possibly be averted by modifying the ETL code. It is the responsibility of the validation team to identify any and all records that seem suspect. Once a record has been both data and process validated and the script has passed, the ETL process is functioning correctly. Conversely, if suspect records that have been identified and documented during data validation are not supported through process validation, the ETL process is not functioning correctly.
8. Views: Views created on the tables should be tested to ensure the attributes mentioned in the views are
correct and the data loaded in the target table matches what is being reflected in the views.
9. Sampling: Sampling involves creating predictions from a representative portion of the data that is to be loaded into the target table; these predictions are matched against the actual results obtained from the loaded data for business analyst testing. The comparison is verified to ensure that the predictions match the data loaded into the target table.
10. Process Testing: The testing of intermediate files and processes to ensure the final outcome is valid and
that performance meets the system/business need.
11. Duplicate Testing: Duplicate testing must be performed at each stage of the ETL process and on the final target table. This testing involves checking for duplicate rows and also for multiple rows with the same primary key, neither of which can be allowed.
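To illustrate the duplicate checks just described, here is a small pandas sketch that looks for both fully identical rows and distinct rows sharing the same primary key; the sample frame and key column are assumptions.

import pandas as pd

target = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 3],
    "name":        ["Ann", "Bob", "Bob", "Cara", "Carla"],
})

# Check 1: completely identical rows.
full_row_dups = target[target.duplicated(keep=False)]

# Check 2: distinct rows that nevertheless share a primary key value.
pk_dups = target[target.duplicated(subset=["customer_id"], keep=False)]

print("Duplicate rows:\n", full_row_dups)
print("Primary-key collisions:\n", pk_dups)
if not full_row_dups.empty or not pk_dups.empty:
    print("Duplicate testing failed: see the rows listed above")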
12. Performance: It is the most important aspect after data validation. Performance testing should check if the
ETL process is completing within the load window.
13. Volume: Verify that the system can process the maximum expected quantity of data for a given cycle in the
time expected.
14. Connectivity Tests: As the name suggests, this involves testing the upstream and downstream interfaces and intra-DW connectivity. It is suggested that the testing represents the exact transactions between these interfaces. For example, if the design approach is to extract files from the source system, we should actually test extracting a file out of the system and not just the connectivity.
15. Negative Testing: Negative testing checks whether the application fails where it should fail, using invalid inputs and out-of-boundary scenarios, and verifies the behavior of the application in those cases.
16. Operational Readiness Testing (ORT): This is the final phase of testing which focuses on verifying the
deployment of software and the operational readiness of the application. The main areas of testing in this
phase include:
Deployment Test
1. Tests the deployment of the solution
2. Tests overall technical deployment checklist and timeframes
3. Tests the security aspects of the system including user authentication and
authorization, and user-access levels.
Conclusion
Evolving needs of the business and changes in the source systems will drive continuous change in the data
warehouse schema and the data being loaded. Hence, it is necessary that development and testing processes
are clearly defined and followed, along with impact analysis and strong alignment between development, operations and the business.

Data Quality in Data Warehouse
by Mallikharjuna Pagadala on October 25, 2012 in ETL and BI Testing, Quality Assurance and Testing Services
Poor-quality data creates problems for both sides of the house, IT and business. According to a study published by The Data Warehousing Institute (TDWI) entitled "Taking Data Quality to the Enterprise through Data Governance", some issues are primarily technical in nature, such as the extra time required for reconciling data
or delays in deploying new systems. Other problems are closer to business issues, such as customer
dissatisfaction, compliance problems and revenue loss. Poor-quality data can also cause problems with costs
and credibility.
Data quality affects all data-related projects and refers to the state of completeness, validity, consistency,
timeliness and accuracy that makes data appropriate for a specific use. This means that in any kind of project
related to data, one has to ensure the best possible quality by checking the right syntax of columns, detecting
missing values, optimizing relationships, and correcting any other inconsistencies.
Expected Features for any Data Quality Tool are listed below.
Table Analysis
o Business Rule analysis
o Functional Dependency
o Column set Analysis
Data Consistency validation
o Columns from different tables
o Tables from the same database
o Tables from different databases
o A data source such as a file can be compared with the current database
Results in tabular/graph format
Powerful pattern-searching capability (regex functions)
Data Profiling capabilities
Has option to store functions as library/ reusable component
Metadata Repository
Can be used as a testing tool for DB/ ETL projects
Quickly Browsing Data Structures
Getting an Overview of Database Content
Do Columns Contain Null or Blank Values?
About Redundant Values in a Column
Is a Max/Min Value for a Column Expected
What is the best selling product?
Using Statistics
Analyzing a Date Column
Analyzing Intervals in Numeric Data
Targeting Your Advertising
Identify and Correct Bad Data (Date, Zip Code)
Getting a Column's Pattern
Detecting Keys in Tables
Using the Business Rule (Data Quality Rule)
Are There Duplicate Records in my Data?
Column Comparison Analysis
Discover Duplicate Tables
Recursive Relationships: Does Supervisor ID also Exist as Employee ID?
Deleting Redundant Columns
Executing Text Analysis
Creating a Correlation Analysis
Storing and Running Your Own Queries
Creating a Report (PDF/HTML/XML)
Can data be Corrected Using Soundex?
Talend Open Studio for Data Quality helps discover and understand the quality of data in the data warehouse and addresses all the above-mentioned features. It makes it easy to carry out accurate data profiling processes and thus reduce the time and resources needed to find data anomalies. Its comprehensive data profiling features will help enhance and accelerate data analysis tasks.
Strategies for Testing Data Warehouse Applications
by VinodG on March 29, 2012 in ETL and BI Testing
Introduction:
There is an exponentially increasing cost associated with finding software defects later in the development
lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect
data to make critical business decisions. Given the importance of early detection of software defects, let's first
review some general goals of testing an ETL application:
The content below describes the various common strategies used to test the data warehouse system:
Data completeness: Ensures that all expected data is loaded into the target table.
1. Compare record counts between source and target; check for any rejected records.
2. Check that data is not truncated in the columns of the target table.
3. Check that unique values are loaded into the target; no duplicate records should exist.
4. Perform boundary value analysis (for example: only data with year >= 2008 has to be loaded into the target); see the sketch below.
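A hedged pandas sketch of these four completeness checks is shown below; the table layout is invented for illustration, and only the year >= 2008 boundary rule comes from the example above.

import pandas as pd

source = pd.DataFrame({"id": [1, 2, 3], "name": ["Alpha", "Beta", "Gamma"], "year": [2007, 2008, 2009]})
target = pd.DataFrame({"id": [2, 3], "name": ["Beta", "Gamma"], "year": [2008, 2009]})
expected_rejects = (source["year"] < 2008).sum()

# 1. Record counts between source and target, allowing for expected rejects.
assert len(source) == len(target) + expected_rejects

# 2. No truncation: target values must not be shorter than their source values.
merged = source.merge(target, on="id", suffixes=("_src", "_tgt"))
assert (merged["name_tgt"].str.len() >= merged["name_src"].str.len()).all()

# 3. Uniqueness: no duplicate key values in the target.
assert target["id"].is_unique

# 4. Boundary rule from the example: only year >= 2008 may be loaded.
assert (target["year"] >= 2008).all()
print("Data completeness checks passed.")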
Data Quality:
1. Number check: if in the source the column values carry a prefix, such as xx_30, but the target should hold only 30, then we need to validate that the value is loaded without the prefix (xx_).
2. Date check: dates have to follow the agreed date format, and it should be the same across all records, for example a standard format such as yyyy-mm-dd.
3. Precision check: the precision value should display as expected in the target table.
Example: the source holds 19.123456, but in the target it should display as 19.123, or rounded off to 20, depending on the rule.
4. Data check: based on business logic, records that do not meet certain criteria should be filtered out.
Example: only records whose date_sid >= 2008 and GLAccount != CM001 should be loaded into the target table.
5. Null check: certain columns should contain null based on the business requirement.
Example: the Termination Date column should be null unless the Active Status column is T or Deceased. (A sketch of these checks follows below.)
Note: Data cleanness rules are decided during the design phase only.
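The sketch below walks through the five quality checks in pandas; the column names largely follow the examples above (date_sid, GLAccount, Termination Date, Active Status), while the data and exact thresholds are assumptions.

import pandas as pd

target = pd.DataFrame({
    "account_no":       ["30", "45"],            # should be loaded without the xx_ prefix
    "posting_date":     ["2012-01-05", "2012-02-07"],
    "amount":           [19.123, 20.0],
    "date_sid":         [2009, 2012],
    "gl_account":       ["GL100", "GL200"],
    "active_status":    ["A", "T"],
    "termination_date": [None, "2012-02-01"],
})

# 1. Number check: no source prefix (xx_) should survive into the target.
assert not target["account_no"].str.startswith("xx_").any()

# 2. Date check: every record follows the same yyyy-mm-dd format.
assert target["posting_date"].str.match(r"^\d{4}-\d{2}-\d{2}$").all()

# 3. Precision check: values carry at most three decimal places.
assert (target["amount"].round(3) == target["amount"]).all()

# 4. Data check: only rows meeting the business filter were loaded.
assert ((target["date_sid"] >= 2008) & (target["gl_account"] != "CM001")).all()

# 5. Null check: termination_date is null unless active_status is 'T' or 'Deceased'.
bad = target[target["termination_date"].notna() & ~target["active_status"].isin(["T", "Deceased"])]
assert bad.empty
print("Data quality checks passed.")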
Data cleanness:
Unnecessary columns should be deleted before loading into the staging area.
1. Example: If a name column carries extra spaces, we have to trim them, so before loading into the staging area the spaces are trimmed with the help of an expression transformation.
2. Example: Suppose the telephone number and STD code are in different columns and the requirement says they should be in one column; then with the help of an expression transformation we concatenate the values into one column (see the sketch below).
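In an Informatica mapping these would typically be expression transformations; the small pandas sketch below shows the equivalent trim and concatenation on an invented staging frame.

import pandas as pd

staging = pd.DataFrame({
    "name":      ["  John Smith ", "Jane Doe  "],
    "std_code":  ["044", "080"],
    "telephone": ["2345678", "9876543"],
})

# Example 1: trim leading/trailing spaces before loading into the staging area.
staging["name"] = staging["name"].str.strip()

# Example 2: concatenate STD code and telephone number into a single column.
staging["phone_full"] = staging["std_code"] + "-" + staging["telephone"]

print(staging)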
Data Transformation: All the business logic implemented using ETL transformations should be reflected correctly in the target data.
Integration testing:
Ensures that the ETL process functions well with other upstream and downstream processes.
Example:
1. Downstream: Suppose you are changing the precision of one of the transformation columns; let us assume EMPNO is a column with a data type of size 16. This data type precision should be the same in every transformation wherever the EMPNO column is used.
2. Upstream: If the source is SAP/BW, there will be ABAP code acting as the interface between SAP/BW and the mapping. To modify an existing mapping we have to re-generate the ABAP code in the ETL tool (Informatica); if we don't do this, wrong data will be extracted, since the ABAP code is not updated.
User-acceptance testing:
Ensures the solution meets the users' current expectations and anticipates their future expectations.
Example: make sure that no values are hardcoded in the code.
Regression testing:
Ensures existing functionality remains intact each time a new release of code is completed.
Conclusion:
Taking these considerations into account during the design and testing portions of building a data warehouse
will ensure that a quality product is produced and prevent costly mistakes from being discovered in production.
Data Integration Challenge: Capturing Changes
by Muneeswara C Pandian on July 13, 2007 in Business Intelligence: A Practitioner's View
When we receive the data from source systems, the data file will not carry a flag indicating whether the record provided is new or has changed. We would need to build a process to determine the changes and then push them to the target table.

There are two steps to it
1. Pull the incremental data from the source file or table
2. Process the pulled incremental data and determine the impact of it on the target
table as Insert or Update or Delete
Step 1: Pull the incremental data from the source file or table
If the source system has audit columns, like a date, then we can find the new records; otherwise we will not be able to identify the new records and have to consider the complete data.
For a source system file or table that has audit columns, we would follow the steps below:
1. While reading the source records for a day (session), find the maximum value of the date (audit field) and store it in a persistent variable or a temporary table
2. Use this persistent variable value as a filter the next day to pull the incremental data from the source table
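A minimal sketch of this watermark logic in Python with sqlite3 is shown below, assuming an etl_control table standing in for the persistent variable; the table and column names are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_orders (order_id INTEGER, updated_at TEXT);
CREATE TABLE etl_control (job_name TEXT PRIMARY KEY, last_audit_value TEXT);
INSERT INTO src_orders VALUES (1, '2012-03-01'), (2, '2012-03-02'), (3, '2012-03-03');
INSERT INTO etl_control VALUES ('orders_load', '2012-03-01');
""")

# Read the persisted watermark (the "persistent variable" described above).
(last_value,) = conn.execute(
    "SELECT last_audit_value FROM etl_control WHERE job_name = 'orders_load'").fetchone()

# Pull only the incremental records and note the new maximum audit value.
rows = conn.execute(
    "SELECT order_id, updated_at FROM src_orders WHERE updated_at > ?", (last_value,)).fetchall()
new_max = max((r[1] for r in rows), default=last_value)

# Persist the watermark for the next day's run.
conn.execute("UPDATE etl_control SET last_audit_value = ? WHERE job_name = 'orders_load'", (new_max,))
conn.commit()
print(f"Pulled {len(rows)} incremental rows; next run starts after {new_max}")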
Step 2: Determine the impact of the record on target table as Insert/Update/ Delete
Following are the scenarios that we would face and the suggested approach
1. Data file has only incremental data from Step 1 or the source itself provide only
incremental data
o do a lookup on the target table and determine whether it's a new record or an existing record
o if it's an existing record, then compare the required fields to determine whether it's an updated record
o have a process to find the aged records in the target table and do a clean-up for deletes
2. Data file has full, complete data because no audit columns are present
o The data is of higher volume
have a backup of the previously received file
perform a comparison of the current file and the prior file; create a change file by determining the inserts, updates and deletes. Ensure both the current and prior files are sorted by key fields
have a process that reads the change file and loads the data into the target table
based on the change file volume, we could decide whether to do a truncate & load
o The data is of lower volume
do a lookup on the target table and determine whether it's a new record or an existing record
if it's an existing record, then compare the required fields to determine whether it's an updated record
have a process to find the aged records in the target table and do a clean-up or delete

Data Integration Challenge: Storing Timestamps
by Muneeswara C Pandian on October 3, 2008 in Business Intelligence: A Practitioner's View
Storing timestamps along with a record, indicating its new arrival or a change in its value, is a must in a data warehouse. We tend to take this for granted, adding timestamp fields to table structures while missing that the amount of storage space a timestamp field occupies is large; in many databases, such as SQL Server and Oracle, a timestamp takes almost double the storage of an integer data type, and if we have two fields, one as an insert timestamp and the other as an update timestamp, the storage required doubles again. There are many instances where we could avoid using timestamps, especially when the timestamps are used primarily for determining incremental records or are stored just for audit purposes.
How to effectively manage the data storage and also leverage the benefit of a timestamp field?
One way of managing the storage of timestamp fields is by introducing a process id field and a process table. Following are the steps involved in applying this method to the table structures as well as to the ETL process.

Data Structure
1. Consider a table named PAYMENT with two fields of timestamp data type, INSERT_TIMESTAMP and UPDATE_TIMESTAMP, used for capturing the changes for every record present in the table
2. Create a table named PROCESS_TABLE with columns PROCESS_NAME
Char(25), PROCESS_ID Integer and PROCESS_TIMESTAMP Timestamp
3. Now drop the fields of the TIMESTAMP data type from table PAYMENT
4. Create two fields of integer data type in the table PAYMENT like
INSERT_PROCESS_ID and UPDATE_PROCESS_ID
5. These newly created id fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID
would be logically linked with the table PROCESS_TABLE through its field
PROCESS_ID
ETL Process

1. Let us consider an ETL process called Payment Process that loads data into the
table PAYMENT
2. Now create a pre-process which runs before the payment process; in the pre-process, build the logic by which a record is inserted with values like (payment process, SEQUENCE number, current timestamp) into the PROCESS_TABLE table. The PROCESS_ID in the PROCESS_TABLE table could be defined using a database sequence function.
3. Pass the newly generated PROCESS_ID of PROCESS_TABLE as current_process_id from the pre-process step to the payment process ETL process
4. In the payment process, if a record is to be inserted into the PAYMENT table, then the current_process_id value is set on both the columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID; if a record is getting updated in the PAYMENT table, then the current_process_id value is set only on the column UPDATE_PROCESS_ID
5. So now the timestamp values for the records inserted or updated in the table PAYMENT can be picked from the PROCESS_TABLE by joining its PROCESS_ID with the INSERT_PROCESS_ID and UPDATE_PROCESS_ID columns of the PAYMENT table
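Putting the pieces together, here is a hedged sqlite3 sketch of the PAYMENT / PROCESS_TABLE design and the join that recovers the timestamps; the extra payment columns and the sample values are assumptions.

import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PROCESS_TABLE (
    PROCESS_ID        INTEGER PRIMARY KEY AUTOINCREMENT,
    PROCESS_NAME      CHAR(25),
    PROCESS_TIMESTAMP TIMESTAMP
);
CREATE TABLE PAYMENT (
    PAYMENT_ID        INTEGER,
    AMOUNT            REAL,
    INSERT_PROCESS_ID INTEGER,
    UPDATE_PROCESS_ID INTEGER
);
""")

# Pre-process: register this run of the payment process and get its id.
cur = conn.execute("INSERT INTO PROCESS_TABLE (PROCESS_NAME, PROCESS_TIMESTAMP) VALUES (?, ?)",
                   ("payment process", datetime.now().isoformat(sep=" ")))
current_process_id = cur.lastrowid

# Payment process: a newly inserted record gets the run id in both columns.
conn.execute("INSERT INTO PAYMENT VALUES (?, ?, ?, ?)",
             (101, 250.0, current_process_id, current_process_id))
conn.commit()

# The effective insert/update timestamps are recovered by joining on the ids.
query = """
SELECT p.PAYMENT_ID, ins.PROCESS_TIMESTAMP AS inserted_at, upd.PROCESS_TIMESTAMP AS updated_at
FROM PAYMENT p
JOIN PROCESS_TABLE ins ON p.INSERT_PROCESS_ID = ins.PROCESS_ID
JOIN PROCESS_TABLE upd ON p.UPDATE_PROCESS_ID = upd.PROCESS_ID;
"""
for row in conn.execute(query):
    print(row)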
Benefits
The fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID occupy less space when compared to the
timestamp fields
Both the columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID are Index friendly
It's easier to handle these process id fields in terms of picking the records for determining incremental changes or for any audit reporting.

First Step in Knowing Your Data: Profile It
by Karthikeyan Sankaran on June 11, 2007 in Business Intelligence: A Practitioner's View
The Chief Data Officer (CDO), the protagonist who was introduced earlier on this blog, has the unenviable task of understanding the data that is within the organization's boundaries. Having categorized the data into 6 MECE sets (read the post dated May 29 on this blog), the data reconnaissance team starts its mission with the first step: profiling.

Data Profiling at the most fundamental level involves understanding of:
1) How is the data defined?
2) What is the range of values that the data element can take?
3) How is the data element related to others?
4) What is the frequency of occurrence of certain values, etc.
A slightly more sophisticated definition of Data Profiling would include analysis of data elements in terms of:
Basic statistics, frequencies, ranges and outliers
Numeric range analysis
Identify duplicate name and address and non-name and address information
Identify multiple spellings of the same content
Identify and validate redundant data and primary/foreign key relationships across data sources
Validate data specific business rules within a single record or across sources
Discover and validate data patterns and formats
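As a small illustration, the pandas sketch below produces a few of these profiling outputs (statistics, frequencies, null counts and a simple pattern check) for an invented customer table; every column name in it is an assumption.

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name":        ["Ann", "Bob", None, "Ann"],
    "zip_code":    ["600017", "56001", "600042", "60001A"],
    "balance":     [120.5, -30.0, 999999.0, 45.2],
})

profile = {
    # Basic statistics, ranges and potential outliers for the numeric column.
    "balance_stats": customers["balance"].describe(),
    # Frequency of occurrence of values (duplicate names show up here).
    "name_frequencies": customers["name"].value_counts(dropna=False),
    # Null / blank counts per column.
    "null_counts": customers.isna().sum(),
    # Pattern/format validation: a 6-digit zip code in this illustration.
    "bad_zip_codes": customers.loc[~customers["zip_code"].str.match(r"^\d{6}$"), "zip_code"],
}

for name, result in profile.items():
    print(f"--- {name} ---\n{result}\n")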
Armed with statistical information about critical data present in enterprise-wide systems, the CDO's team can devise specific strategies to improve the quality of data and hence improve the quality of information and business decisioning.

Collaborative Data Management: Need of the Hour!
by Satesh Kumar on September 28, 2012 in Business Intelligence: A Practitioner's View
Well, the topic may seem like a pretty old concept, yet it is a vital one in the age of Big Data, Mobile BI and the Hadoops! As per the FIMA 2012 benchmark report, Data Quality (DQ) still remains the topmost priority in data management strategy.

What gets measured improves! But often a Data Quality (DQ) initiative is a reactive strategy as opposed to a proactive one; consider the impact bad data could have in a financial reporting scenario: brand tarnishing, loss of investor confidence.
But are the business users aware of the DQ issue? A research report by The Data Warehousing Institute suggested that more than 80% of the business managers surveyed believed that the business data was fine, but just half of their technical counterparts agreed! Having recognized this disparity, it would be a good idea to match the dimensions of data quality against the business problems created by the lack of it.
Data Quality Dimensions: IT Perspective
Data Accuracy: the degree to which data reflects the real world
Data Completeness: inclusion of all relevant attributes of data
Data Consistency: uniformity of data across the enterprise
Data Timeliness: is the data up-to-date?
Data Auditability: is the data reliable?
Business Problems Due to Lack of Data Quality
Department / End-Users, their business challenges, and the data quality dimensions* involved:
Human Resources: The actual employee performance as reviewed by the manager is not in sync with the HR database; inaccurate employee classification based on government classification groups (minorities, differently abled). Data quality dimensions*: consistency, accuracy.
Marketing: Print and mailing costs associated with sending duplicate copies of promotional messages to the same customer/prospect, or sending them to the wrong address/email. Data quality dimension*: timeliness.
Customer Service: Extra call-support minutes due to incomplete data about the customer and poorly defined metadata for the knowledge base. Data quality dimension*: completeness.
Sales: Lost sales due to lack of proper customer purchase/contact information, which paralyses the organization from performing behavioral analytics. Data quality dimensions*: consistency, timeliness.
C Level: Reports that drive top management decision making are not in sync with the actual operational data; difficulty getting a 360° view of the enterprise. Data quality dimension*: consistency.
Cross Functional: Sales and financial reports are not in sync with each other, typically because of data silos. Data quality dimensions*: consistency, auditability.
Procurement: The procurement levels of commodities differ from the requirements of production, resulting in excess/insufficient inventory. Data quality dimensions*: consistency, accuracy.
Sales Channel: There are different representations of the same product across e-commerce sites, kiosks and stores, and the product names/codes in these channels differ from those in the warehouse system. This results in delays/wrong items being shipped to the customer. Data quality dimensions*: consistency, accuracy.
*Just a perspective, there could be other dimensions causing these issues too
As it is evident, data is not just an IT issue but a business issue too and requires a Collaborative Data
Management approach (including business and IT) towards ensuring quality data. The solution is multifold
starting from planning, execution and sustaining a data quality strategy. Aspects such as data profiling, MDM,
data governance are vital guards that helps to analyze data, get first-hand information on its quality and to
maintain its quality on an on-going basis.
Collaborative Data Management Approach

Key steps in Collaborative Data Management would be to:
Define and measure metrics for data with the business team
Assess existing data against the metrics; carry out a profiling exercise with the IT team
Implement data quality measures as a joint team
Enforce a data quality firewall (MDM) to ensure correct data enters the information ecosystem, as a governance process
Institute Data Governance and Stewardship programs to make data quality a routine and stable practice at a strategic level
This approach would ensure that the data ecosystem within a company is distilled, as it involves business and IT users from each department at all levels of the hierarchy.
Thanks for reading, would appreciate your thoughts.

A Proactive Approach to Building an Effective Data Warehouse
by Gnana Krishnan on February 4, 2013 in Business Intelligence: A Practitioner's View
"We can't solve problems by using the same kind of thinking we used when we created them." This famous quote attributed to Albert Einstein applies as much to Business Intelligence & Analytics as it does to other
things. Many organizations that turn to BI&A for help on strategic business concerns such as increasing
customer churn, drop in quality levels, missed revenue opportunities face disappointment. One of the important
reasons for this is that the data that can provide such insights is just not there. For example, to understand the
poor sales performance in a particular region during a year, it will not just help to have data about our sales plan, activities, opportunities, conversions and sales achieved / missed; it will also require an understanding of other disruptive forces such as competitors' promotions, changes in customer preferences, and new entrants or alternatives.
Thomas Davenport, a household name in the BI&A community, in his book Analytics at Work, explains the
analytical DELTA (Data, Enterprise, Leadership, Targets and Analysts), a framework that organizations could
adopt to implement analytics effectively for better business decisions and results. He emphasizes that besides
the necessity of having clean, integrated and enterprise-wide data in a warehouse, it is also important that the
data enables the organization to measure something new and important.
Now, measuring something new and important cannot just be arbitrary. It needs to be in line with the organizational strategy so that the measurement will have an impact on strategic decision-making. A proactive approach to data warehousing must then include such measurements and identify the necessary datasets that enable the measurement. For instance, an important element of a company's strategy to keep its costs down could be to standardize on a selected few suppliers. To identify the right suppliers and make this consolidation work, it is important to analyze procurement history, which under normal circumstances might be treated as throw-away operational Accounts Payable data whose value expires once paid. It is even possible that an organization does not currently have, or have access to, the necessary data, but this knowledge is essential to guide the efforts and initiatives of data warehousing.
To summarize, building an effective data warehouse requires a proactive approach. A proactive approach
essentially implies that the organization makes a conscious effort to understand the business imperatives for
the data warehouse; identify new metrics that best represent the objectives and proactively seek the data that
is necessary to support the metrics. This approach can produce radically different results compared to the
reactive approach of analyzing the data that is routinely available.
