
Bill Inmon defined a data warehouse as follows:

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source
A and source B may have different ways of identifying a product, but in a data warehouse, there
will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data that is 3 months, 6 months, or 12 months old, or even older, from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.

Ralph Kimball provided a more concise definition of a data warehouse:

A data warehouse is a copy of transaction data specifically structured for query and
analysis.

This is a functional view of a data warehouse. Unlike Inmon, Kimball did not address how the data warehouse is built; rather, he focused on the functionality of a data warehouse.

Data Warehouse Architecture

Different data warehousing systems have different structures. Some may have an ODS
(operational data store), while some may have multiple data marts. Some may have a small
number of data sources, while some may have dozens of data sources. In view of this, it is far
more reasonable to present the different layers of a data warehouse architecture rather than
discussing the specifics of any one system.

In general, all data warehouse systems have the following layers:

Data Source Layer


Data Extraction Layer
Staging Area
ETL Layer
Data Storage Layer
Data Logic Layer
Data Presentation Layer
Metadata Layer
System Operations Layer

The picture below shows the relationships among the different components of the data
warehouse architecture:

Each component is discussed individually below:

Data Source Layer

This represents the different data sources that feed data into the data warehouse. The data source can be of any format: a plain text file, a relational database, another type of database, or an Excel file can all act as a data source.

Many different types of data can be a data source:

Operations -- such as sales data, HR data, product data, inventory data, marketing data, systems data.
Web server logs with user browsing data.
Internal market research data.
Third-party data, such as census data, demographics data, or survey data.

All these data sources together form the Data Source Layer.

Data Extraction Layer


Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing, but it is unlikely that any major data transformation occurs.

Staging Area

This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having
one common area makes it easier for subsequent data processing / integration.

ETL Layer

This is where data gains its "intelligence", as logic is applied to transform the data from a transactional
nature to an analytical nature. This layer is also where data cleansing happens.

Data Storage Layer

This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of entities
can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you
may have just one of the three, two of the three, or all three types.

Data Logic Layer

This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but they do affect what the report looks like.

Data Presentation Layer

This refers to the information that reaches the users. It can take the form of a tabular or graphical report in a browser, an emailed report that is automatically generated and sent every day, or an alert that warns users of exceptions, among others.

Metadata Layer

This is where information about the data stored in the data warehouse system is stored. A logical data
model would be an example of something that's in the metadata layer.

System Operations Layer

This layer includes information on how the data warehouse system operates, such as ETL job status,
system performance, and user access history.

Data Warehouse Design

After the tools and team personnel selections are made, the data warehouse design can begin. The following are the typical steps involved in the data warehousing project cycle.

Requirement Gathering
Physical Environment Setup

Data Modeling

ETL

OLAP Cube Design

Front End Development

Report Development

Performance Tuning

Query Optimization

Quality Assurance

Rolling out to Production

Production Maintenance

Incremental Enhancements

Requirement Gathering

Task Description

The first thing that the project team should engage in is gathering requirements from end users.
Because end users are typically not familiar with the data warehousing process or concept, the
help of the business sponsor is essential. Requirement gathering can happen as one-to-one
meetings or as Joint Application Development (JAD) sessions, where multiple people are talking
about the project scope in the same meeting.

The primary goal of this phase is to identify what constitutes success for this particular phase of the data warehouse project. In particular, end user reporting / analysis requirements are identified, and the project team will spend the remaining time trying to satisfy these requirements.

Associated with the identification of user requirements is a more concrete definition of other
details such as hardware sizing information, training requirements, data source identification,
and most importantly, a concrete project plan indicating the finishing date of the data
warehousing project.
Based on the information gathered above, a disaster recovery plan needs to be developed so
that the data warehousing system can recover from accidents that disable the system. Without
an effective backup and restore strategy, the system will only last until the first major disaster,
and, as many data warehousing DBAs will attest, this can happen very quickly after the project
goes live.

Time Requirement

2 - 8 weeks.

Deliverables

A list of reports / cubes to be delivered to the end users by the end of this current phase.
An updated project plan that clearly identifies resource loads and milestone delivery dates.

Possible Pitfalls

This phase often turns out to be the trickiest phase of the data warehousing implementation. Because data warehousing by definition includes data from multiple sources spanning many different departments within the enterprise, there are often political battles that center on the willingness to share information. Even though a successful data warehouse benefits the enterprise, there are occasions where departments may not feel the same way. As a result of the unwillingness of certain groups to release data or to participate in the data warehousing requirement definition, the data warehouse effort either never gets off the ground or cannot proceed in the direction originally defined.

When this happens, it would be ideal to have a strong business sponsor. If the sponsor is at the
CXO level, she can often exert enough influence to make sure everyone cooperates.

Physical Environment Setup

Task Description

Once the requirements are somewhat clear, it is necessary to set up the physical servers and
databases. At a minimum, it is necessary to set up a development environment and a
production environment. There are also many data warehousing projects where there are three
environments: Development, Testing, and Production.

It is not enough to simply have different physical environments set up. The different processes
(such as ETL, OLAP Cube, and reporting) also need to be set up properly for each environment.
It is best for the different environments to use distinct application and database servers. In other
words, the development environment will have its own application server and database servers,
and the production environment will have its own set of application and database servers.

Having different environments is very important for the following reasons:

All changes can be tested and QA'd first without affecting the production environment.
Development and QA can occur during the time users are accessing the data
warehouse.

When there is any question about the data, having separate environment(s) will allow the
data warehousing team to examine the data without impacting the production
environment.

Time Requirement

Getting the servers and databases ready should take less than 1 week.

Deliverables

Hardware / Software setup document for all of the environments, including hardware
specifications, and scripts / settings for the software.

Possible Pitfalls

To save on capital, often data warehousing teams will decide to use only a single database and
a single server for the different environments. Environment separation is achieved by either a
directory structure or setting up distinct instances of the database. This is problematic for the
following reasons:

1. Sometimes the server needs to be rebooted for the development environment. Having a separate development environment will prevent the production environment from being impacted by this.

2. There may be interference when having different database environments on a single box. For
example, having multiple long queries running on the development database could affect the
performance on the production database.

Data Modeling

Task Description
This is a very important step in the data warehousing project. Indeed, it is fair to say that the
foundation of the data warehousing system is the data model. A good data model will allow the
data warehousing system to grow easily, as well as allowing for good performance.

In a data warehousing project, the logical data model is built based on user requirements, and then it is translated into the physical data model. The detailed steps can be found in the Conceptual, Logical, and Physical Data Modeling section.

Part of the data modeling exercise is often the identification of data sources. Sometimes this step is deferred until the ETL step. However, my feeling is that it is better to find out where the data exists, or, better yet, whether it even exists anywhere in the enterprise at all. Should the data not be available, this is a good time to raise the alarm. If this is delayed until the ETL phase, rectifying it becomes a much tougher and more complex process.

Time Requirement

2 - 6 weeks.

Deliverables

Identification of data sources.


Logical data model.

Physical data model.

Possible Pitfalls

It is essential to have a subject-matter expert as part of the data modeling team. This person
can be an outside consultant or can be someone in-house who has extensive experience in the
industry. Without this person, it becomes difficult to get a definitive answer on many of the
questions, and the entire project gets dragged out.

ETL

Task Description
The ETL (Extraction, Transformation, Loading) process typically takes the longest to develop,
and this can easily take up to 50% of the data warehouse implementation cycle or longer. The
reason for this is that it takes time to get the source data, understand the necessary columns,
understand the business rules, and understand the logical and physical data models.

Time Requirement

1 - 6 weeks.

Deliverables

Data Mapping Document


ETL Script / ETL Package in the ETL tool

Possible Pitfalls

There is a tendency to give this particular phase too little development time. This can prove fatal to the project, because end users will usually tolerate less formatting, longer times to run reports, less functionality (slicing and dicing), or fewer delivered reports; the one thing they will not tolerate is wrong information.

A second common problem is that some people make the ETL process more complicated than necessary. In ETL design, the primary goal should be to optimize load speed without sacrificing quality. This is, however, sometimes not followed. There are cases where the design goal is
to cover all possible future uses, whether they are practical or just a figment of someone's
imagination. When this happens, ETL performance suffers, and often so does the performance
of the entire data warehousing system.

OLAP Cube Design

Task Description

Usually the design of the OLAP cube can be derived from the Requirement Gathering phase. More often than not, however, users have some idea of what they want, but it is difficult for them to specify the exact report / analysis they want to see. When this is the case, it is usually a good idea to include enough information so that they feel like they have gained something through the data warehouse, but not so much that it stretches the data warehouse scope by a mile.
Remember that data warehousing is an iterative process - no one can ever meet all the
requirements all at once.

Time Requirement

1 - 2 weeks.
Deliverables

Documentation specifying the OLAP cube dimensions and measures.


Actual OLAP cube / report.

Possible Pitfalls

Make sure your OLAP cube-building process is optimized. It is common for the data warehouse to be at the bottom of the nightly batch load, and after the loading of the data warehouse, there usually isn't much time remaining for the OLAP cube to be refreshed. As a result, it is worthwhile to experiment with the OLAP cube generation paths to ensure optimal performance.

Front End Development

Task Description

Regardless of the strength of the OLAP engine and the integrity of the data, if the users cannot
visualize the reports, the data warehouse brings zero value to them. Hence front end
development is an important part of a data warehousing initiative.

So what are the things to look out for in selecting a front-end deployment methodology? The most important is that the reports should be delivered over the web, so that the only thing the user needs is a standard browser. These days it is neither desirable nor feasible to have the IT department installing programs on end users' desktops just so that they can view reports. So, whatever strategy one pursues, make sure the ability to deliver over the web is a must.

The front-end options range from internal front-end development using scripting languages such as ASP, PHP, or Perl, to off-the-shelf products such as Seagate Crystal Reports, to higher-end products such as Actuate. In addition, many OLAP vendors offer front-ends of their own. When choosing a vendor tool, make sure it can be easily customized to suit the enterprise, especially with respect to possible changes to the reporting requirements of the enterprise. Possible changes include not just differences in report layout and report content, but also possible changes in the back-end structure. For example, if the enterprise decides to change from Solaris/Oracle to Windows 2000/SQL Server, will the front-end tool be flexible enough to adjust to the changes without much modification?

Another area to be concerned with is the complexity of the reporting tool. For example, do the
reports need to be published on a regular interval? Are there very specific formatting
requirements? Is there a need for a GUI interface so that each user can customize her reports?
Time Requirement

1 - 4 weeks.

Deliverables

Front End Deployment Documentation

Possible Pitfalls

Just remember that the end users do not care how complex or how technologically advanced your front end infrastructure is. All they care about is that they receive their information in a timely manner and in the way they specified.

Report Development

Task Description

Report specification typically comes directly from the requirements phase. To the end user, the only direct touchpoint he or she has with the data warehousing system is the reports they see. So report development, although not as time consuming as some of the other steps such as ETL and data modeling, nevertheless plays a very important role in determining the success of the data warehousing project.

One would think that report development is an easy task. How hard can it be to just follow instructions to build the report? Unfortunately, this is not true. There are several points the data warehousing team needs to pay attention to before releasing a report.

User customization: Do users need to be able to select their own metrics? And how do users
need to be able to filter the information? The report development process needs to take those
factors into consideration so that users can get the information they need in the shortest amount
of time possible.

Report delivery: What report delivery methods are needed? In addition to delivering the report to the web front end, other possibilities include delivery via email, via text messaging, or in some form of spreadsheet. There are reporting solutions in the marketplace that support report delivery as a flash file. Such a flash file essentially acts as a mini-cube, allowing end users to slice and dice the data on the report without having to pull data from an external source.

Access privileges: Special attention needs to be paid to who has what access to what
information. A sales report can show 8 metrics covering the entire company to the company
CEO, while the same report may only show 5 of the metrics covering only a single district to a
District Sales Director.
Report development does not happen only during the implementation phase. After the system
goes into production, there will certainly be requests for additional reports. These types of
requests generally fall into two broad categories:

1. Data is already available in the data warehouse. In this case, it should be fairly
straightforward to develop the new report into the front end. There is no need to wait for a major
production push before making new reports available.

2. Data is not yet available in the data warehouse. This means that the request needs to be
prioritized and put into a future data warehousing development cycle.

Time Requirement

1 - 2 weeks.

Deliverables

Report Specification Documentation.


Reports set up in the front end / reports delivered to user's preferred channel.

Possible Pitfalls

Make sure the exact definitions of the report are communicated to the users. Otherwise, user interpretation of the report can be erroneous.

Performance Tuning

Task Description

There are three major areas where a data warehousing system can use a little performance
tuning:

ETL - Given that the data load is usually a very time-consuming process (and hence is typically relegated to a nightly load job) and that data warehousing-related batch jobs are typically of lower priority, the window for data loading is not very long. A data warehousing system whose ETL process finishes only just on time is going to have a lot of problems, simply because the jobs often do not get started on time due to factors beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.
Query Processing - Sometimes, especially in a ROLAP environment or in a system where the reports are run directly against the relational database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. My experience has been that ROLAP reports or reports that run directly against the RDBMS often exceed this time limit, and it is hence ideal for the data warehousing team to invest some time to tune the queries, especially the most popular ones. We present a number of query optimization ideas.

Report Delivery - It is also possible that end users are experiencing significant delays in
receiving their reports due to factors other than the query performance. For example,
network traffic, server setup, and even the way that the front-end was built sometimes
play significant roles. It is important for the data warehouse team to look into these areas
for performance tuning.

Time Requirement

3 - 5 days.

Deliverables

Performance tuning document - Goal and Result

Possible Pitfalls

Make sure the development environment mimics the production environment as much as
possible - Performance enhancements seen on less powerful machines sometimes do not
materialize on the larger, production-level machines.

Query Optimization

For any production database, SQL query performance becomes an issue sooner or later. Long-running queries not only consume system resources that make the server and application run slowly, but may also lead to table locking and data corruption issues. So, query optimization becomes an important task.

First, we offer some guiding principles for query optimization:

1. Understand how your database is executing your query

Nowadays all databases have their own query optimizer and offer a way for users to understand how a query is executed. For example, which index from which table is being used to execute the query? The first step to query optimization is understanding what the database is doing. Different databases have different commands for this. For example, in MySQL, one can use "EXPLAIN [SQL Query]" to see the query plan. In Oracle, one can use "EXPLAIN PLAN FOR [SQL Query]" to see the query plan.
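As an illustration (a minimal sketch; the sales_fact table and its columns are hypothetical), the same plan inspection might look like this in MySQL and Oracle:

    -- MySQL: prefix the query with EXPLAIN to see the query plan.
    EXPLAIN
    SELECT product_key, SUM(sales_amount)
    FROM sales_fact
    WHERE date_key BETWEEN 20030101 AND 20030131
    GROUP BY product_key;

    -- Oracle: EXPLAIN PLAN FOR writes the plan to PLAN_TABLE, which is then displayed.
    EXPLAIN PLAN FOR
    SELECT product_key, SUM(sales_amount)
    FROM sales_fact
    WHERE date_key BETWEEN 20030101 AND 20030131
    GROUP BY product_key;

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);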

2. Retrieve as little data as possible

The more data returned from the query, the more resources the database needs to expend to process and store that data. So, for example, if you only need to retrieve one column from a table, do not use 'SELECT *'.
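For example (a minimal sketch with hypothetical table and column names):

    -- Avoid: returns every column in the table.
    SELECT * FROM customer_dim WHERE customer_key = 1001;

    -- Prefer: returns only the column the report actually needs.
    SELECT customer_name FROM customer_dim WHERE customer_key = 1001;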

3. Store intermediate results

Sometimes logic for a query can be quite complex. Often, it is possible to achieve the desired
result through the use of subqueries, inline views, and UNION-type statements. For those
cases, the intermediate results are not stored in the database, but are immediately used within
the query. This can lead to performance issues, especially when the intermediate results have a
large number of rows.

The way to increase query performance in those cases is to store the intermediate results in a
temporary table, and break up the initial SQL statement into several SQL statements. In many
cases, you can even build an index on the temporary table to speed up the query performance
even more. Granted, this adds a little complexity in query management (i.e., the need to
manage temporary tables), but the speedup in query performance is often worth the trouble.
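A minimal sketch of this approach, assuming hypothetical sales_fact and product_dim tables (temporary-table syntax varies slightly by database):

    -- Step 1: store the intermediate result in a temporary table.
    CREATE TEMPORARY TABLE tmp_monthly_sales AS
    SELECT product_key, SUM(sales_amount) AS monthly_sales
    FROM sales_fact
    WHERE date_key BETWEEN 20030101 AND 20030131
    GROUP BY product_key;

    -- Step 2: optionally index the temporary table to speed up the join below.
    CREATE INDEX idx_tmp_monthly_sales ON tmp_monthly_sales (product_key);

    -- Step 3: the final query now joins against a much smaller table.
    SELECT p.product_name, t.monthly_sales
    FROM tmp_monthly_sales t
    JOIN product_dim p ON p.product_key = t.product_key
    WHERE t.monthly_sales > 10000;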

Below are several specific query optimization strategies.

Use Index
Using an index is the first strategy one should use to speed up a query. In fact, this strategy is so important that index optimization is discussed separately.
Aggregate Table
Pre-populate tables at higher levels of summarization so that less data needs to be parsed (see the sketch of an aggregate table after this list).

Vertical Partitioning
Partition the table by columns. This strategy decreases the amount of data a SQL query
needs to process.

Horizontal Partitioning
Partition the table by data value, most often time. This strategy decreases the amount of
data a SQL query needs to process.

Denormalization
The process of denormalization combines multiple tables into a single table. This speeds
up query performance because fewer table joins are needed.
Server Tuning
Each server has its own parameters, and often tuning server parameters so that it can
fully take advantage of the hardware resources can significantly speed up query
performance.
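As an illustration of the Aggregate Table strategy above (a sketch only; table and column names are hypothetical), daily sales can be pre-summarized to the monthly level so that reports scan far fewer rows:

    -- Build a monthly summary from a daily-grain fact table.
    CREATE TABLE sales_fact_monthly AS
    SELECT product_key,
           store_key,
           EXTRACT(YEAR FROM sales_date)  AS sales_year,
           EXTRACT(MONTH FROM sales_date) AS sales_month,
           SUM(sales_amount)              AS sales_amount
    FROM sales_fact_daily
    GROUP BY product_key,
             store_key,
             EXTRACT(YEAR FROM sales_date),
             EXTRACT(MONTH FROM sales_date);

    -- Monthly reports then query the much smaller aggregate table.
    SELECT sales_year, sales_month, SUM(sales_amount)
    FROM sales_fact_monthly
    GROUP BY sales_year, sales_month;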

Quality Assurance

Task Description

Once the development team declares that everything is ready for further testing, the QA team
takes over. The QA team is always from the client. Usually the QA team members will know little
about data warehousing, and some of them may even resent the need to have to learn another
tool or tools. This makes the QA process a tricky one.

Sometimes the QA process is overlooked. On my very first data warehousing project, the project team worked very hard to get everything ready for Phase 1, and everyone thought that we had met the deadline. There was one mistake, though: the project managers failed to recognize that it is necessary to go through the client QA process before the project can go into production. As a result, it took five extra months to bring the project to production (the original development time had been only 2 1/2 months).

Time Requirement

1 - 4 weeks.

Deliverables

QA Test Plan
QA verification that the data warehousing system is ready to go to production

Possible Pitfalls

As mentioned above, usually the QA team members know little about data warehousing, and
some of them may even resent the need to have to learn another tool or tools. Make sure the
QA team members get enough education so that they can complete the testing themselves.

Rollout To Production

Task Description
Once the QA team gives the thumbs up, it is time for the data warehouse system to go live. Some may think this is as easy as flipping a switch, but usually that is not the case. Depending on the number of end users, it sometimes takes up to a full week to bring everyone online! Fortunately, nowadays most end users access the data warehouse over the web, making going into production sometimes as easy as sending out a URL via email.

Time Requirement

1 - 3 days.

Deliverables

Delivery of the data warehousing system to the end users.

Possible Pitfalls

Take care to address the user education needs. There is nothing more frustrating than spending several months developing and QA'ing the data warehousing system, only to see little usage because the users are not properly trained. Regardless of how intuitive or easy the interface may be, it is always a good idea to send the users to at least a one-day course to let them understand what they can achieve by properly using the data warehouse.

Production Maintenance

Task Description

Once the data warehouse goes into production, it needs to be maintained. Tasks such as regular backups and crisis management become important and should be planned out. In addition, it is very important to consistently monitor end user usage. This serves two purposes: 1. To capture any runaway requests so that they can be fixed before slowing the entire system down, and 2. To understand how much users are utilizing the data warehouse for return-on-investment calculations and future enhancement considerations.

Time Requirement

Ongoing.

Deliverables

Consistent availability of the data warehousing system to the end users.

Possible Pitfalls
Usually by this time most, if not all, of the developers will have left the project, so it is essential that proper documentation is left for those who are handling production maintenance. There is nothing more frustrating than staring at something another person did yet being unable to figure it out due to the lack of proper documentation.

Another pitfall is that the maintenance phase is usually boring. So, if there is another phase of
the data warehouse planned, start on that as soon as possible.

Incremental Enhancements

Task Description

Once the data warehousing system goes live, there are often needs for incremental enhancements. I am not talking about new data warehousing phases, but simply small changes that follow the business itself. For example, the original geographical designations may change: the company may originally have had 4 sales regions, but because sales are going so well, it now has 10 sales regions.

Deliverables

Change management documentation


Actual change to the data warehousing system

Possible Pitfalls

Because a lot of times the changes are simple to make, it is very tempting to just go ahead and
make the change in production. This is a definite no-no. Many unexpected problems will pop up
if this is done. I would very strongly recommend that the typical cycle of development --> QA -->
Production be followed, regardless of how simple the change may seem.

Observations

This section lists the trends I have seen based on my experience in the data warehousing field:

Quick implementation time

Lack of collaboration with data mining efforts

Industry consolidation

How to measure success

Recipes for data warehousing project failure


Quick Implementation Time

If you add up the total time required to complete the tasks from Requirement
Gathering to Rollout to Production, you'll find it takes about 9 - 29 weeks to complete each
phase of the data warehousing efforts. The 9 weeks may sound too quick, but I have been
personally involved in a turnkey data warehousing implementation that took 40 business days,
so that is entirely possible. Furthermore, some of the tasks may proceed in parallel, so as a rule
of thumb it is reasonable to say that it generally takes 2 - 6 months for each phase of the data
warehousing implementation.

Why is this important? The main reason is that in today's business world, the business
environment changes quickly, which means that what is important now may not be important 6
months from now. For example, even the traditionally static financial industry is coming up with
new products and new ways to generate revenue at a rapid pace. Therefore, a time-consuming
data warehousing effort will very likely become obsolete by the time it is in production. It is best
to finish a project quickly. The focus on quick delivery time does mean, however, that the scope
for each phase of the data warehousing project will necessarily be limited. In this case, the 80-
20 rule applies, and our goal is to do the 20% of the work that will satisfy 80% of the user needs.
The rest can come later.

Lack Of Collaboration With Data Mining Efforts

Usually data mining is viewed as the final manifestation of the data warehouse. The ideal is that, with information from all over the enterprise conformed and stored in a central location, data mining techniques can be applied to find relationships that are otherwise impossible to find. Unfortunately, this has not quite happened, due to the following reasons:

1. Few enterprises have an enterprise data warehouse infrastructure. In fact, currently they are
more likely to have isolated data marts. At the data mart level, it is difficult to come up with
relationships that cannot be answered by a good OLAP tool.

2. The ROI for data mining companies is inherently lower because by definition, data mining will
only be performed by a few users (generally no more than 5) in the entire enterprise. As a result,
it is hard to charge a lot of money due to the low number of users. In addition, developing data
mining algorithms is an inherently complex process and requires a lot of up front investment.
Finally, it is difficult for the vendor to put a value proposition in front of the client because
quantifying the returns on a data mining project is next to impossible.

This is not to say, however, that data mining is not being utilized by enterprises. In fact, many
enterprises have made excellent discoveries using data mining techniques. What I am saying,
though, is that data mining is typically not associated with a data warehousing initiative. It seems
like successful data mining projects are usually stand-alone projects.

Industry Consolidation

In the last several years, we have seen rapid industry consolidation, as the weaker competitors
are gobbled up by stronger players. The most significant transactions are below (note that the
dollar amount quoted is the value of the deal when initially announced):

IBM purchased Cognos for $5 billion in 2007.


SAP purchased Business Objects for $6.8 billion in 2007.

Oracle purchased Hyperion for $3.3 billion in 2007.

Business Objects (OLAP/ETL) purchased FirstLogic (data cleansing) for $69 million in
2006.

Informatica (ETL) purchased Similarity Systems (data cleansing) for $55 million in 2006.

IBM (database) purchased Ascential Software (ETL) for $1.1 billion in cash in 2005.

Business Objects (OLAP) purchased Crystal Decisions (Reporting) for $820 million in
2003.

Hyperion (OLAP) purchased Brio (OLAP) for $142 million in 2003.

GEAC (ERP) purchased Comshare (OLAP) for $52 million in 2003.

For the majority of the deals, the purchase represents an effort by the buyer to expand into other
areas of data warehousing (Hyperion's purchase of Brio also falls into this category because,
even though both are OLAP vendors, their product lines do not overlap). This clearly shows
vendors' strong push to be the one-stop shop, from reporting to OLAP to ETL.

There are two levels of one-stop shop. The first level is at the corporate level. In this case, the
vendor is essentially still selling two entirely separate products. But instead of dealing with two
sets of sales and technology support groups, the customers only interact with one such group.
The second level is at the product level. In this case, different products are integrated. In data
warehousing, this essentially means that they share the same metadata layer. This is actually a
rather difficult task, and therefore not commonly accomplished. When there is metadata
integration, the customers not only get the benefit of only having to deal with one vendor instead
of two (or more), but the customer will be using a single product, rather than multiple products.
This is where the real value of industry consolidation is shown.

How To Measure Success

Given the significant amount of resources usually invested in a data warehousing project, a very
important question is how success can be measured. This is a question that many project
managers do not think about, and for good reason: Many project managers are brought in to
build the data warehousing system, and then turn it over to in-house staff for ongoing
maintenance. The job of the project manager is to build the system, not to justify its existence.

Just because this is often not done does not mean this is not important. Just like a data
warehousing system aims to measure the pulse of the company, the success of the data
warehousing system itself needs to be measured. Without some type of measure on the return
on investment (ROI), how does the company know whether it made the right choice? Whether it
should continue with the data warehousing investment?

There are a number of papers out there that provide formulas for calculating the return on a data warehousing investment. Some of the calculations become quite cumbersome, with a number of assumptions and even more variables. Although they are all valid methods, I believe the success of the data warehousing system can simply be measured by looking at one criterion:

How often the system is being used.

If the system is satisfying user needs, users will naturally use the system. If not, users will abandon the system, and a data warehousing system with no users is actually a detriment to the company (since resources that could be deployed elsewhere are required to maintain the system). Therefore, it is very important to have a tracking mechanism to figure out how much the users are accessing the data warehouse. This should not be a problem if third-party reporting/OLAP tools are used, since they all contain this component. If the reporting tool is built from scratch, this feature needs to be included in the tool. Once the system goes into production, the data warehousing team needs to periodically check to make sure users are using the system. If usage starts to dip, find out why and address the reason as soon as possible. Is the data quality lacking? Are the reports not satisfying current needs? Is the response time slow? Whatever the reason, take steps to address it as soon as possible, so that the data warehousing system serves its purpose successfully.

Recipes For Failure

This section describes 8 situations where the data warehousing effort is destined to fail, often
despite the best of intentions.

1. Focusing On Ideology Rather Than Practicality

There are many good textbooks on data warehousing out there, and many schools are offering
data warehousing classes. Having read the textbooks or completed a course, however, does not
make a person a data warehousing guru.

For example, I have seen someone insist on enforcing foreign key rules between dimension tables and fact tables, starting from the first day of development. This is not prudent for several reasons: 1) The development environment by its very nature means a lot of playing with data -- many updates, deletes, and inserts. Having the foreign key constraint only makes the development effort take longer than necessary. 2) This slows the ETL load time. 3) This constraint is useless because when data gets loaded into the fact table, we already have to go to the dimension table to get the proper foreign key, thus already accomplishing what a foreign key constraint would accomplish.

2. Making The Process Unnecessarily Complicated

Data warehousing is inherently a complex enough project, and there is no need to make it even more complex.

Here is an example: The source file comes in with one fact. The person responsible for project management insists that this fact be broken into several different metrics during ETL. The idea sounds reasonable: to cut down on the number of rows in the fact table so that the front-end tool can generate reports more quickly. Unfortunately, there are several problems with this approach: First, the ETL became unnecessarily complex. Not only are "case"-type statements now needed in ETL, but because the fact cannot always be broken down into the corresponding metrics cleanly due to inconsistencies in the data, it became necessary to create a lot of logic to take care of the exceptions. Second, it is never advisable to design the data model and the ETL process based on what suits the front-end tool the most. Third, at the end of the day, the reports ended up having to sum these separate metrics back together to get what the users were truly after, meaning that all the extra work was for naught.

3. Lack of Clear Ownership


Because data warehousing projects typically touch upon many different departments, it is natural that the project involves multiple teams. For the project to be successful, though, there must be clear ownership of the project -- not ownership of different components of the project, but of the project itself. I have seen cases where multiple groups each own a portion of the project. Needless to say, these projects never got finished as quickly as they should have, tended to underdeliver, and had inflexible infrastructure (as each group would do what is best for the group, not for the whole project). Such an arrangement is tailor-made for finger-pointing: if something is wrong, it's always another group's fault. At the end of the day, nobody is responsible for anything, and it's no wonder the project is full of problems. Making sure one person / one group is fully accountable for the success of the data warehousing project is paramount in ensuring a successful project.

4. Not Understanding Proper Protocol

Whether you are working as a consultant or an internal resource, you need to understand the organization's protocol in order to build a successful data warehouse / data mart.

I have been on a project where the team thought all the development was done, tested, documented, migrated to the production system, and ready to deliver by the deadline, and was ready to celebrate with the bonus money the client had promised for an on-time delivery. One thing that was missed, though, was that the client always requires any production system to go through its QA group first. The project manager in this case did not know that. Hence, rather than delivering a project on time and within budget, the project had to be delayed for an additional four months before it could go online, all because project management was not familiar with the organization's protocol.

5. Not Fully Understanding Project Impact Before The Project Starts

Here, I am talking about cases where the project impact turns out to be much smaller than anticipated. I have seen data mart efforts where a significant amount of resources was thrown into the project, and at the completion of the project, there were only two users. This is clearly a case where someone did not make the proper call, as these resources clearly could have been better utilized in different projects.

6. Trying To Bite Off More Than You Can Chew

This means that the project attempts to accomplish something more grandiose than it is supposed to. There are two examples below:

There are data warehousing projects that attempt to control the entire project -- even to the point
of dictating how the source system should be built to capture data, and exactly how data should
be captured. While the idea is noble -- often during a project, we find that the source system
data has a lot of problems, and hence it makes sense to make sure the source system is built
right -- in reality this is not practical. First, source systems are built the way they are for specific
reasons -- and data analysis should only be one of the concerns, not the only concern. In
addition, this will lead to a data warehousing system that is pretty in theory, but very inflexible in
reality.

In the same vein, I have seen data mart efforts where the project owner attempted to push his own ideas to the rest of the company, and that person instructed his team to build the system in a way that could accommodate that possibility. Of course, what really happened is that no one else ended up adopting his ideas, and much time and effort were wasted.

7. Blindly Sticking To Certain Standards

I have seen cases where a concerted effort is put on ensuring that different data marts employ
the same infrastructure, from the tools used (for example, a certain ETL tool must be used for
doing ETL, regardless of how simple that ETL process is) to the user experience (for example,
users must be able to access the same set of report selection criteria).

This is an absurd way of building data marts. The very reason that different data marts exist is
because there are differences among them, so insisting on making sure they all conform to a
certain standard is an exercise in futility. I have seen ETL tools blindly placed on ETL processes
that require only a series of SQL statements.

As far as the front end goes, that makes even less sense. First of all, different projects, even though they may be very similar, are still different; otherwise they would belong to the same project. Furthermore, users really do not care whether their views into different data marts have exactly the same look and feel. What they care about is whether the data is there on time, and whether the numbers are dependable.

8. Bad Project Management

Bad project management can manifest itself in several ways, and some of the examples listed previously illustrate the danger of bad project management. In short, it is safe to say a bad project manager will certainly doom a project.

For data warehousing projects, the key is experience, especially hands-on experience. This is
not a job for someone who just completed his or her MBA program, or someone who has only
read through all the data warehousing books, but has had no practical experience.
Data Warehousing Concepts

Dimensional Data Model

The dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) systems. As you can imagine, the same data would be stored differently in a dimensional model than in a 3rd normal form model.

To understand dimensional data modeling, let's define some of the terms commonly used in this
type of modeling:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time
Dimension.

Hierarchy: The specification of levels that represents the relationships between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales
amount would be such a measure. This measure is stored in the fact table with the appropriate
granularity. For example, it can be sales amount by store by day. In this case, the fact table
would contain three columns: A date column, a store column, and a sales amount column.

Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields: one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more
lookup tables, but fact tables do not have direct relationships to one another. Dimensions and
hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup
tables.
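To make these terms concrete, here is a minimal sketch of a fact table and one lookup table for the sales-by-store-by-day example above (all names and data types are illustrative):

    -- One lookup (dimension) table: the store dimension.
    CREATE TABLE store_lookup (
        store_key   INTEGER PRIMARY KEY,
        store_name  VARCHAR(100),
        city        VARCHAR(100),
        state       VARCHAR(50)
    );

    -- The fact table holds the measure (sales_amount) at the chosen granularity,
    -- with foreign keys pointing to the lookup tables.
    CREATE TABLE sales_fact (
        sales_date   DATE    NOT NULL,
        store_key    INTEGER NOT NULL REFERENCES store_lookup (store_key),
        sales_amount DECIMAL(12,2),
        PRIMARY KEY (sales_date, store_key)
    );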

In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.

Whether one uses a star or a snowflake largely depends on personal preference and business
needs. Personally, I am partial to snowflakes, when there is a business case to analyze the
information at that particular level.

Fact Table Granularity

Granularity

The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:

1. Determine which dimensions will be included.


2. Determine where along the hierarchy of each dimension the information will be kept.

The determining factors usually go back to the requirements.

Which Dimensions To Include

Determining which dimensions to include is usually a straightforward process, because business processes will often dictate clearly what the relevant dimensions are.

For example, in an off-line retail world, the dimensions for a sales fact table are usually time, geography, and product. This list, however, is by no means complete for all off-line retailers. A supermarket with a Rewards Card program -- where customers provide some personal information in exchange for a rewards card, and the supermarket offers lower prices on certain items to customers who present a rewards card at checkout -- will also have the ability to track the customer dimension. Whether the data warehousing system includes the customer dimension is then a decision that needs to be made.
What Level Within Each Dimension To Include

Determining at which level of its hierarchy the information is stored along each dimension is a bit trickier. This is where user requirements (both stated and possibly future) play a major role.

In the above example, will the supermarket want to do analysis at the hourly level? (i.e., looking at how certain products sell during different hours of the day.) If so, it makes sense to use 'hour' as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then 'day' can be used as the lowest level of granularity. Since the lower the level of detail, the larger the amount of data in the fact table, the granularity exercise is in essence figuring out the sweet spot in the tradeoff between the level of detail available for analysis and data storage.

Note that sometimes the users will not specify certain requirements, but based on industry knowledge, the data warehousing team may foresee that certain requirements will be forthcoming that result in the need for additional detail. In such cases, it is prudent for the data warehousing team to design the fact table such that lower-level information is included. This will avoid possibly needing to re-design the fact table in the future. On the other hand, trying to anticipate all future requirements is an impossible and hence futile exercise, and the data warehousing team needs to fight the urge to dump the lowest level of detail into the data warehouse, and include only what is practically needed. Sometimes this can be more of an art than a science, and prior experience becomes invaluable here.

Fact And Fact Table Types

Types of Facts

There are three types of facts:

Additive: Additive facts are facts that can be summed up through all of the dimensions
in the fact table.
Semi-Additive: Semi-additive facts are facts that can be summed up for some of the
dimensions in the fact table, but not the others.

Non-Additive: Non-additive facts are facts that cannot be summed up for any of the
dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that
we are a retailer, and we have a fact table with the following columns:

Date
Store

Product

Sales_Amount

The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.
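For instance, a query along these lines (column names adapted slightly for SQL; the sales_fact table is illustrative) sums the additive fact along the date dimension to get weekly sales per store:

    -- Summing the additive Sales_Amount fact across the 7 days of a week.
    SELECT store, SUM(sales_amount) AS weekly_sales
    FROM sales_fact
    WHERE sales_date BETWEEN DATE '2003-01-13' AND DATE '2003-01-19'
    GROUP BY store;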

Say we are a bank with the following fact table:

Date

Account

Current_Balance

Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each
day, as well as the profit margin for each account for each
day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive
fact, as it makes sense to add them up for all accounts (what's the total current balance for all
accounts in the bank?), but it does not make sense to add them up through time (adding up all
current balances for a given account for each day of the month does not give us any useful
information). Profit_Margin is a non-additive fact, for it does not make sense to add them up for
the account level or the day level.

Types of Fact Tables

Based on the above classifications, there are two types of fact tables:

Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts in this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.
Snapshot: This type of fact table describes the state of things in a particular instance of
time, and usually includes more semi-additive and non-additive facts. The second
example presented here is a snapshot fact table.

Star Schema

In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. Each dimension is represented as a single table. The primary key in each dimension table is related to a foreign key in the fact table.

Sample star schema

All measures in the fact table are related to all the dimensions that the fact table is related to. In other words, they all have the same level of granularity.

A star schema can be simple or complex. A simple star consists of one fact table; a complex
star can have more than one fact table.

Let's look at an example: Assume our data warehouse keeps store sales data, and the different dimensions are time, store, product, and customer. In this case, the figure on the left represents our star schema. The lines between two tables indicate that there is a primary key / foreign key relationship between the two tables. Note that different dimensions are not related to one another.

Snowflake Schema

The snowflake schema is an extension of the star schema, where each point of the star
explodes into more points. In a star schema, each dimension is represented by a single
dimensional table, whereas in a snowflake schema, that dimensional table is normalized into
multiple lookup tables, each representing a level in the dimensional hierarchy.

Sample snowflake schema

For example, consider a Time Dimension that consists of 2 different hierarchies:

1. Year → Month → Day

2. Week → Day

We will have 4 lookup tables in a snowflake schema: A lookup table for year, a lookup table for
month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is
then connected to Day. Week is only connected to Day. A sample snowflake schema illustrating
the above relationships in the Time Dimension is shown to the right.
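A minimal sketch of these four lookup tables might look as follows (names and data types are illustrative):

    CREATE TABLE year_lookup (
        year_key   INTEGER PRIMARY KEY,
        year_value INTEGER
    );

    CREATE TABLE month_lookup (
        month_key  INTEGER PRIMARY KEY,
        month_name VARCHAR(20),
        year_key   INTEGER REFERENCES year_lookup (year_key)    -- Year connects to Month
    );

    CREATE TABLE week_lookup (
        week_key    INTEGER PRIMARY KEY,
        week_number INTEGER
    );

    CREATE TABLE day_lookup (
        day_key   INTEGER PRIMARY KEY,
        day_date  DATE,
        month_key INTEGER REFERENCES month_lookup (month_key),  -- Month connects to Day
        week_key  INTEGER REFERENCES week_lookup (week_key)     -- Week connects to Day
    );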

The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joins against smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.

Slowly Changing Dimensions


The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In
a nutshell, this applies to cases where the attribute for a record varies over time. We give an
example below:

Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in
the customer lookup table has the following record:

Customer Key Name State


1001 Christina Illinois

At a later date, she moved to Los Angeles, California, in January 2003. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.

There are in general three ways to solve this type of problem, and they are categorized as
follows:

Type 1: The new record replaces the original record. No trace of the old record exists.

Type 2: A new record is added into the customer dimension table. Therefore, the customer is
treated essentially as two people.

Type 3: The original record is modified to reflect the change.

We next take a look at each of the scenarios and what the data model and the data look like for each of them. Finally, we compare and contrast the three alternatives.

Type 1 Slowly Changing Dimension

In Type 1 Slowly Changing Dimension, the new information simply overwrites the original
information. In other words, no history is kept.

In our example, recall we originally have the following table:

Customer Key Name State


1001 Christina Illinois

After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:

Customer Key Name State


1001 Christina California
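In SQL, a Type 1 change is a simple in-place update (a sketch; the customer_dim table and column names are hypothetical):

    -- Overwrite the old state; no history is preserved.
    UPDATE customer_dim
    SET state = 'California'
    WHERE customer_key = 1001;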

Advantages:

- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no
need to keep track of the old information.
Disadvantages:

- All history is lost. By applying this methodology, it is not possible to trace back in history. For
example, in this case, the company would not be able to know that Christina lived in Illinois
before.

Usage:

About 50% of the time.

When to use Type 1:

Type 1 slowly changing dimension should be used when it is not necessary for the data
warehouse to keep track of historical changes.

Type 2 Slowly Changing Dimension

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.

In our example, recall we originally have the following table:

Customer Key Name State


1001 Christina Illinois

After Christina moved from Illinois to California, we add the new information as a new row into
the table:

Customer Key Name State


1001 Christina Illinois
1005 Christina California
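In SQL, a Type 2 change keeps the old row and adds a new one with its own surrogate key (a sketch; the customer_dim table is hypothetical, and many implementations also add effective/expiry date columns):

    -- The original row (key 1001) is left untouched; a new row is inserted.
    INSERT INTO customer_dim (customer_key, name, state)
    VALUES (1005, 'Christina', 'California');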

Advantages:

- This allows us to accurately keep all historical information.

Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the
table is very high to start with, storage and performance can become a concern.

- This necessarily complicates the ETL process.

Usage:

About 50% of the time.

When to use Type 2:

Type 2 slowly changing dimension should be used when it is necessary for the data warehouse
to track historical changes.

Type 3 Slowly Changing Dimension

In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular
attribute of interest, one indicating the original value, and one indicating the current value. There
will also be a column that indicates when the current value becomes active.

In our example, recall we originally have the following table:

Customer Key Name State


1001 Christina Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:

Customer Key
Name

Original State

Current State

Effective Date

After Christina moved from Illinois to California, the original information gets updated, and we
have the following table (assuming the effective date of change is January 15, 2003):

Customer Key | Name      | Original State | Current State | Effective Date
1001         | Christina | Illinois       | California    | 15-JAN-2003
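
A minimal sketch of the Type 3 change, assuming the hypothetical CUSTOMER_DIM table now carries
Original_State, Current_State, and Effective_Date columns (Original_State keeps the value
'Illinois' and is not touched):

    -- Type 3: only the current-value columns are overwritten; the original value stays.
    UPDATE CUSTOMER_DIM
    SET    Current_State  = 'California',
           Effective_Date = DATE '2003-01-15'
    WHERE  Customer_Key = 1001;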

Advantages:

- This does not increase the size of the table, since the existing record is updated in place.

- This allows us to keep some part of history.

Disadvantages:

- Type 3 will not be able to keep all history where an attribute is changed more than once. For
example, if Christina later moves to Texas on December 15, 2003, the California information will
be lost.

Usage:

Type 3 is rarely used in actual practice.

When to use Type 3:

Type 3 slowly changing dimension should only be used when it is necessary for the data
warehouse to track historical changes, and when such changes will only occur a finite number
of times.

Data Modeling - Conceptual, Logical, And Physical Data Models

The three levels of data modeling, the conceptual data model, the logical data model, and the
physical data model, were discussed in prior sections. Here we compare these three types of data
models. The table below compares the different features:

Feature              | Conceptual | Logical | Physical
Entity Names         | Yes        | Yes     |
Entity Relationships | Yes        | Yes     |
Attributes           |            | Yes     |
Primary Keys         |            | Yes     | Yes
Foreign Keys         |            | Yes     | Yes
Table Names          |            |         | Yes
Column Names         |            |         | Yes
Column Data Types    |            |         | Yes

Below we show the conceptual, logical, and physical versions of a single data model.

[Figures: Conceptual Model Design, Logical Model Design, and Physical Model Design]

We can see that the complexity increases from conceptual to logical to physical. This is why we
always start with the conceptual data model (so we understand at a high level what the different
entities in our data are and how they relate to one another), then move on to the logical data
model (so we understand the details of our data without worrying about how they will actually be
implemented), and finally the physical data model (so we know exactly how to implement our data
model in the database of choice). In a data warehousing project, sometimes the conceptual data
model and the logical data model are considered as a single deliverable.

Data Integrity

Data integrity refers to the validity of data, meaning data is consistent and correct. In the data
warehousing field, we frequently hear the term, "Garbage In, Garbage Out." If there is no data
integrity in the data warehouse, any resulting report and analysis will not be useful.

In a data warehouse or a data mart, there are three areas where data integrity needs to be
enforced:

Database level
We can enforce data integrity at the database level. Common ways of enforcing data integrity
include:

Referential integrity

The relationship between the primary key of one table and the foreign key of another table must
always be maintained. For example, a primary key cannot be deleted if there is still a foreign key
that refers to this primary key.

Primary key / Unique constraint

Primary keys and the UNIQUE constraint are used to make sure every row in a table can be
uniquely identified.

Not NULL vs NULL-able

Columns identified as NOT NULL may not contain a NULL value.

Valid Values

Only allowed values are permitted in the database. For example, if a column can only hold
positive integers, a value of '-1' is not allowed.
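
To make the four mechanisms concrete, the hedged sketch below declares them in one place; the
table and column names (CUSTOMER_DIM, ORDER_FACT, quantity, and so on) are hypothetical:

    -- Primary key, unique constraint, and NOT NULL on the dimension table.
    CREATE TABLE CUSTOMER_DIM (
        customer_key  INTEGER      NOT NULL,
        customer_id   VARCHAR(20)  NOT NULL UNIQUE,   -- natural key must be unique
        customer_name VARCHAR(100),                   -- NULL-able column
        PRIMARY KEY (customer_key)
    );

    -- Referential integrity and a valid-values (CHECK) rule on the fact table.
    CREATE TABLE ORDER_FACT (
        order_key    INTEGER NOT NULL,
        customer_key INTEGER NOT NULL,
        quantity     INTEGER NOT NULL CHECK (quantity > 0),  -- only positive values allowed
        PRIMARY KEY (order_key),
        FOREIGN KEY (customer_key) REFERENCES CUSTOMER_DIM (customer_key)
    );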

ETL process

For each step of the ETL process, data integrity checks should be put in place to ensure that
source data is the same as the data in the destination. Most common checks include record
counts or record sums.
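
One simple form such a check can take, sketched with hypothetical staging and warehouse tables
(STG_SALES and FACT_SALES):

    -- Compare row counts and a control sum between the staging table and the target table.
    SELECT (SELECT COUNT(*)          FROM STG_SALES)  AS source_row_count,
           (SELECT COUNT(*)          FROM FACT_SALES) AS target_row_count,
           (SELECT SUM(sales_amount) FROM STG_SALES)  AS source_amount_sum,
           (SELECT SUM(sales_amount) FROM FACT_SALES) AS target_amount_sum;
    -- A mismatch between the counts or the sums flags the load for investigation.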

Access level

We need to ensure that data is not altered by any unauthorized means either during the ETL
process or in the data warehouse. To do this, there needs to be safeguards against
unauthorized access to data (including physical access to the servers), as well as logging of all
data access history. Data integrity can only be ensured if there is no unauthorized access to the
data.

What Is OLAP

OLAP stands for On-Line Analytical Processing. The first attempt to provide a definition to OLAP
was by Dr. Codd, who proposed 12 rules for OLAP. Later, it was discovered that this particular
white paper was sponsored by one of the OLAP tool vendors, thus causing it to lose objectivity.
The OLAP Report has proposed the FASMI test: Fast Analysis of Shared Multidimensional
Information. For a more detailed description of both Dr. Codd's rules and the FASMI test, please
visit The OLAP Report.

For people on the business side, the key feature among the above is "Multidimensional." In
other words, the ability to analyze metrics along different dimensions such as time, geography,
gender, product, etc. For example, sales for the company are up. What region is most responsible
for this increase? Which store in this region is most responsible for the increase? What
particular product category or categories contributed the most to the increase? Answering these
types of questions in order means that you are performing an OLAP analysis.
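
As a hedged sketch of this kind of drill-down in SQL, assume hypothetical FACT_SALES, DIM_STORE,
and DIM_REGION tables; each question simply narrows the previous query with another predicate:

    -- Step 1: which region is most responsible for the increase?
    SELECT r.region_name, SUM(f.sales_amount) AS total_sales
    FROM   FACT_SALES f
    JOIN   DIM_STORE  s ON f.store_key  = s.store_key
    JOIN   DIM_REGION r ON s.region_key = r.region_key
    GROUP BY r.region_name
    ORDER BY total_sales DESC;

    -- Step 2: within the top region (say 'West'), which store is most responsible?
    SELECT s.store_name, SUM(f.sales_amount) AS total_sales
    FROM   FACT_SALES f
    JOIN   DIM_STORE  s ON f.store_key  = s.store_key
    JOIN   DIM_REGION r ON s.region_key = r.region_key
    WHERE  r.region_name = 'West'
    GROUP BY s.store_name
    ORDER BY total_sales DESC;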

Depending on the underlying technology used, OLAP can be broadly divided into two different
camps: MOLAP and ROLAP. A discussion of the different OLAP types can be found in
the MOLAP, ROLAP, and HOLAP section.

MOLAP, ROLAP, And HOLAP

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and
Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP
and ROLAP.

MOLAP

This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary formats.

Advantages:

Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for
slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the
cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:

Limited in the amount of data it can handle: Because all calculations are performed
when the cube is built, it is not possible to include a large amount of data in the cube
itself. This is not to say that the data in the cube cannot be derived from a large amount
of data. Indeed, this is possible. But in this case, only summary-level information will be
included in the cube itself.
Requires additional investment: Cube technologies are often proprietary and do not
already exist in the organization. Therefore, to adopt MOLAP technology, chances are that
additional investments in human and capital resources are needed.

ROLAP

This methodology relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of
slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
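
For example, slicing sales down to a single year and then dicing by month might be generated by
a ROLAP tool as something like the following; the table and column names are hypothetical:

    -- Each slice or dice the user performs adds another predicate to the WHERE clause.
    SELECT p.category_name, SUM(f.sales_amount) AS total_sales
    FROM   FACT_SALES f
    JOIN   DIM_PRODUCT p ON f.product_key = p.product_key
    JOIN   DIM_DATE    d ON f.date_key    = d.date_key
    WHERE  d.year_number  = 2003   -- slice on year
      AND  d.month_number = 1      -- dice further down to a single month
    GROUP BY p.category_name;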

Advantages:

Can handle large amounts of data: The data size limitation of ROLAP technology is the
limitation on data size of the underlying relational database. In other words, ROLAP itself
places no limitation on data amount.
Can leverage functionalities inherent in the relational database: Often, the relational
database already comes with a host of functionalities. ROLAP technologies, since they
sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:

Performance can be slow: Because each ROLAP report is essentially a SQL query (or
multiple SQL queries) in the relational database, the query time can be long if the
underlying data size is large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on generating
SQL statements to query the relational database, and SQL statements do not fit all
needs (for example, it is difficult to perform complex calculations using SQL), ROLAP
technologies are therefore traditionally limited by what SQL can do. ROLAP vendors
have mitigated this risk by building into the tool out-of-the-box complex functions as well
as the ability to allow users to define their own functions.

HOLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-
type information, HOLAP leverages cube technology for faster performance. When detail
information is needed, HOLAP can "drill through" from the cube into the underlying relational
data.

Bill Inmon vs. Ralph Kimball


In the data warehousing field, we often hear discussions on whether a person's or an
organization's philosophy falls into Bill Inmon's camp or into Ralph Kimball's camp. We describe
below the difference between the two.

Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system.
An enterprise has one data warehouse, and data marts source their information from the data
warehouse. In the data warehouse, information is stored in 3rd normal form.

Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the
enterprise. Information is always stored in the dimensional model.

There is no right or wrong between these two ideas, as they represent different data
warehousing philosophies. In reality, the data warehouses in most enterprises are closer to Ralph
Kimball's idea. This is because most data warehouses started out as a departmental effort, and
hence they originated as data marts. Only when more data marts are built later do they evolve
into a data warehouse.

Factless Fact Table

A factless fact table is a fact table that does not have any measures. It is essentially an
intersection of dimensions. On the surface, a factless fact table does not make sense, since a
fact table is, after all, about facts. However, there are situations where having this kind of
relationship makes sense in data warehousing.

For example, think about a record of student attendance in classes. In this case, the fact table
would consist of 3 dimensions: the student dimension, the time dimension, and the class
dimension. This factless fact table would look like the following:

The only measure that you can possibly attach to each combination is "1" to show the presence
of that particular combination. However, adding a fact that always shows 1 is redundant because
we can simply use the COUNT function in SQL to answer the same questions.

Factless fact tables offer the most flexibility in data warehouse design. For example, one can
easily answer the following questions with this factless fact table:

How many students attended a particular class on a particular day?


How many classes on average does a student attend on a given day?

Without using a factless fact table, we will need two separate fact tables to answer the above
two questions. With the above factless fact table, it becomes the only fact table that's needed.
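
A hedged sketch of both queries against a hypothetical ATTENDANCE_FACT table that holds only the
three dimension keys:

    -- How many students attended a particular class on a particular day?
    SELECT COUNT(*) AS students_attended
    FROM   ATTENDANCE_FACT
    WHERE  class_key = 301          -- hypothetical class key
      AND  date_key  = 20030115;    -- hypothetical date key

    -- How many classes, on average, does a student attend on a given day?
    SELECT AVG(classes_attended) AS avg_classes_per_day
    FROM  (SELECT student_key, date_key, COUNT(*) AS classes_attended
           FROM   ATTENDANCE_FACT
           GROUP BY student_key, date_key) per_student_day;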

Junk Dimension

In data warehouse design, we frequently run into a situation where there are yes/no indicator
fields in the source system. Through business analysis, we know it is necessary to keep that
information in the fact table. However, if we keep all those indicator fields in the fact table,
not only do we need to build many small dimension tables, but the amount of information stored
in the fact table also increases tremendously, leading to possible performance and management
issues.

A junk dimension is a way to solve this problem. In a junk dimension, we combine these
indicator fields into a single dimension. This way, we only need to build a single dimension
table, and the number of fields in the fact table, as well as the size of the fact table, can be
decreased. The content of the junk dimension table is the combination of all possible values of
the individual indicator fields.

Let's look at an example. Assuming that we have the following fact table:

In this example, the last 3 fields are all indicator fields. In this existing format, each one of them
is a dimension. Using the junk dimension principle, we can combine them into a single junk
dimension, resulting in the following fact table:
Note that now the number of dimensions in the fact table went from 7 to 5.

The content of the junk dimension table would look like the following:
In this case, we have 3 possible values for the TXN_CODE field, 2 possible values for the
COUPON_IND field, and 2 possible values for the PREPAY_IND field. This results in a total of 3
x 2 x 2 = 12 rows for the junk dimension table.
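
One common way to populate such a junk dimension is to cross join the allowed values of each
indicator, as in this hedged sketch; JUNK_DIM and the specific TXN_CODE values ('SALE', 'RETURN',
'VOID') are hypothetical, and only the 3 x 2 x 2 arithmetic follows the example above:

    -- Build all 3 x 2 x 2 = 12 combinations of the three indicator fields.
    INSERT INTO JUNK_DIM (junk_key, txn_code, coupon_ind, prepay_ind)
    SELECT ROW_NUMBER() OVER (ORDER BY t.txn_code, c.coupon_ind, p.prepay_ind),
           t.txn_code, c.coupon_ind, p.prepay_ind
    FROM       (VALUES ('SALE'), ('RETURN'), ('VOID')) AS t(txn_code)
    CROSS JOIN (VALUES ('Y'), ('N'))                   AS c(coupon_ind)
    CROSS JOIN (VALUES ('Y'), ('N'))                   AS p(prepay_ind);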

By using a junk dimension to replace the 3 indicator fields, we have decreased the number of
dimensions by 2 and also decreased the number of fields in the fact table by 2. This will result
in a data warehousing environment that offers better performance and is easier to manage.

Conformed Dimension

A conformed dimension is a dimension that has exactly the same meaning and content when
referred to from different fact tables. A conformed dimension can refer to multiple tables in
multiple data marts within the same organization. For two dimension tables to be considered
conformed, they must either be identical or one must be a subset of the other. There cannot be
any other type of difference between the two tables. For example, two dimension tables that are
exactly the same except for the primary key are not considered conformed dimensions.

Why is a conformed dimension important? This goes back to the definition of a data
warehouse being "integrated." Integrated means that even if a particular entity had different
meanings and different attributes in the source systems, there must be a single version of this
entity once the data flows into the data warehouse.

The time dimension is a common conformed dimension in an organization. Usually the only
rules to consider with the time dimension are whether there is a fiscal year in addition to the
calendar year, and the definition of a week. Fortunately, both are relatively easy to resolve. In the
case of fiscal vs calendar year, one may go with either fiscal or calendar, or an alternative is to
have two separate conformed dimensions, one for fiscal year and one for calendar year. The
definition of a week is also something that can be different in large organizations: Finance may
use Saturday to Friday, while marketing may use Sunday to Saturday. In this case, we should
decide on a definition and move on. The nice thing about the time dimension is once these rules
are set, the values in the dimension table will never change. For example, October 16th will
never become the 15th day in October.

Not all conformed dimensions are as easy to produce as the time dimension. An example is the
customer dimension. In any organization with some history, there is a high likelihood that
different customer databases exist in different parts of the organization. To achieve a conformed
customer dimension means those data must be compared against each other, rules must be
set, and data must be cleansed. In addition, when we are doing incremental data loads into the
data warehouse, we'll need to apply the same rules to the new values to make sure we are only
adding truly new customers to the customer dimension.

Building a conformed dimension is also part of the process of master data management, or MDM.
In MDM, one must not only make sure the master data dimensions are conformed, but that
conformity also needs to be brought back to the source systems.

Master Data Management

Master Data Management (MDM) refers to the process of creating and managing the data that an
organization must have as a single master copy, called the master data. Usually, master data
can include customers, vendors, employees, and products, but it can differ between industries
and even between companies within the same industry. MDM is important because it offers the
enterprise a single version of the truth. Without clearly defined master data, the enterprise
runs the risk of having multiple copies of data that are inconsistent with one another.

MDM is typically more important in larger organizations. In fact, the bigger the organization, the
more important the discipline of MDM is, because a bigger organization means that there are
more disparate systems within the company, and the difficulty of providing a single source of
truth, as well as the benefit of having master data, grows with each additional data source. A
particularly big challenge to maintaining master data occurs when there is a merger/acquisition.
Each of the organizations will have its own master data, and how to merge the two sets of data
will be challenging. Let's take a look at the customer files: The two companies will likely have
different unique identifiers for each customer. Addresses and phone numbers may not match.
One may have a person's maiden name and the other the current last name. One may have a
nickname (such as "Bill") and the other may have the full name (such as "William"). All these
contribute to the difficulty of creating and maintaining a single set of master data.

At the heart of the master data management program is the definition of the master data.
Therefore, it is essential that we identify who is responsible for defining and enforcing the
definition. Due to the importance of master data, a dedicated person or team should be
appointed. At the minimum, a data steward should be identified. The responsible party can also
be a group -- such as a data governance committee or a data governance council.

Master Data Management vs Data Warehousing

Based on the discussions so far, it seems like Master Data Management and Data Warehousing
have a lot in common. For example, the effort of data transformation and cleansing is very
similar to an ETL process in data warehousing, and in fact they can use the same ETL tools. In
the real world, it is not uncommon to see MDM and data warehousing fall into the same project.
On the other hand, it is important to call out the main differences between the two:

1) Different Goals

The main purpose of a data warehouse is to analyze data in a multidimensional fashion, while
the main purpose of MDM is to create and maintain a single source of truth for a particular
dimension within the organization. In addition, MDM requires solving the root cause of the
inconsistent master data, because master data needs to be propagated back to the source systems
in some way. In data warehousing, solving the root cause is not always needed, as it may be
enough just to have a consistent view at the data warehousing level rather than having to ensure
consistency at the data source level.

2) Different Types of Data

Master Data Management is only applied to entities, not transactional data, while a data
warehouse includes data that are both transactional and non-transactional in nature. The
easiest way to think about this is that MDM only affects data that lives in dimension tables
and not in fact tables, while a data warehousing environment includes both dimension tables
and fact tables.

3) Different Reporting Needs

In data warehousing, it is important to deliver to end users the proper types of reports using the
proper type of reporting tool to facilitate analysis. In MDM, the reporting needs are very different
-- it is far more important to be able to provide reports on data governance, data quality, and
compliance, rather than reports based on analytical needs.

4) Where Data Is Used


In a data warehouse, usually the only usage of this "single source of truth" is for applications
that access the data warehouse directly, or applications that access systems that source their
data straight from the data warehouse. Most of the time, the original data sources are not
affected. In master data management, on the other hand, we often need to have a strategy to
get a copy of the master data back to the source system. This poses challenges that do not exist
in a data warehousing environment. For example, how do we sync the data back with the
original source? Once a day? Once an hour? How do we handle cases where the data was
modified as it went through the cleansing process? And how much modification do we need to
make to the source system so it can use the master data? These questions represent some
of the challenges MDM faces. Unfortunately, there is no easy answer to those questions, as the
solution depends on a variety of factors specific to the organization, such as how many source
systems there are, how easy / costly it is to modify the source system, and even how internal
politics play out.
