
Data Warehousing: Similarities and Differences of Inmon and Kimball


How do the two architectures differ? How great is the chasm? Is there any common ground? This article attempts to draw out the similarities and differences between the Inmon and Kimball approaches to the data warehouse. On the subject of what the data warehouse is and what the data marts are, both Kimball and Inmon have spoken: "The data warehouse is nothing more than the union of all the data marts" (Ralph Kimball, Dec. 29, 1997). "You can catch all the minnows in the ocean and stack them together and they still do not make a whale" (Bill Inmon, Jan. 8, 1998). The Corporate Information Factory (CIF) and the Kimball Data Warehouse Bus (BUS) are considered the two main types of data warehousing architecture. Accordingly, the two architectures have some elements in common.

All enterprises require a means to store, analyze and interpret the data they generate and accumulate in order to make critical decisions, ranging from simply continuing to exist to maximizing prosperity. Corporations must develop operating and feedback systems that use the underlying data resource (the data warehouse) to achieve their goals. Both the CIF and BUS architectures satisfy these criteria. Another requirement of any data warehouse architecture is that the user can depend on the accuracy and timeliness of the data. The user must also be able to access the data according to his or her particular needs through an easily understandable and straightforward manner of making queries. The data that one user extracts in this manner should be compatible with and translatable to other operations and users within the same group or enterprise that rely on the same data.

Both Inmon and Kimball share the opinion that stand-alone or independent data marts or data warehouses do not satisfy the needs for accurate and timely data and ease of access for users on an enterprise or corporate scale. In an article for the Business Intelligence Network, Mr. Inmon writes: "Independent data marts may work well when there are only a few data marts. But over time there are never only a few data marts ... Once there are a lot of data marts, the independent data mart approach starts to fall apart. There are many reasons why independent data marts built directly from a legacy/source environment fall apart:

- There is no single source of data for analytical processing;
- There is no easy reconcilability of data values;
- There is no foundation to build on for new data marts;
- An independent data mart is rarely reusable for other purposes;
- There are too many interface programs to be built and maintained;
- There is a massive redundancy of detailed data in each data mart ... because there is no common place where that detailed data is collected and integrated;
- There is no convenient place for historical data;
- There is no low level of granularity guaranteed for all data marts to use;
- Each data mart integrates data from the source systems in a unique way, which does not permit reconcilability or integrity of the data across the enterprise; and
- The window for extracting data from the legacy environment is stretched, with each independent data mart requiring its own window of time for extraction."

In "Differences of Opinion" (previously cited), Mr. Kimball gives his opinion of independent data marts: "Finally, stand-alone data marts or warehouses are problematic. These independent silos are built to satisfy specific needs, without regard to other existing or planned analytic data. They tend to be departmental in nature, often loosely dimensionally structured. Although often perceived as the path of least resistance because no coordination is required, the independent approach is unsustainable in the long run. Multiple, uncoordinated extracts from the same operational sources are inefficient and wasteful. They generate similar, but different variations with inconsistent naming conventions and business rules. The conflicting results cause confusion, rework and reconciliation. In the end, decision-making based on independent data is often clouded by fear, uncertainty and doubt."

It appears from the above that both Inmon and Kimball are of the opinion that independent or stand-alone data marts are of marginal use. However, for the most part, this is where the perception of similarity stops. You may discern later, as I have, that there are more similarities, but each of our data warehouse architects expresses them in a very different way. Inmon believes that Kimball's star schema-only approach causes inflexibility and therefore leads to a brittle structure. He writes: "this basic lack of flexibility is at the heart of the weakness of the star schema model as the basis of the data warehouse ... When there is an enterprise need for data the star schema is not at all optimal. Taken together, a series of star schemas and multi-dimensional tables are brittle ... [They] cannot change gracefully over time." Mr. Inmon believes his approach, which uses the dependent data mart as the source for star schema usage, solves the problem of enterprise-wide access to the same data, which can change over time.

Inmon holds that the relational data warehouse is best served by a relational [3NF] database design running on relational technology. This should be no surprise, since the DBMS technology the data warehouse runs on works best with a relational database design. In contrast, the Kimball BUS architecture holds that "raw data is transformed into presentable information in the staging area, ever mindful of throughput and quality. Staging begins with coordinated extracts from the operational source systems. Some staging 'kitchen' activities are centralized, such as maintenance and storage of common reference data, while others may be distributed" ("Data Warehouse Dining Experience," Intelligent Enterprise, Jan. 1, 2004).

The above indicates to this author that Kimball has gone beyond the individual star schema approach criticized by Inmon and has, in fact, described his own multi-dimensional data warehouse. In this approach, the model contains both the atomic data and the summarized data, but its construction is based on business measurements, which enables disparate business departments to query the data from a higher level of detail down to the lowest level without reprogramming. Although this description appears to indicate that the Kimball staging area is very similar to the Inmon data warehouse, the Kimball approach does not recommend a real, physically implemented data warehouse. His data warehouse is still the collection of data marts with their conformed dimensions.

In Mastering Data Warehouse Design: Relational and Dimensional Techniques (Wiley, 2003), Claudia Imhoff, Nicholas Galemmo and Jonathan Geiger analyze the Kimball approach as relying on star schemas for both atomic and aggregated storage. Summarizing this point of their research, the Data Warehouse Bus Architecture is said to consist of two types of data marts:

- Atomic data marts, which hold multi-dimensional data at the lowest level of detail and can also include aggregated data for improved query performance; and
- Aggregated data marts, which can store data according to a core business process.

In both the atomic and aggregated data marts, the data is stored in a star schema design. Their description of the Kimball Bus Architecture seems to indicate that the Kimball approach still neither recognizes a need for nor requires a central data warehouse repository.
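To make the atomic-versus-aggregated distinction concrete, here is a minimal sketch (my own illustration, not drawn from either author's work) of a star schema in SQLite: an atomic sales fact table joined to dimension tables, with a monthly summary rolled up from the same atomic grain. All table names, column names, and data values are assumptions made for the example.

```python
# Minimal star-schema sketch (illustrative only): an atomic fact table with
# dimension tables, plus an aggregate derived from the same atomic grain.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, region TEXT);

-- Atomic fact: one row per sales line item (the lowest level of granularity).
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    quantity     INTEGER,
    sales_amount REAL
);
""")

con.executemany("INSERT INTO dim_date VALUES (?,?,?,?)",
                [(20240101, "2024-01-01", "2024-01", 2024),
                 (20240215, "2024-02-15", "2024-02", 2024)])
con.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
con.executemany("INSERT INTO dim_customer VALUES (?,?,?)",
                [(10, "Acme Corp", "East"), (11, "Globex", "West")])
con.executemany("INSERT INTO fact_sales VALUES (?,?,?,?,?)",
                [(20240101, 1, 10, 3, 30.0),
                 (20240101, 2, 11, 1, 25.0),
                 (20240215, 1, 10, 2, 20.0)])

# An "aggregated" rollup built from the atomic fact; the same dimensions let a
# query drill from monthly totals back down to individual line items.
monthly = con.execute("""
    SELECT d.month, p.category, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.month, p.category
""").fetchall()
print(monthly)  # e.g. [('2024-01', 'Hardware', 55.0), ('2024-02', 'Hardware', 20.0)]
```

Because the summary is derived from the atomic fact table through the same dimensions, a user can move from a monthly total back down to individual line items without encountering a different data structure.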

The next article will highlight the differences in the two models regarding relational vs. multidimensional data.

Data Mart vs Data Warehouse - The Great Debate


There are far too many conflicting and confusing definitions of Data Mart and Data Warehouse floating around. The long-running debate between Ralph Kimball and Bill Inmon, the two titans of data warehousing, only adds to the confusion. In this post, we'll try to get some sanity around the concepts, without getting drawn (hopefully) into the crossfire.

A Data Mart is a specific, subject-oriented repository of data designed to answer specific questions for a specific set of users. So an organization could have multiple data marts serving the needs of marketing, sales, operations, collections, etc. A data mart is usually organized as a dimensional model, a star schema (or OLAP cube) made of a fact table and multiple dimension tables. In contrast, a Data Warehouse (DW) is a single organizational repository of enterprise-wide data across many or all subject areas. The Data Warehouse is the authoritative repository of all the fact and dimension data (which is also available in the data marts) at an atomic level. Unfortunately, this is where consensus begins to break down into chaos. There are two broad schools of thought, led by Kimball and Inmon, that disagree on the details.

Kimball School: Ralph Kimball began with the Data Mart as a dimensional model for departmental data and viewed the Data Warehouse as the enterprise-wide collection of Data Marts. This is the bottom-up approach. You may begin with the Sales Data Mart; after some time you put in place the Ops Data Mart, and so on. If you want, you could have even more specific Data Marts serving specific questions, such as customer churn. If you take care of consistency of metadata (making sure each departmental Data Mart calls an apple an apple) and connectivity, you have a Data Warehouse (a minimal sketch of such a metadata-consistency check follows this post). So the Data Warehouse is really a virtual collection of Data Marts collected together on a Data Warehouse Bus, and in that sense the data flows from multiple Marts into the Warehouse.

Inmon School: Inmon's approach is the exact opposite and avoids the problem of metadata consistency by looking at the Enterprise Data Warehouse as a single repository that feeds subject-oriented Data Marts. You still have your Sales, Marketing, Ops and Churn Data Marts containing atomic or aggregated information, but they are based on the Data Warehouse and are really subsets of the data contained therein. This is the top-down approach.

Kimball's approach is easier to implement, as you are dealing with smaller subject areas to begin with, but the end result often has metadata inconsistencies and can be a nightmare to integrate. Inmon's approach, on the other hand, does not defer the integration and consistency issues, but it takes far longer to implement (which makes it easier for the project to fail). Also, in my experience, organizations that are just starting to do analytics usually do not have the patience or commitment required for Inmon's approach. Any BI initiative is extremely iterative in nature. Unless you are confident that you would still have the CEO's buy-in and a budget one year down the line, it might be better to begin with a Data Mart (to start delivering, and to manage expectations), keeping the metadata consistency requirements in mind, and then scale towards the Data Warehouse. If you are interested in knowing more about the Great Debate, you can read an article by Katherine Drewek called "Data Warehousing: Similarities and Differences of Inmon and Kimball."

P.S.: This is what I think, and I'll be glad to hear of your experiences and views, especially if you have been a part of an Inmon-style implementation.
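As a rough illustration of the "calls an apple an apple" point above, the following sketch checks that two hypothetical departmental marts define a shared customer dimension identically before they are treated as conformed. The mart names and attributes are invented for the example, not taken from either methodology.

```python
# Illustrative sketch (not from Kimball or Inmon): checking that two departmental
# marts describe a shared "customer" dimension the same way before they are
# treated as part of one bus of conformed data marts.
sales_mart_customer = {"customer_key": "INTEGER", "customer_name": "TEXT", "region": "TEXT"}
ops_mart_customer   = {"customer_key": "INTEGER", "customer_name": "TEXT", "region": "TEXT"}

def is_conformed(dim_a: dict, dim_b: dict) -> bool:
    """Two marts 'call an apple an apple' only if the shared dimension has
    identical attribute names and types in both."""
    return dim_a == dim_b

print(is_conformed(sales_mart_customer, ops_mart_customer))  # True -> the marts can share the dimension
```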

What are the differences between an enterprise ODS and an e-business Web-based ODS?
For years now, the corporate information factory has recognized the operational data store (ODS) as a major component. The operational data store is the place where online transaction processing (OLTP) response times are possible. It is where operational data can be integrated and where both online processing and decision support system (DSS) analytical processing can be combined. Called by various names, such as real-time data warehousing and online data warehousing, the ODS is as ubiquitous as the data warehouse. However, there is another kind of ODS that occasionally appears in the corporate information factory, and that operational data store is found in the Web processing environment. When an organization has a large Web-based environment, it is common to have an ODS whose sole purpose is the care and feeding of the Web-based e-business environment. There are, then, two different kinds of ODSs that are sometimes found in the corporate information factory: an enterprise ODS and an e-business Web-based ODS. What are the differences between these two kinds of ODSs? The first and most obvious difference is the architectural positioning of the two. The enterprise operational data store is open to and serves the entire enterprise; all the different systems and networks across the enterprise have access to it. The e-business ODS has access only to the immediate Web site and the traffic and data coming off the Web. As an example of the differences between these two architectural placements, consider the contents of an enterprise ODS. The enterprise operational data store might have such things as:

- Enterprise customer lists;
- Enterprise customer profiles; and
- Data used across the enterprise.

In short, anything found inside the enterprise is subject to being found inside the enterprise ODS. Now consider what might be found in the e-business ODS, where there might be things such as:

- The relation between a cookie and other identifying information about a person using the Web environment;
- When a person last visited a Web site;
- The preferences of the person using the e-business Web site, based on past proclivities; and
- Promotions and special offers made to viewers of the Internet.

It is obvious that these types of ODSs are very different, and both serve different needs of the enterprise.
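As a loose sketch of the kind of record such an e-business ODS might hold, consider a visitor profile keyed by cookie. The class, field names, and values here are purely illustrative assumptions, not taken from Inmon's architecture.

```python
# Hypothetical sketch of a single e-business ODS record, keyed by the visitor's cookie.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class WebVisitorProfile:
    cookie_id: str                          # relates the cookie to other identifying information
    customer_id: Optional[str] = None       # link to a known person, if identified
    last_visit: Optional[datetime] = None   # when the person last visited the site
    preferences: dict = field(default_factory=dict)   # preferences inferred from past behavior
    offers_shown: list = field(default_factory=list)  # promotions and special offers presented

profile = WebVisitorProfile(
    cookie_id="a1b2c3",
    customer_id="CUST-0042",
    last_visit=datetime(2024, 3, 1, 14, 30),
    preferences={"preferred_category": "hardware"},
    offers_shown=["SPRING10"],
)
print(profile.cookie_id, profile.last_visit)
```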


Bill Inmon is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.

Different Aspects of Data Warehouse Architecture


This page is a list of the different aspects of data warehouse architecture. Architecture is a pretty nebulous term. I think of architecture as a system design decision that is usually not easily changed. The decision is not easily changed because of the amount of work, money, and politics involved in doing so. This is a list of the aspects of architecture that data warehouse decision makers will have to deal with themselves. There are many other architecture issues that affect the data warehouse, e.g., network topology, but these have to be made with all of an organization's systems in mind (and with people other than the data warehouse team being the main decision makers). This list will not attempt to provide detailed explanations of the different types of architecture. Rather, I am presenting this list because the data warehousing literature usually muddles the subject of architecture by lumping different types of decisions together or by forgetting certain types of decisions. Also, the literature makes these decisions seem much more black and white than they are. For example, in the area of what I call reporting and staging data store architecture, much of the literature discusses only the "enterprise" data warehouse, the dependent data mart, and the independent data mart options. In reality, there are many more variations in use that cannot easily be given a snappy label.

Data consistency architecture


Doug Hackney's excellent but confusingly titled article on what he calls incremental data mart enterprise architecture is the most succinct statement of what this means. This is the choice of which data sources, dimensions, business rules, semantics, and metrics an organization chooses to put into common usage. (Though the article does not say it explicitly, it is also the equally important choice of which data sources, dimensions, business rules, semantics, and metrics an organization chooses not to put into common usage.) This is by far the hardest aspect of architecture to implement and maintain because it involves organizational politics. However, determining this architecture has more to do with determining the place of the data warehouse in your business than any other architectural decision. In my opinion, the decisions involved in determining this architecture should drive all other architectural decisions. Unfortunately, this architecture more often seems to be backed into than consciously decided.

Reporting data store and staging data store architecture


The main reasons we store data in data warehousing systems are so the data can be: 1) reported against, 2) cleaned up, and (sometimes) 3) transported to another data store where it can be reported against and/or cleaned up. Determining where we hold data to report against is what I call the reporting data store architecture. All other decisions are what I call staging data store architecture. As mentioned before, there are infinite variations of this architecture. Many writings on this aspect of architecture take on a religious overtone. That is, rather than discussing what will make the most sense for the organization implementing the data warehouse, the discussion is often one of architectural purity and beauty, or of the writer's conception of rightness and wrongness.

Data modeling architecture


This is the choice of whether you wish to use denormalized, normalized, object-oriented, proprietary multidimensional, or other data models. As you may guess, it makes perfect sense for an organization to use a variety of models.

Tool architecture
This is your choice of the tools you are going to use for reporting and for what I call infrastructure.

Processing tiers architecture


This is your choice of what physical platforms will do what pieces of the concurrent processing that takes place when using a data warehouse. This can range from an architecture as simple as host-based reporting to one as complicated as the diagram on page 32 of Ralph Kimball's "The Data Webhouse Toolkit".

Security architecture
If you need to restrict access down to the row or field level, you will probably have to use some means other than the usual security mechanisms at your organization. Note that while security may not be technically difficult to implement, it can cause political consternation.

As a final comment, let me assert that in the long run, decisions on data consistency architecture will probably have much more influence on the return on investment in the data warehouse than any other architectural decisions. To get the most return from a data warehouse (or any other system), business practices have to change in conjunction with or as a result of the system implementation. Conscious determination of data consistency architecture is almost always a prerequisite to using a data warehouse to effect business practice change.

Differences of Opinion
The Kimball bus architecture and the Corporate Information Factory: What are the fundamental differences?
By Margy Ross and Ralph Kimball

Based on recent inquiries, many of you are in the midst of architecting (or rearchitecting) your data warehouse. There's no dispute that planning your data warehouse from an enterprise perspective is a good idea, but do you need an enterprise data warehouse? It depends on your definition. In this column, we'll clarify the similarities and differences between the two dominant approaches to enterprise warehousing.

Common Ground
We all agree on some things related to data warehousing. First, at the most rudimentary level, nearly all organizations benefit from creating a data warehouse and analytic environment to support decision-making. Maybe it's like asking your barber if you need a haircut, but personal bias aside, businesses profit from well-implemented data warehouses. No one would attempt to run a business without operational processes and systems in place. Likewise, complementary analytic processes and systems are needed to leverage the operational foundation.

Second, the goal of any data warehouse environment is to publish the "right" data and make it easily accessible to decision makers. The two primary components of this environment are staging and presentation. The staging (or acquisition) area consists of extract-transform-load (ETL) processes and support. Once the data is properly prepared, it is loaded into the presentation (or delivery) area where a variety of query, reporting, business intelligence, and analytic applications are used to probe, analyze, and present data in endless combinations.

Both approaches agree that it's prudent to embrace the enterprise vantage point when architecting the data warehouse environment for long-term integration and extensibility. Although subsets of the data warehouse will be implemented in phases over time, it's beneficial to begin with the integrated end goal in mind during planning.

Finally, standalone data marts or warehouses in Figure 1 are problematic. These independent silos are built to satisfy specific needs, without regard to other existing or planned analytic data. They tend to be departmental in nature, often loosely dimensionally structured. Although often perceived as the path of least resistance because no coordination is required, the independent approach is unsustainable in the long run. Multiple, uncoordinated extracts from the same operational sources are inefficient and wasteful. They generate similar, but different variations with inconsistent naming conventions and business rules. The conflicting results cause confusion, rework and reconciliation. In the end, decision-making based on independent data is often clouded by fear, uncertainty, and doubt.

Figure 1: Independent data marts/warehouses.

So we all see eye to eye on some matters. Before turning to our differences of opinion, we'll review the two dominant approaches to enterprise data warehousing.

Kimball Bus Architecture


If you've been regularly reading the last 100 or so columns, you're familiar with the Kimball approach in Figure 2. As we described in our last column "Data Warehouse Dining Experience" (Jan. 1, 2004), raw data is transformed into presentable information in the staging area, ever mindful of throughput and quality. Staging begins with coordinated extracts from the operational source systems. Some staging "kitchen" activities are centralized, such as maintenance and storage of common reference data, while others may be distributed.

Figure 2: Dimensional data warehouse.

The presentation area is dimensionally structured, whether centralized or distributed. A dimensional model contains the same information as a normalized model, but packages it for ease-of-use and query performance. It includes both atomic detail and summarized information (aggregates in relational tables or multidimensional cubes) as required for performance or geographic distribution of the data. Queries descend to progressively lower levels of detail, without reprogramming by the user or application designer. Dimensional models are built by business process (corresponding to a business measurement or event), not business departments. For example, orders data is populated once in the dimensional data warehouse for enterprise access, rather than being replicated in three departmental marts for marketing, sales, and finance. Once foundation business processes are available in the warehouse, consolidated dimensional models deliver cross-process metrics. The enterprise data warehouse bus matrix identifies and enforces the relationships between business process metrics (facts) and descriptive attributes (dimensions).
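One way to picture the bus matrix is as a simple grid of business processes against the conformed dimensions they use. The sketch below is an illustrative assumption rather than Kimball's own artifact; the process and dimension names are invented for the example.

```python
# Illustrative sketch of an enterprise data warehouse bus matrix: rows are
# business processes (fact tables), columns are the conformed dimensions each
# process uses. Names are assumptions made for the example.
bus_matrix = {
    "orders":    {"date", "customer", "product"},
    "shipments": {"date", "customer", "product", "carrier"},
    "payments":  {"date", "customer"},
}

def shared_dimensions(matrix: dict) -> set:
    """Dimensions used by every process; these must be conformed so that
    results can be compared (drilled across) between processes."""
    return set.intersection(*matrix.values())

print(shared_dimensions(bus_matrix))  # {'date', 'customer'}
```

The dimensions that appear in every row are the ones that must be conformed most carefully, since they are what allow results from different processes to be combined.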

Corporate Information Factory


Figure 3 illustrates the Corporate Information Factory (CIF) approach, once known as the EDW approach. Like the Kimball approach, there are coordinated extracts from the source systems. From there, a third normal form (3NF) relational database containing atomic data is loaded. This normalized data warehouse is used to populate additional presentation data repositories, including special-purpose warehouses for exploration and data mining, as well as data marts.

Figure 3: Normalized data warehouse with summary dimensional marts (CIF).

In this scenario, the marts are tailored by business department/function with dimensionally structured summary data. Atomic data is accessible via the normalized data warehouse. Obviously, the atomic data is structured very differently from summarized information.
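A minimal sketch of this flow, with hypothetical table names and data, might look like the following: atomic data lives in normalized (3NF) tables, and a second ETL pass populates a departmental mart with dimensionally structured summary data.

```python
# Hypothetical sketch of the CIF-style flow: atomic data in a normalized (3NF)
# warehouse feeds a departmental mart holding summarized, dimensional data.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Normalized (3NF) warehouse tables holding atomic data.
CREATE TABLE customer   (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders     (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
CREATE TABLE order_line (order_id INTEGER, line_no INTEGER, amount REAL);

-- Departmental mart: summary data, structured by region and month.
CREATE TABLE mart_sales_summary (region TEXT, month TEXT, total_amount REAL);
""")
con.executemany("INSERT INTO customer VALUES (?,?)", [(1, "East"), (2, "West")])
con.executemany("INSERT INTO orders VALUES (?,?,?)",
                [(100, 1, "2024-01-05"), (101, 2, "2024-01-20")])
con.executemany("INSERT INTO order_line VALUES (?,?,?)",
                [(100, 1, 40.0), (100, 2, 15.0), (101, 1, 60.0)])

# Second ETL pass: summarize the atomic 3NF data into the mart.
con.execute("""
    INSERT INTO mart_sales_summary
    SELECT c.region, substr(o.order_date, 1, 7) AS month, SUM(l.amount)
    FROM order_line l
    JOIN orders o   ON l.order_id = o.order_id
    JOIN customer c ON o.customer_id = c.customer_id
    GROUP BY c.region, month
""")
print(con.execute("SELECT * FROM mart_sales_summary").fetchall())
# e.g. [('East', '2024-01', 55.0), ('West', '2024-01', 60.0)]
```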

Fundamental Differences
There are two fundamental differentiators between the CIF and Kimball approaches. The first concerns the need for a normalized data structure before loading the dimensional models. Although this is a requisite underpinning of the CIF, the Kimball approach says the data structures required prior to dimensional presentation depend on the source data realities, target data model, and anticipated transformation. Although we don't advocate centrally normalizing the atomic data prior to loading the dimensional targets, we don't absolutely prohibit it, presuming there's a real need, financial willpower (for both the redundant ETL development and data storage), and clear understanding of the two-step throughput. In the vast majority of cases, we find duplicative storage of core performance measurement data in both normalized and dimensional structures is unwarranted. Advocates of the normalized data structures claim it's faster to load than the dimensional model, but what sort of optimization is really achieved if the data needs to undergo ETL multiple times before being presented to the business?

The second primary difference between the two approaches is the treatment of atomic data. The CIF says atomic data should be stored in the normalized data warehouse. The Kimball approach says atomic data must be dimensionally structured. Of course, if you only provide summary information in a dimensional structure, you've "presupposed the questions." However, if you make atomic data available in dimensional structures, you always have the ability to summarize the data "any which way." We need the most finely grained data in our presentation area so that users can ask the most precise questions possible. Business users may not care about the details of a single atomic transaction, but we can't predict the ways they'll want to summarize the transaction activity (perhaps for all customers of a certain type living in a range of ZIP codes that have been customers for more than two years). Anyone who's worked side by side with a business analyst knows the questions asked are unpredictable and constantly changing. Details must be available so that they can be rolled up to answer the questions of the moment, without encountering a totally different data structure. Storing the atomic data in dimensional structures delivers this fundamental capability. If only summary data is dimensional, with the atomic data stored in normalized structures, then drilling into the details is often akin to running into a brick wall. Skilled professionals must intervene because the underlying data structures are so different.

Both approaches advocate enterprise data coordination and integration, but the implementations differ. The CIF says the normalized data warehouse fills this role. While normalized models communicate data relationships, they don't inherently apply any pressure to resolve data integration issues. Normalization alone doesn't begin to address the common data keys and labels required for integration. From the earliest planning activities, the Kimball approach uses the enterprise data warehouse bus architecture with common, conformed dimensions for integration and drill-across support. Common, conformed dimensions have consistent descriptive attribute names, values, and meanings. Likewise, conformed facts are consistently defined; if they don't use consistent business rules, then they're given a distinct name. Conformed dimensions are the backbone of any enterprise approach, as they provide the integration "glue." We've provided detailed design and implementation guidance regarding conformed dimensions in our books and columns. Conformed dimensions are typically built and maintained as central persistent master data in the staging area, then reused across the enterprise's presentation marts and warehouses to ensure data integration and semantic consistency. It may not be practical or useful to get everyone to agree to everything related to a dimension; however, conformity directly correlates to an organization's ability to integrate business results. Without conformity, you end up with isolated data that cannot be tied together. This situation perpetuates incompatible views of the enterprise, diverting attention to data inconsistencies and reconciliations while degrading the organization's decision-making ability.
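To illustrate the drill-across idea in the simplest possible terms, the sketch below aligns metrics from two hypothetical business processes on a shared, conformed date key. The data and names are assumptions made for the example, not part of either published method.

```python
# Illustrative sketch of "drill-across": metrics from two business processes can
# be combined only because they share a conformed date dimension (same keys,
# same meaning). Values and names are invented for the example.
orders_facts    = {"2024-01": 55.0, "2024-02": 20.0}   # ordered amount by conformed date key
shipments_facts = {"2024-01": 48.0, "2024-02": 20.0}   # shipped amount by the same date key

def drill_across(facts_a: dict, facts_b: dict) -> dict:
    """Align two processes on the conformed dimension and compare their metrics."""
    keys = sorted(set(facts_a) & set(facts_b))
    return {k: {"ordered": facts_a[k], "shipped": facts_b[k]} for k in keys}

print(drill_across(orders_facts, shipments_facts))
# {'2024-01': {'ordered': 55.0, 'shipped': 48.0}, '2024-02': {'ordered': 20.0, 'shipped': 20.0}}
```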

Hybrid Approach?
Some organizations adopt a hybrid approach, theoretically marrying the best of both the CIF and Kimball methods. As shown in Figure 4, the hybrid combines Figures 2 and 3: a normalized data warehouse from the CIF, plus a dimensional data warehouse of atomic and summary data based on the Kimball bus architecture.

Figure 4: Hybrid of normalized data warehouse and dimensional data warehouse.

Given the final presentation deliverable, this approach is viable. However, there are significant incremental costs and time lags associated with staging and storing atomic data redundantly. If you're starting with a fresh slate and appreciate the importance of presenting atomic data to avoid presupposing the business questions, then why would you want to ETL, store, and maintain the atomic details twice? Isn't the value proposition more compelling to focus the investment in resources and technology into appropriately publishing additional key performance metrics for the business? Of course, if you have already built a normalized data warehouse and now recognize the need for robust presentation capabilities to deliver value, then the hybrid approach lets you leverage your preexisting investment.

Success Criteria
When evaluating approaches, people often focus strictly on IT's perception of success, but it must be balanced with the business's perspective. We agree that data warehouses should deliver a flexible, scalable, integrated, granular, accurate, complete, and consistent environment; however, if it's not being leveraged by the business for decision-making, then it's impossible to declare it successful. Does the business community use the data warehouse? Is it understandable to them (including nontechnical users)? Does it answer their business questions? Is the performance acceptable from their vantage point? In our opinion, the success of any data warehouse is measured by the business's acceptance of the analytic environment and the benefits realized from it. You should choose the data warehouse architecture that best supports these success criteria, regardless of the label.
