
DATA WAREHOUSING - AN OVERVIEW

Outline
Introduction
View - I: Data Access Crisis
View - II: Operational Applications vs. Analytical Applications
View - III: Data Integration
View - IV: Text Book View
Overview
Conclusion

Introduction

Welcome to the course on Data Warehousing and Data Mining.

You might have noticed that when you go to a supermarket, your bill is processed by a
computerized billing system. When you go to a railway station, you book your ticket through a
computerized system, and airline tickets are likewise booked through computerized systems.
Computerization is followed in almost everything, and all of these organizations are collecting
data effortlessly; the amount of data they collect every day is very large in volume.

Take the example of a railway reservation system: once the journey has commenced, the details
of who booked which ticket on which train are moved to an archival file and, apart from
occasional managerial decisions, are rarely used again. As a result, huge volumes of
effortlessly generated data end up in archival files without being put to any proper use.

Particularly in a business environment, this kind of data can be put to use to achieve better
profitability and better performance. Keeping that in mind, a new technology has been developed
that tries to make use of historical data and bring out useful information from it; this is
what we shall be learning in this course, and it is called Data Warehousing. Data Warehousing
essentially means extracting very useful and valuable information from huge volumes of data so
that it can be used for decision-making purposes.
In this lesson we will give a general overview of the whole concept of Data Warehousing,
starting with introductory concepts and then proceeding to the detailed components of Data
Warehousing.

View - I: Data Access Crisis

We shall look at Data Warehousing from several views. The first is the Data Access Crisis.
Let's go through the details of the Data Access Crisis.

The Data Access Crisis

The single key to survival in the 1990s (and beyond) is being able to analyze, plan and react
to changing business conditions in a much more rapid fashion.

And to do this, top managers, analysts and knowledge workers in our enterprises
need more and better information.

"Data in Jail" – The Data Access Crisis

Information technology has made possible revolutions in the way organizations operate
throughout the world today.

Yet despite the availability of more and more powerful computers on everyone's desk and
communication networks that span the globe, executives and decision makers cannot get their
hands on critical information that already exists in the organization.

“Data in Jail”

Organizations, large and small, create:

–Billions of bytes of data about all aspects of their business,
–Millions of individual facts about their customers, products, operations and people.

But for the most part, this data is locked up in a myriad of computer systems and is
exceedingly difficult to get at.

This phenomenon has been described as "data in jail".

Data Poor

Only a small fraction of the data that is captured, processed and stored in the enterprise is
actually available to executives and decision makers. Technologies for the manipulation and
presentation of data have exploded, yet large segments of the enterprise are still "data poor."

Data Warehousing

Providing Data Access to the Enterprise.

A set of significant new concepts and tools has evolved.


–Providing all the key people in the enterprise with access to whatever level of information is
needed for the enterprise to survive and prosper in an increasingly competitive world.

The term that has come to characterize this new technology is "data warehousing."
–To provide organizations with flexible, effective and efficient means of getting at the sets
of data that have come to represent one of the organization's most critical and valuable
assets.

What did we learn?

So in View - I we learn that enterprise-wide information should be made available for
decision-making purposes.

View - II: Operational Applications vs. Analytical Applications

View - II evaluates Data Warehousing in the context of operational and analytical applications.

Database Applications

Let us take database applications as an example.

There are two types of database applications in a business environment:

–Operational applications
- Conventional transaction processing systems
–Analytical applications
- A more comprehensive view of the business data
- Cannot be handled by operational applications

Transaction Processing System

A business process involves a series of events whose nature and frequency may differ. We call
these events transactions.

A transaction processing system provides highly efficient and optimized execution of a large
number of atomic transactions, together with near fault-tolerant availability of data. Typical
transactions include (a small code sketch of one such atomic transaction follows this list):
–A product is manufactured
–One account is credited and another debited
–A seat is reserved
–An order is booked
–An invoice is generated
–A payment is posted
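
As a small illustration of what "atomic" means here, the sketch below uses Python's built-in
sqlite3 module with a hypothetical account table: it credits one account and debits another
inside a single transaction, so either both updates happen or neither does. This is only an
illustrative sketch, not code from the lecture.

    # Minimal sketch of an atomic debit/credit transaction (hypothetical table and accounts).
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance REAL)")
    conn.executemany("INSERT INTO account VALUES (?, ?)",
                     [("A-101", 500.0), ("B-202", 300.0)])
    conn.commit()

    def transfer(conn, from_id, to_id, amount):
        """Debit one account and credit another as one atomic unit of work."""
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, from_id))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, to_id))

    transfer(conn, "A-101", "B-202", 50.0)
    print(conn.execute("SELECT id, balance FROM account").fetchall())
    # -> [('A-101', 450.0), ('B-202', 350.0)]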

Operational Data

- Constantly being changed and updated
- Represents current information at one point of time: the current state of the system
- Very high volatility: it can change at any moment
- Can show:
  - A pending order
  - The current balance of a checking account
  - The number of items currently out of stock

Limitations

- A customer-billing system shows the structure of invoice lines and their relationship with
shipments and orders.
- It does not show the way a manager looks at the process of booking orders, picking shipments
and invoicing customers.
- These systems are optimized to carry out transactions efficiently and correctly.

What, Who Vs Why, What-If

- Managers' concerns:
  - Sales, volumes, margins
  - By product, by division, over time
- No longer What and Who
- More of Why and What-if

Analytical Applications

A class of applications:
–That support analysts and knowledge workers in their efforts to gain insight into data,
–Through fast, interactive access to a wide range of corporate information.
–They transform raw data so that it reflects the real dimensionality of the enterprise.

FASMI

FASMI stands for Fast Analysis of Shared Multidimensional Information:
- FAST
  - Most responses within 5 seconds; none should exceed 20 seconds.
- ANALYSIS
  - Time-series analyses, cost allocation, exception alerting (without having to do any
    programming).
- SHARED
- MULTIDIMENSIONAL
  - The single most important feature.
- INFORMATION

Analytical Data

- Historical data
- Static in nature
- Used to look at information over periods of time
- Usually built from operational data, though not necessarily from within the organization
- Examples: total sales in January, number of roses sold on Valentine's Day

Operational vs. Analytical data

Characteristic        Operational data          Analytical data
Content changes       Real-time                 Daily to monthly
Structural changes    Infrequent                Daily to monthly
Detail level          Transaction               Summary level
QCF                   Low                       Very high
User interface        Static, app-dependent     Dynamic, business-problem dependent
Response time         Real-time                 Real-time
Age of data           Current                   Historical
Access path           Deterministic             Non-deterministic
Users                 Rule-based                Knowledge-based

History of Analytical Process

- Before the 1960s: Manual
- 60's: Spreadsheets
  - Excellent presentation tool, but hard to enter information from other sources
- 70's: Terminal-based DSS and EIS (executive information systems)
  - Still inflexible, not integrated with desktop tools
  - An entire application is built to create a single report; extremely complicated, using very
    long stored procedures
- 80's: Desktop data access and analysis tools
  - Query tools, spreadsheets and GUIs are easier to use, but access only operational databases
- 90's: Data warehousing with integrated OLAP engines and tools

We learnt the following:

- Analytical processing, not transaction processing
- Query and browse, not simply query
- Access paths are weakly specified and non-deterministic

View - III: Data Integration

We move on to the next view of Data Warehousing, which I call Data Integration.

Types of Data Integration

We have two types of data integration:

- In-Advance Integration
- On-Demand Integration

On-Demand-Integration

This is also called the lazy model, the query-driven model, or a "virtual" system. To be
precise, given a query, on-demand integration finds the relevant information sources where the
data is available, generates a sub-query for each of those sources, integrates the results
obtained from the different sources, and returns them to the requesting application.

To elaborate, imagine a query that needs to access four or five different data sources: one may
be in Informix, another in Oracle, another in Sybase. The user issues a query; the on-demand
integration system receives the query, understands it, and generates different sub-queries, one
for Informix, another for Oracle, another for Sybase. Those sub-queries are sent to the
respective systems, the data is collected from them and integrated, and the integrated result
is sent back to the user.

The diagram explains the concept clearly. Imagine three sources: an Oracle source, an Informix
source and a Sybase source. The client issues a query; the system that receives the query forms
the sub-queries and sends the Oracle sub-query to the Oracle source, the Informix sub-query to
the Informix source and the Sybase sub-query to the Sybase source.

A wrapper is specific to the DBMS present at each source. The sub-query is sent to the source,
where it is understood and executed; the data is received back by the wrapper and passed to the
integrator, which integrates the results obtained from all the sources and finally sends the
combined result back to the client. The client therefore feels as if it is sending a single
query that can be understood by Sybase, Informix and Oracle simultaneously, while in reality
the system decomposes that query into sub-queries. This is called on-demand integration because
the generation of sub-queries, the collection of data from the different sources, the
integration of the results and their return to the client are all carried out only when the
client issues a query.
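
A minimal sketch of this query-driven flow is shown below. It is not any particular product's
API: the wrapper functions, the source data and the way sub-queries and results are represented
are all hypothetical, chosen only to show the pattern of splitting a request, sending one
sub-query per source through its wrapper, and integrating the answers.

    # Hypothetical sketch of on-demand (lazy, query-driven) integration.
    # Each "wrapper" stands in for the code that talks to one source DBMS.

    def oracle_wrapper(customer_id):
        # would issue a sub-query to the Oracle source in a real system
        return {"orders": 3}

    def informix_wrapper(customer_id):
        # would issue a sub-query to the Informix source in a real system
        return {"invoices": 2}

    def sybase_wrapper(customer_id):
        # would issue a sub-query to the Sybase source in a real system
        return {"payments": 1}

    WRAPPERS = [oracle_wrapper, informix_wrapper, sybase_wrapper]

    def integrate(customer_id):
        """Mediator: generate one sub-query per source, run it through the
        source's wrapper, and merge the partial results for the client."""
        merged = {"customer": customer_id}
        for wrapper in WRAPPERS:
            merged.update(wrapper(customer_id))
        return merged

    print(integrate("C-42"))
    # -> {'customer': 'C-42', 'orders': 3, 'invoices': 2, 'payments': 1}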

In-Advance-Integration

There is another type of integration, called in-advance integration. It is also known as the
eager model, the analysis-driven model, or a "materialized" system. In this approach, the
relevant information is extracted from the sources in advance; it is filtered, consolidated,
and stored in a (separate) database.

When the user poses a query, it is evaluated directly against this new database and the result
is returned to the client.

This is called the eager model because the system evaluates the possible set of queries in
advance, gets the data from the multiple sources, prepares the results and keeps them ready for
the user, and all of this is carried out even before the user makes any demand.

It is called analysis-driven because the possible set of analyses a user may carry out is
estimated in advance, and all the corresponding results are generated and kept.

It is also called a materialized system because all the consolidated, filtered data is
physically stored in a new database.
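
The sketch below illustrates this eager alternative under the same hypothetical sources: an
extraction step pulls and consolidates data ahead of time into one separate SQLite database,
and the user's query is then answered from that single store without contacting the sources.

    # Hypothetical sketch of in-advance (eager, materialized) integration.
    import sqlite3

    def extract_from_sources():
        # stand-in for pulling and filtering data from Oracle / Informix / Sybase in advance
        return [("C-42", "orders", 3), ("C-42", "invoices", 2), ("C-42", "payments", 1)]

    # Build the separate, consolidated database before any user query arrives.
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE customer_facts (customer TEXT, fact TEXT, value INTEGER)")
    warehouse.executemany("INSERT INTO customer_facts VALUES (?, ?, ?)", extract_from_sources())
    warehouse.commit()

    # At query time, only the materialized store is consulted.
    rows = warehouse.execute(
        "SELECT fact, value FROM customer_facts WHERE customer = ?", ("C-42",)).fetchall()
    print(dict(rows))   # -> {'orders': 3, 'invoices': 2, 'payments': 1}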

Materialized Systems

To be more precise about materialized systems, data coming from the local sources is integrated
into a single new database.

Within materialized systems, we have two different ways of storing the data:

Universal DBMS - A universal DBMS migrates data from the local systems to an object-relational
or object-oriented database system that can handle novel types of data.
Data Warehouse - A data warehouse imports data from the local sources for the purposes of OLAP
and data mining.
Let us now have a complete picture of all the different kinds of integration that are possible.
Integration can be done in two different ways:

A virtual system, which I call the lazy model or on-demand model. It is called the lazy model
because the system waits for the user to give a query, and only then evaluates that query and
fetches the data.

A materialized system, which I called the eager model or in-advance model.

The materialized system is called the eager model because, before any query is put, the system
evaluates the possible set of queries that may be sent by the user, gets the data, and makes it
ready so that queries can be answered from this system.

Within virtual systems, one category is search engines, another is multi-DBMS systems, and
another is mediated integration systems.

A materialized system can be either a universal DBMS or a Data Warehouse.

This gives you a complete picture of where the Data Warehouse lies among integration systems.

Materialized vs. Virtual Systems

You have noticed that a materialized system is basically an in-advance, or eager, system. In an
eager system, let me repeat, the idea is essentially to guess the set of queries a user would
be asking, prepare the answers for those queries, and store them.

In an on-demand system, data is picked up from the different sources only after the query is
given. Hence the materialized system is preferable when network connectivity is unreliable: if
the network connection is not available at the moment the query is put, connecting to the
different sources and getting the data would not be possible.

The response time to queries is another important aspect. In a materialized system the data is
precomputed and kept, so the moment the user gives a query, the response is returned
immediately. Whereas in the lazy model, processing starts only after the query is given, so it
takes a long time to get the answer back from the different sources.

The third point in favour of a materialized system over a virtual system is that it is cheaper
to materialize and incrementally maintain intricate relationships than to recompute them each
time they are needed. Whenever a repeated set of queries, or similar queries, is given, the
lazy model or virtual system is a very inefficient way of answering them, because the same
query is repeatedly sent to the different sub-systems to obtain the result. These are the three
main advantages of materialized systems over virtual systems.
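
To make the third point concrete, here is a small sketch of incremental maintenance; the
product names and figures are invented for the example. Rather than recomputing a total from
all source rows on every query, a materialized running total is updated only with newly arrived
rows.

    # Hypothetical sketch: incrementally maintaining a materialized aggregate
    # (a running sales total per product) instead of recomputing it each time.

    materialized_totals = {}      # the precomputed, physically stored aggregate

    def refresh(new_rows):
        """Apply only the newly arrived rows to the stored aggregate."""
        for product, amount in new_rows:
            materialized_totals[product] = materialized_totals.get(product, 0) + amount

    refresh([("rose", 120), ("tulip", 80)])   # initial load
    refresh([("rose", 30)])                   # later refresh applies only the delta
    print(materialized_totals)                # -> {'rose': 150, 'tulip': 80}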

Thus we learnt that a data warehouse is a materialized integration of data from multiple
sources.
View - IV: Text Book View

Data Warehouse: Definition

A DATA WAREHOUSE IS A SUBJECT-ORIENTED, INTEGRATED, TIME-VARIANT, NON-VOLATILE COLLECTION OF
DATA IN SUPPORT OF MANAGEMENT'S DECISION-MAKING PROCESS.

- W. H. Inmon, 1993

Data Warehouse: An Overview

Having understood what a Data Warehouse essentially is, let us have an overview of the whole
course on Data Warehousing and Data Mining.

Data Warehouse: A Subject of Study

Let us first ask ourselves what the motivation should be for understanding and studying Data
Warehousing, and for having it as a subject of study.

Realizing the importance of Data Warehousing technology in decision support systems, industries
and enterprises of all categories - small, medium and large - have considered it a matter of
pride to deploy a Data Warehousing system within the organization.

Hence it is appropriate for any IT professional to be acquainted with the cutting-edge
technology of Data Warehousing.

Overview of this course - I

We first have to design a data warehouse, so, as with a DBMS or any other business application,
we should know the techniques of data warehouse design.

It is essential for us to understand data modeling. But when you are studying Data Warehousing,
the concepts learnt in data modeling and other courses cannot be trivially extended to Data
Warehousing concepts. Data Warehousing being a new technology, its data modeling aspects are
altogether different. The core of data modeling for Data Warehousing is what we call the
multi-dimensional data model. The complete organization of a Data Warehouse, the way the data
is stored in it, can be viewed as a multi-dimensional model: it has different business
dimensions, and the core data is stored as a multi-dimensional array over these dimensions. It
is therefore essential in Data Warehouse design to first do the dimensional modeling; unless we
design the dimensions, it will not be possible for us to carry out the Data Warehouse design.

Within data modeling is the sub-topic we call dimension modeling. We also have dimension
hierarchies. Combining the concepts of dimension modeling and hierarchy, the data stored in
this multi-dimensional way is also termed a Data Cube. Here I should caution you that although
in geometry the term cube means a three-dimensional structure, in Data Warehousing a Data Cube
is a multi-dimensional structure and is not restricted to only three dimensions.

The moment you visualize the Data Warehouse as a multi-dimensional structure, we call this
model a Data Cube.
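
As a rough illustration (the dimensions and figures below are invented for the example), a tiny
Data Cube over the dimensions product, region and quarter can be held as a multi-dimensional
array; summing along one dimension rolls the cube up to a lower-dimensional view.

    # A tiny illustrative data cube: sales indexed by (product, region, quarter).
    import numpy as np

    products = ["pen", "notebook"]
    regions  = ["north", "south"]
    quarters = ["Q1", "Q2"]

    # cube[p, r, q] = units of product p sold in region r during quarter q
    cube = np.array([[[10, 12],    # pen:      north Q1/Q2
                      [ 7,  9]],   # pen:      south Q1/Q2
                     [[20, 18],    # notebook: north Q1/Q2
                      [15, 11]]])  # notebook: south Q1/Q2

    # Rolling up over the 'region' dimension leaves a product x quarter view.
    print(cube.sum(axis=1))
    # -> [[17 21]
    #     [35 29]]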

Overview of this course - II

Let us recall that in our DBMS courses we learnt about different schemas. In the same manner,
in Data Warehouse data modeling we have different Data Warehouse schemas: the Star Schema, the
Snowflake Schema, and the Constellation. The most popular among these is the Star Schema.

The essential feature of the Star Schema is that dimensional modeling is carried out in it:
each dimension is represented as a dimension table, which is nothing but a relational table. In
the Snowflake Schema, however, the dimension tables are normalized in the same way as you have
learnt in a DBMS course, and hence a single dimension table is decomposed into sub-tables. That
is why it is called a Snowflake Schema.

A Constellation is a combination of multiple star schemas sharing the same set of dimension
tables.
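
A very small sketch of a star schema follows: one fact table surrounded by two dimension
tables. The table layouts and the sample rows are invented for the illustration; the point is
simply that a star-schema query joins the fact table with its dimension tables and aggregates a
measure.

    # Illustrative star schema: a central fact table plus dimension tables (hypothetical data).
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, month TEXT, year INTEGER);
        CREATE TABLE fact_sales  (product_id INTEGER, time_id INTEGER, revenue REAL);
    """)
    db.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                   [(1, "rose", "flowers"), (2, "pen", "stationery")])
    db.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                   [(1, "Jan", 2024), (2, "Feb", 2024)])
    db.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                   [(1, 1, 250.0), (1, 2, 100.0), (2, 1, 150.0)])

    # A typical star-schema query: join fact and dimension tables, then aggregate.
    for row in db.execute("""
            SELECT p.category, t.year, SUM(f.revenue)
            FROM fact_sales f
            JOIN dim_product p ON f.product_id = p.product_id
            JOIN dim_time    t ON f.time_id    = t.time_id
            GROUP BY p.category, t.year"""):
        print(row)   # ('flowers', 2024, 350.0) and ('stationery', 2024, 150.0)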

Another major aspect of Data Warehousing technology is the OLAP engine. The OLAP engine
provides all the different kinds of analytical processing that can be carried out on a
multi-dimensional data model, in other words on a Data Cube. There are different kinds of OLAP
engines: one is called MOLAP, another ROLAP, and in fact another is a combination of MOLAP and
ROLAP, which we call HOLAP (Hybrid OLAP). To distinguish between MOLAP and ROLAP: MOLAP is the
kind of OLAP engine that treats the data as a multi-dimensional data cube; it describes the
data in terms of dimensions and views it as a multi-dimensional array. ROLAP, with R standing
for relational, assumes that the Data Cube is physically stored as relational tables.

Hence, even though ROLAP views the Data Cube in a multi-dimensional fashion, the physical
storage makes use of a relational DBMS and relational table concepts. Naturally, HOLAP combines
the advantages of MOLAP and ROLAP.
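
To picture the difference, the short sketch below holds the same made-up figures twice: once as
a multi-dimensional array (the MOLAP view) and once as relational rows aggregated at query time
(the ROLAP view). Both give the same totals; only the physical representation differs.

    # The same tiny sales cube seen in the MOLAP way and in the ROLAP way.
    import numpy as np

    # MOLAP view: a multi-dimensional array indexed by (product, region).
    molap_cube = np.array([[10, 7],     # pen:      north, south
                           [20, 15]])   # notebook: north, south
    print(molap_cube.sum(axis=1))       # totals per product -> [25 35]

    # ROLAP view: the same figures stored as relational rows (product, region, units),
    # aggregated with a relational-style group-by at query time.
    rows = [("pen", "north", 10), ("pen", "south", 7),
            ("notebook", "north", 20), ("notebook", "south", 15)]
    totals = {}
    for product, region, units in rows:
        totals[product] = totals.get(product, 0) + units
    print(totals)                        # -> {'pen': 25, 'notebook': 35}
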
Overview of this course - III

We shall also try to understand different types of Data Warehouses in this course. We can have
a Centralized Data Warehouse, also called an enterprise-wide Data Warehouse. We can have a
Virtual Data Warehouse, which, exactly as the name suggests, is not materialized; recall our
discussion of Materialized vs. Virtual systems. A Virtual Warehouse is essentially a Data
Warehouse that is never physically stored. And we can also have Data Marts.

To distinguish between a Centralized Data Warehouse and a Data Mart: a Data Mart arises when
the Data Warehousing requirement of the whole enterprise is too big to be handled at once. So
for each department, perhaps even for each individual department and individual manager, a
smaller Data Warehouse is designed that caters to his or her requirements, and that is what we
term a Data Mart. The popular concept of a Data Mart thus essentially refers to a Data
Warehouse on a smaller scale.

But again, please remember that the concept of a Data Mart is different from the concept of
Data Marting. I would also like to point out that the concept of a Data Warehouse, as
explained, refers to the warehouse itself, whereas the concept of Data Warehousing covers all
the technology involved in creating, maintaining and analyzing a Data Warehouse.

In a similar fashion, Data Marting is the process by which we decide on, maintain and manage
the different Data Marts within an organization, where it is possible to have one Centralized
Data Warehouse while simultaneously allowing certain Data Marts.

Having understood the Data Warehousing concepts that form the basic core of a Data Warehouse,
the main question now is how we populate data into a Data Warehouse. One of the major aspects
of a Data Warehouse is that it contains data that is essentially read-only: users do not change
or update the data in a Data Warehouse. Bringing new data into a Data Warehouse is normally
called data loading or data refreshing.

Since the Data Warehouse is basically an integrated system, it obtains data from different
sources. Data loading is therefore an integration process and requires a separate sub-system.
We call that sub-system an ETL system; ETL stands for Extraction, Transformation and Loading.
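
A minimal sketch of the three ETL steps is shown below; the record layout, the cleaning rules
and the table name are invented purely to show the shape of the process.

    # Hypothetical sketch of Extract -> Transform -> Load into a warehouse table.
    import sqlite3

    def extract():
        # stand-in for reading records from operational sources (files, OLTP databases, ...)
        return [{"product": " Rose ", "units": "100", "date": "2024-02-14"},
                {"product": "pen",    "units": "300", "date": "2024-02-14"}]

    def transform(records):
        # clean and conform the data: trim and lower-case names, cast numeric fields
        return [(r["product"].strip().lower(), int(r["units"]), r["date"])
                for r in records]

    def load(rows, warehouse):
        # append the refreshed data into the warehouse fact table
        warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
        warehouse.commit()

    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE fact_sales (product TEXT, units INTEGER, sale_date TEXT)")
    load(transform(extract()), warehouse)
    print(warehouse.execute("SELECT * FROM fact_sales").fetchall())
    # -> [('rose', 100, '2024-02-14'), ('pen', 300, '2024-02-14')]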

Overview of this course - IV

So any Data Warehouse system must have an ETL process as its front end. We can thus describe
the architecture of a Data Warehouse as follows:

• First, an ETL process feeds the Data Warehouse server,
• The next level is the OLAP engine,
• The final level contains Data Mining, Visualization and Report Generation tools.

The ETL process is not counted as part of the three-tier architecture, in the sense that it is
a front end to the Data Warehouse. So tier-I is the Data Warehouse server, tier-II is the OLAP
engine, and tier-III is where the user interacts with the whole Data Warehouse through the Data
Mining, Visualization and Reporting systems.

Essentially, I have outlined all major aspects of the Data Warehousing process.

Conclusion

We have seen that Data Warehousing is an essential concept, required for integrating data, for
easy access to data for managerial decisions, for analytical processing, and for providing the
data required by knowledge workers. We have also introduced different components of Data
Warehousing such as Data Marting, OLAP operations and dimensional modeling; in the subsequent
lessons each of these concepts will be elaborated in detail.
