
Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of
data in support of management's decision-making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a
particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may
have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a
product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6
months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often
only the most recent data is kept. For example, a transaction system may hold only the most recent address of a
customer, whereas a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should
never be altered.
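The time-variant and non-volatile properties can be illustrated with a minimal Python sketch. The class names TransactionSystem and Warehouse and the sample addresses are hypothetical, chosen only for illustration: the transaction system overwrites a customer's address, while the warehouse appends a dated record and never alters earlier ones.

```python
from datetime import date


class TransactionSystem:
    """Keeps only the most recent address per customer (hypothetical example)."""

    def __init__(self):
        self.address = {}

    def update(self, customer, address):
        self.address[customer] = address  # overwrites the previous value


class Warehouse:
    """Keeps every address ever loaded; existing rows are never modified."""

    def __init__(self):
        self.rows = []  # append-only list of (customer, address, load_date)

    def load(self, customer, address, load_date):
        self.rows.append((customer, address, load_date))

    def history(self, customer):
        return [(a, d) for c, a, d in self.rows if c == customer]


oltp, dw = TransactionSystem(), Warehouse()
for addr, day in [("12 Oak St", date(2023, 1, 5)), ("9 Elm Ave", date(2023, 7, 1))]:
    oltp.update("alice", addr)
    dw.load("alice", addr, day)

print(oltp.address["alice"])  # only the latest address survives
print(dw.history("alice"))    # both addresses, each with its load date
```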
1.9.1 Data Warehouse Design Process: A data warehouse can be built using a top-down approach, a bottom-up
approach, or a combination of both.
The top-down approach starts with the overall design and planning. It is useful in cases where the technology is
mature and well known, and where the business problems that must be solved are clear and well understood.

The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of business
modeling and technology development. It allows an organization to move forward at considerably less expense and
to evaluate the benefits of the technology before making significant commitments.

In the combined approach, an organization can exploit the planned and strategic nature of the top-down approach
while retaining the rapid implementation and opportunistic application of the bottom-up approach.
The warehouse design process consists of the following steps:
Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration,
sales, or the general ledger. If the business process is organizational and involves multiple complex object
collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the
analysis of one kind of business process, a data mart model should be chosen.

Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the
fact table for this process, for example, individual transactions, individual daily snapshots, and so on.

Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer,
supplier, warehouse, transaction type, and status.

Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like
dollars sold and units sold.
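The four design steps can be sketched in Python as follows. This is only an illustration under assumed names: the SalesFact class, its key values, and the helper total_by are hypothetical. The business process is sales, the grain is one record per individual transaction, the dimensions are time, item, and customer, and the measures are dollars sold and units sold.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesFact:
    # Grain: one record per individual sales transaction.
    # Dimensions (keys that would point into dimension tables):
    time_key: str
    item_key: str
    customer_key: str
    # Measures (numeric, additive):
    dollars_sold: float
    units_sold: int


facts = [
    SalesFact("2024-01-01", "item-1", "cust-1", 20.0, 2),
    SalesFact("2024-01-01", "item-2", "cust-2", 15.0, 1),
    SalesFact("2024-01-02", "item-1", "cust-2", 10.0, 1),
]


def total_by(facts, dimension):
    """Sum the additive measures for each value of one dimension."""
    totals = defaultdict(lambda: [0.0, 0])
    for f in facts:
        key = getattr(f, dimension)
        totals[key][0] += f.dollars_sold
        totals[key][1] += f.units_sold
    return {k: tuple(v) for k, v in totals.items()}


print(total_by(facts, "item_key"))
print(total_by(facts, "time_key"))
```

Because the measures are additive, the same fact records can be summed along any chosen dimension without changing the grain.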
1.9.2 A Three Tier Data Warehouse Architecture:

Tier-1: The bottom tier is a warehouse database server that is almost always a relational database system.
Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other
external sources (such as customer profile information provided by external consultants). These tools and
utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from
different sources into a unified format), as well as load and refresh functions to update the data warehouse.
The data are extracted using application program interfaces known as gateways. A gateway is
supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a
server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object
Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also
contains a metadata repository, which stores information about the data warehouse and its contents.

DEPT OF CSE & IT VSSUT, Burla

Tier-2: The middle tier is an OLAP server that is typically implemented using either a relational OLAP
(ROLAP) model or a multidimensional OLAP (MOLAP) model.
A ROLAP model is an extended relational DBMS that maps operations on multidimensional data to standard
relational operations.
A MOLAP model is a special-purpose server that directly implements multidimensional data and operations.

Tier-3: The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis, prediction, and so on).

A data mart is a simple form of a data warehouse that is focused on a single subject (or functional
area); hence it draws data from a limited number of sources such as sales, finance, or marketing.

Data marts are often built and controlled by a single department within an organization. The
sources could be internal operational systems, a central data warehouse, or external data.[8]
Denormalization is the norm for data modeling techniques in this system. Given that data marts
generally cover only a subset of the data contained in a data warehouse, they are often easier and
faster to implement.

Difference between data warehouse and data mart

Attribute | Data warehouse | Data mart
Scope of the data | enterprise-wide | department-wide
Number of subject areas | multiple | single
How difficult to build | difficult | easy
How much time it takes to build | more | less
Amount of memory | larger | limited

Types of Data Marts

There are three main types of data marts:

1. Dependent: A dependent data mart is created by drawing data from an existing central data
warehouse.
2. Independent: An independent data mart is created without the use of a central data
warehouse, drawing data directly from operational sources, external sources, or both.
3. Hybrid: A hybrid data mart can take data from data warehouses or operational
systems.

Dependent Data Mart

A dependent data mart sources an organization's data from a single data warehouse, which
offers the benefit of centralization. If you need to develop one or more physical data marts,
configure them as dependent data marts.

Dependent data marts can be built in two different ways: either a user can access both the
data mart and the data warehouse, depending on need, or access is limited to the data
mart only. The second approach is not optimal, as it produces what is sometimes referred to as a data
junkyard. In a data junkyard, all data begins with a common source, but much of it is scrapped
and mostly junked.
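As a minimal sketch of a dependent data mart, the following Python example uses the standard sqlite3 module. The table warehouse_facts, the view sales_mart, and the sample rows are hypothetical; the point is only that the mart is derived entirely from the central warehouse and covers a single subject area.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Central warehouse table (hypothetical, enterprise-wide).
    CREATE TABLE warehouse_facts (
        department TEXT, item TEXT, amount REAL
    );
    INSERT INTO warehouse_facts VALUES
        ('sales',   'laptop', 1200.0),
        ('sales',   'mouse',    25.0),
        ('finance', 'audit',   900.0);

    -- Dependent data mart: sourced only from the warehouse, one subject area.
    CREATE VIEW sales_mart AS
        SELECT item, amount FROM warehouse_facts WHERE department = 'sales';
""")

mart_rows = conn.execute(
    "SELECT item, amount FROM sales_mart ORDER BY item"
).fetchall()
print(mart_rows)
```

A view is used here for brevity; a physical data mart would typically copy the selected rows into its own store.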
Independent Data Mart

An independent data mart is created without the use of a central data warehouse. This kind of
data mart is an ideal option for smaller groups within an organization.

An independent data mart has no relationship with the enterprise data warehouse or with
any other data mart. In an independent data mart, the data is input separately, and its analyses are
also performed autonomously.

Implementation of independent data marts is antithetical to the motivation for building a data
warehouse, which is to provide a consistent, centralized store of enterprise data that can be
analyzed by multiple users with different interests who want widely varying information.
Hybrid Data Mart

A hybrid data mart combines input from a data warehouse with input from other sources. This can be
helpful when you want ad hoc integration, for example after a new group or product is added to the
organization.

It is best suited for multiple-database environments and offers fast implementation turnaround for any
organization. It also requires the least data cleansing effort. A hybrid data mart also supports large
storage structures, and its flexibility makes it well suited for smaller data-centric applications.
Data Warehousing – Schemas

A schema is a logical description of the entire database. It includes the name and description of
records of all record types, including all associated data items and aggregates. Much like a
database, a data warehouse also needs to maintain a schema. A database uses the relational model,
while a data warehouse uses the Star, Snowflake, or Fact Constellation schema. In this chapter, we
will discuss the schemas used in a data warehouse.

Star Schema
• Each dimension in a star schema is represented with only one dimension table.
• This dimension table contains the set of attributes.
• The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
• There is a fact table at the center. It contains the keys to each of the four dimensions.
• The fact table also contains the measures, namely dollars sold and units sold.

Note: Each dimension has only one dimension table, and each table holds a set of attributes. For
example, the location dimension table contains the attribute set {location_key, street, city,
province_or_state, country}. This constraint may cause data redundancy. For example, the cities
"Vancouver" and "Victoria" are both in the Canadian province of British Columbia, so
the entries for such cities repeat the same values for the attributes province_or_state and
country.
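The star schema described above can be sketched with Python's standard sqlite3 module. Only the location dimension is created in full; the other dimension keys are kept as plain integers, and all street names and sales figures are made-up sample data. The sketch shows that a single join from the fact table to a dimension table answers a query, and that the province and country values are stored once per city row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE location (
        location_key INTEGER PRIMARY KEY,
        street TEXT, city TEXT, province_or_state TEXT, country TEXT
    );
    CREATE TABLE sales (
        time_key INTEGER, item_key INTEGER, branch_key INTEGER,
        location_key INTEGER REFERENCES location(location_key),
        dollars_sold REAL, units_sold INTEGER
    );
    INSERT INTO location VALUES
        (1, '1 Main St', 'Vancouver', 'British Columbia', 'Canada'),
        (2, '2 Fort St', 'Victoria',  'British Columbia', 'Canada');
    INSERT INTO sales VALUES
        (1, 1, 1, 1, 100.0, 4),
        (1, 2, 1, 2,  50.0, 1),
        (2, 1, 1, 1,  75.0, 3);
""")

# One join from the fact table to the dimension table answers the query.
by_city = conn.execute("""
    SELECT l.city, SUM(s.dollars_sold), SUM(s.units_sold)
    FROM sales AS s JOIN location AS l USING (location_key)
    GROUP BY l.city ORDER BY l.city
""").fetchall()

# The same province and country values are repeated on every matching row.
repeats = conn.execute("""
    SELECT COUNT(*) FROM location
    WHERE province_or_state = 'British Columbia' AND country = 'Canada'
""").fetchone()[0]

print(by_city)
print(repeats)
```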

Snowflake Schema
• Some dimension tables in the snowflake schema are normalized.
• The normalization splits up the data into additional tables.
• Unlike in the star schema, the dimension tables in a snowflake schema are normalized. For
example, the item dimension table in the star schema is normalized and split into two
dimension tables, namely the item and supplier tables.
• Now the item dimension table contains the attributes item_key, item_name, type, brand,
and supplier_key.
• The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.

Note: Due to normalization in the snowflake schema, the redundancy is reduced; therefore,
it becomes easier to maintain and saves storage space.
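The item and supplier split can be sketched with Python's standard sqlite3 module. The column names follow the attributes listed above; the rows themselves are made-up sample data. Each supplier type is stored once in the supplier table, and an extra join recovers it for each item.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (
        supplier_key INTEGER PRIMARY KEY, supplier_type TEXT
    );
    CREATE TABLE item (
        item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT, brand TEXT,
        supplier_key INTEGER REFERENCES supplier(supplier_key)
    );
    INSERT INTO supplier VALUES (10, 'wholesale'), (20, 'retail');
    INSERT INTO item VALUES
        (1, 'laptop', 'computer', 'BrandA', 10),
        (2, 'tablet', 'computer', 'BrandB', 10),
        (3, 'phone',  'mobile',   'BrandA', 20);
""")

# The normalized design needs one more join than the star schema would.
rows = conn.execute("""
    SELECT i.item_name, s.supplier_type
    FROM item AS i JOIN supplier AS s ON i.supplier_key = s.supplier_key
    ORDER BY i.item_key
""").fetchall()
supplier_rows = conn.execute("SELECT COUNT(*) FROM supplier").fetchone()[0]

print(rows)
print(supplier_rows)
```

The trade-off is visible in the query: less redundancy in storage, at the cost of an additional join at query time.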

Fact Constellation Schema

• A fact constellation has multiple fact tables. It is also known as a galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.
• The sales fact table is the same as that in the star schema.
• The shipping fact table has five dimensions, namely item_key, time_key,
shipper_key, from_location, and to_location.
• The shipping fact table also contains two measures, namely dollars cost and units shipped.
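A fact constellation with the two fact tables described above can be sketched with Python's standard sqlite3 module. This is a simplified illustration: the table name time_dim and all rows are hypothetical, only the time dimension is created as a table, and the shipping measures are omitted for brevity. Both fact tables reference the same time dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day TEXT);
    CREATE TABLE sales (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER, units_sold INTEGER
    );
    CREATE TABLE shipping (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER, shipper_key INTEGER,
        from_location INTEGER, to_location INTEGER
    );
    INSERT INTO time_dim VALUES (1, '2024-01-01'), (2, '2024-01-02');
    INSERT INTO sales VALUES (1, 100, 5), (2, 100, 3);
    INSERT INTO shipping VALUES
        (1, 100, 7, 1, 2), (1, 101, 7, 1, 3), (2, 100, 8, 2, 1);
""")

# Each fact table is aggregated separately, then lined up on the shared
# time dimension; this avoids multiplying rows by joining fact to fact.
per_day = conn.execute("""
    SELECT t.day,
           (SELECT SUM(units_sold) FROM sales WHERE time_key = t.time_key),
           (SELECT COUNT(*) FROM shipping WHERE time_key = t.time_key)
    FROM time_dim AS t ORDER BY t.time_key
""").fetchall()

print(per_day)
```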

It is also possible to share dimension tables between fact tables. For example, the time and item
dimension tables can be shared between the sales and shipping fact tables.

Data Warehousing - Multidimensional OLAP

Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for
multidimensional views of data. With multidimensional data stores, the storage utilization may
be low if the dataset is sparse. Therefore, many MOLAP servers use two levels of data storage
representation to handle dense and sparse datasets.
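The idea of two storage levels can be sketched in Python. This is a toy illustration, not how any particular MOLAP server works: the class TwoLevelCube, the block size, and the density threshold are all assumptions. The cube is cut into small blocks; a block that is mostly filled is stored as a plain array, while a mostly empty block keeps only its non-empty cells in a dictionary, which stands in here for a real compression scheme.

```python
class TwoLevelCube:
    """Hypothetical 2-D cube: dense blocks as arrays, sparse blocks as dicts."""

    def __init__(self, block_size=2, dense_threshold=0.5):
        self.block_size = block_size
        self.dense_threshold = dense_threshold
        self.blocks = {}  # (block_row, block_col) -> list of lists, or dict

    def load(self, cells):
        """cells maps (row, col) -> value for the non-empty cells only."""
        b = self.block_size
        grouped = {}
        for (r, c), v in cells.items():
            grouped.setdefault((r // b, c // b), {})[(r % b, c % b)] = v
        for block_id, sparse in grouped.items():
            if len(sparse) / (b * b) >= self.dense_threshold:
                dense = [[0] * b for _ in range(b)]  # full array storage
                for (r, c), v in sparse.items():
                    dense[r][c] = v
                self.blocks[block_id] = dense
            else:
                self.blocks[block_id] = sparse  # only non-empty cells kept

    def get(self, row, col):
        b = self.block_size
        block = self.blocks.get((row // b, col // b))
        if block is None:
            return 0
        if isinstance(block, list):
            return block[row % b][col % b]
        return block.get((row % b, col % b), 0)


cube = TwoLevelCube()
cube.load({(0, 0): 5, (0, 1): 7, (1, 0): 2, (3, 3): 9})
print(type(cube.blocks[(0, 0)]).__name__, type(cube.blocks[(1, 1)]).__name__)
print(cube.get(0, 1), cube.get(3, 3), cube.get(2, 2))
```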

Points to Remember:
• MOLAP tools process information with a consistent response time, regardless of the level of
summarization or the calculations selected.
• MOLAP tools need to avoid many of the complexities of creating a relational database to
store data for analysis.
• MOLAP tools need the fastest possible performance.
• A MOLAP server adopts two levels of storage representation to handle dense and sparse data
sets.
• Denser sub-cubes are identified and stored as array structures.
• Sparse sub-cubes employ compression technology.

MOLAP Architecture
MOLAP includes the following components:

• Database server.
• MOLAP server.
• Front-end tool.

Advantages
• MOLAP allows the fastest indexing to pre-computed summarized data.
• It helps users connected to a network who need to analyze larger, less-defined data.
• It is easier to use; therefore, MOLAP is suitable for inexperienced users.

Disadvantages
• MOLAP is not capable of containing detailed data.
• The storage utilization may be low if the data set is sparse.

MOLAP vs ROLAP

Sr.No. | MOLAP | ROLAP
1 | Information retrieval is fast. | Information retrieval is comparatively slow.
2 | Uses sparse array to store data-sets. | Uses relational table.
3 | MOLAP is best suited for inexperienced users, since it is very easy to use. | ROLAP is best suited for experienced users.
4 | Maintains a separate database for data cubes. | It may not require space other than available in the Data warehouse.
5 | DBMS facility is weak. | DBMS facility is strong.
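The difference in storage can be made concrete with a small Python sketch, using the standard sqlite3 module for the relational side. The item names, city names, and unit counts are made-up sample data. The same roll-up (total units per item) is computed twice: ROLAP-style, where the operation is mapped to a relational GROUP BY, and MOLAP-style, where the facts sit in an array indexed by dimension position.

```python
import sqlite3

items = ["laptop", "phone"]
cities = ["Vancouver", "Victoria"]
facts = [
    ("laptop", "Vancouver", 3),
    ("laptop", "Victoria", 1),
    ("phone", "Vancouver", 4),
]

# ROLAP-style: facts live in a relational table; the roll-up is a GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, city TEXT, units_sold INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", facts)
rolap = dict(
    conn.execute("SELECT item, SUM(units_sold) FROM sales GROUP BY item")
)

# MOLAP-style: facts live in an array; the roll-up sums along one axis.
cube = [[0] * len(cities) for _ in items]
for item, city, units in facts:
    cube[items.index(item)][cities.index(city)] += units
molap = {item: sum(cube[i]) for i, item in enumerate(items)}

print(rolap)
print(molap)
```

Note that the array reserves a cell for every item and city combination, including the empty phone and Victoria cell, which is the storage cost of sparse data mentioned above.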

OLAP vs OLTP

Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | Useful in analyzing the business. | Useful in running the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | Based on Star Schema, Snowflake Schema and Fact Constellation Schema. | Based on Entity Relationship Model.
6 | Contains historical data. | Contains current data.
7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
8 | Provides summarized and multidimensional view of data. | Provides detailed and flat relational view of data.
9 | Number of users is in hundreds. | Number of users is in thousands.
10 | Number of records accessed is in millions. | Number of records accessed is in tens.
11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB.
12 | Highly flexible. | Provides high performance.
