
DATA WAREHOUSE CONCEPTS

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management's decision-making process.
Data warehousing is a collection of methods, techniques, and tools used to support knowledge workers (senior managers, directors, managers, and analysts) in conducting data analyses that help with performing decision-making processes and improving information resources.
A data warehouse is a collection of data that supports decision-making processes. A data warehouse is a structured repository of historic data. It is developed in an evolutionary process by integrating data from non-integrated legacy systems.
Data warehousing as a technological method aims to provide technical support to companies with their data management needs, an important aspect of each company's success.
Data warehousing is a good investment and asset for the company, especially since it sustains the company's efficiency, productivity, profitability and competitive performance. An organization collects various data from different areas of the company, including inventory needs, sales leads, customer service, etc., and makes them more manageable. These data are then passed through the data management system needed for the company's policy-making measures.
The data warehouse is that portion of an overall Architected Data Environment that
serves as the single integrated source of data for processing information. The data
warehouse has specific characteristics that include the following:
Subject-Oriented: Information is presented according to specific subjects or areas of interest, not simply as computer files. Data is manipulated to provide information about a particular subject. For example, the SRDB is not simply made accessible to end-users, but is given structure and organized according to specific needs.
Integrated: A single source of information for and about understanding multiple
areas of interest. The data warehouse provides one-stop shopping and contains
information about a variety of subjects. Thus the OIRAP data warehouse has
information on students, faculty and staff, instructional workload, and
student outcomes.
Non-Volatile: Stable information that doesn't change each time an operational process is executed. Information is consistent regardless of when the warehouse is accessed.
Time-Variant: Containing a history of the subject, as well as current information.
Historical information is an important component of a data warehouse.
Accessible: The primary purpose of a data warehouse is to provide readily
accessible information to end-users.
Process-Oriented: It is important to view data warehousing as a process for delivery
of information. The maintenance of a data warehouse is ongoing and iterative in
nature.
Note: A data warehouse does not require transaction processing, recovery and concurrency control, because it is physically stored separately from the operational database.

OUR GOALS FOR A DATA WAREHOUSE


Collect Data - Scrub, Integrate & Make It Accessible
Provide Information - For Our Businesses
Start Managing Knowledge
So Our Business Partners Will Gain Wisdom!

UNDERSTANDING DATA WAREHOUSE

The data warehouse is a database that is kept separate from the organization's operational database.
The data warehouse is not updated frequently.
A data warehouse contains consolidated historical data, which helps the organization to analyse its business.
A data warehouse helps executives to organize, understand and use their data to make strategic decisions.
Data warehouse systems help in the integration of a diversity of application systems.
A data warehouse system allows analysis of consolidated historical data.

DATA WAREHOUSE APPLICATIONS


A data warehouse helps business executives to organize, analyse and use their data for decision making. A data warehouse serves as a core part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:

Financial services
Banking Services
Consumer goods
Retail sectors.
Controlled manufacturing

DATA WAREHOUSE TYPES


Information processing, Analytical processing and Data Mining are the three types
of data warehouse applications that are discussed below:

Information processing - A data warehouse allows us to process the information stored in it. The information can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

Analytical processing - A data warehouse supports analytical processing of the information stored in it. The data can be analysed by means of basic OLAP operations, including slice-and-dice, drill-down, drill-up, and pivoting.
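To make these OLAP operations concrete, the sketch below implements slice and roll-up (drill-up) over a toy fact set in plain Python; the data, dimension names and figures are invented for illustration.

```python
from collections import defaultdict

# Toy fact data: each record is (year, region, product, sales_amount).
facts = [
    (2022, "North", "TV",    100),
    (2022, "South", "TV",     80),
    (2022, "North", "Radio",  40),
    (2023, "North", "TV",    120),
    (2023, "South", "Radio",  60),
]

def slice_by(records, year):
    """Slice: fix one dimension (year) and keep the rest."""
    return [r for r in records if r[0] == year]

def roll_up(records):
    """Roll-up (drill-up): aggregate away the product dimension,
    summarising sales per (year, region)."""
    totals = defaultdict(int)
    for year, region, _product, amount in records:
        totals[(year, region)] += amount
    return dict(totals)

print(roll_up(slice_by(facts, 2022)))
# {(2022, 'North'): 140, (2022, 'South'): 80}
```

Drilling down is the reverse direction: starting from the (year, region) totals, the analyst would re-introduce the product dimension to see the detail rows behind each total.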

Data mining - Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. The mining results can be presented using visualization tools.

DATA WAREHOUSE TOOLS AND UTILITIES FUNCTIONS


The following are the functions of Data Warehouse tools and Utilities:
Data Extraction - Data Extraction involves gathering the data from multiple
heterogeneous sources.
Data Cleaning - Data Cleaning involves finding and correcting the errors in data.
Data Transformation - Data Transformation involves converting data from legacy
format to warehouse format.
Data Loading - Data loading involves sorting, summarizing, consolidating, checking
integrity and building indices and partitions.
Refreshing - Refreshing involves updating the data from the sources to the warehouse.
Note: Data Cleaning and Data Transformation are important steps in improving the
quality of data and data mining results.
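The functions above can be sketched as a minimal pipeline. This is an illustrative toy, not a real tool: the source rows, field names and the cleaning/transformation rules are all invented.

```python
# Legacy-format source rows: string values, DD/MM/YYYY dates, one bad row.
raw = [
    {"id": "1", "amount": " 250 ", "date": "01/05/2023"},
    {"id": "2", "amount": "bad",   "date": "02/05/2023"},
    {"id": "3", "amount": " 100 ", "date": "03/05/2023"},
]

def clean(rows):
    """Data cleaning: drop rows whose amount is not numeric."""
    return [r for r in rows if r["amount"].strip().isdigit()]

def transform(rows):
    """Data transformation: convert legacy formats (strings,
    DD/MM/YYYY dates) to the warehouse format (ints, ISO dates)."""
    out = []
    for r in rows:
        d, m, y = r["date"].split("/")
        out.append({"id": int(r["id"]),
                    "amount": int(r["amount"].strip()),
                    "date": f"{y}-{m}-{d}"})
    return out

def load(rows):
    """Data loading: sort and summarise before storage."""
    rows = sorted(rows, key=lambda r: r["date"])
    total = sum(r["amount"] for r in rows)
    return rows, total

rows, total = load(transform(clean(raw)))
print(total)   # 350: the non-numeric row was rejected during cleaning
```

Note how cleaning runs before transformation, matching the point above that both steps improve the quality of the data that finally reaches the warehouse.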

DATA MINING DEFINITION


Data mining is the process of extracting previously unknown but significant
information from large databases and using it to make crucial business decisions.
Data mining transforms the data into information and tends to be bottom-up.

DATA MINING PROCESS


1. The data extraction process extracts useful subsets of data for mining.
2. Aggregation may be done if summary statistics are useful.
3. Initial searches should be carried out on aggregated data to develop a bird's-eye view of the information (extracted information).
4. Focus on the detailed data provides a clearer view (assimilated information).
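Steps 3 and 4 can be sketched as follows: first summarise to get the bird's-eye view, then drill into the detail rows behind whatever the summary flags. The stores and amounts are invented for illustration.

```python
from collections import defaultdict

# Invented transaction detail: (store, amount).
detail = [("A", 10), ("A", 12), ("B", 11), ("B", 95), ("B", 9)]

# Step 3: bird's-eye view - aggregate per store first.
totals = defaultdict(int)
for store, amount in detail:
    totals[store] += amount
print(dict(totals))          # {'A': 22, 'B': 115}

# Step 4: the aggregate flags store B as unusually large;
# drill into its detail rows for the clearer view.
suspect = max(totals, key=totals.get)
print([amt for s, amt in detail if s == suspect])   # [11, 95, 9]
```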

OPERATIONAL VERSUS INFORMATIONAL SYSTEMS

  Operational System                        | Informational System
1 Supports day-to-day decisions             | Supports long-term, strategic decisions
2 Transaction driven                        | Analysis driven
3 Data constantly changes                   | Data rarely changes
4 Repetitive processing                     | Heuristic processing
5 Holds current data                        | Holds historical data
6 Stores detailed data                      | Stores summarized and detailed data
7 Application oriented                      | Subject oriented
8 Predictable pattern of usage              | Unpredictable pattern of usage
9 Serves clerical, transactional community  | Serves managerial community

METADATA
It is data about data. It is used as:

A directory to locate the contents of the data warehouse.

A guide to the mapping of data as the data is transformed from the operational environment to the data warehouse environment.

A guide to the algorithms used for summarization between the current data and the summarized data.

It also contains information about

Structure of the data

Data extraction/transformation history

Data usage statistics

Data warehouse table sizes

Column sizes

Attribute hierarchies and dimensions

Performance metrics

Operational versus Data Warehouse Systems

Feature                | Operational                                   | Data Warehouse
Data content           | current values                                | archival data, summarized data, calculated data
Data organization      | application by application                    | subject areas across enterprise
Nature of data         | dynamic                                       | static until refreshed
Data structure, format | complex; suitable for operational computation | simple; suitable for business analysis
Access probability     | high                                          | moderate to low
Data update            | updated on a field-by-field basis             | accessed and manipulated; no direct update
Usage                  | highly structured, repetitive processing      | highly unstructured, analytical processing
Response time          | sub-second to 2-3 seconds                     | seconds to minutes

DATA WAREHOUSE DESIGN APPROACHES


Data warehouse design is one of the key techniques in building the data warehouse. Choosing the right data warehouse design can save the project time and cost. Basically, two data warehouse design approaches are popular.
BOTTOM-UP DESIGN:
In the bottom-up design approach, the data marts are created first to provide reporting capability. A data mart addresses a single business area such as sales, finance, etc. These data marts are then integrated to build a complete data warehouse. The integration of data marts is implemented using a data warehouse bus architecture. In the bus architecture, a dimension is shared between facts in two or more data marts; such dimensions are called conformed dimensions. The conformed dimensions are integrated across the data marts, and then the data warehouse is built.

ADVANTAGES OF BOTTOM-UP DESIGN ARE:

This model contains consistent data marts, and these data marts can be delivered quickly.

As the data marts are created first, reports can be generated quickly.

The data warehouse can be extended easily to accommodate new business units: it is just a matter of creating new data marts and integrating them with the other data marts.

DISADVANTAGES OF BOTTOM-UP DESIGN ARE:


The positions of the data warehouse and the data marts are reversed in the bottom-up design approach.

TOP-DOWN DESIGN:
In the top-down design approach, the data warehouse is built first. The data marts are then created from the data warehouse.

ADVANTAGES OF TOP-DOWN DESIGN ARE:


Provides consistent dimensional views of data across data marts, as all data
marts are loaded from the data warehouse.

This approach is robust against business changes. Creating a new data mart
from the data warehouse is very easy.

DISADVANTAGES OF TOP-DOWN DESIGN ARE:


This methodology is inflexible to changing departmental needs during the implementation phase.
It represents a very large project, and the cost of implementing the project is significant.

DATA WAREHOUSE ARCHITECTURE


Three-Tier Data Warehouse Architecture
Data warehouses generally adopt a three-tier architecture. Following are the three tiers of the data warehouse architecture.

Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is a relational database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the extract, clean, load, and refresh functions.

Middle Tier - In the middle tier we have the OLAP server. The OLAP server can be implemented in either of the following ways:

Relational OLAP (ROLAP) - an extended relational database management system that maps operations on multidimensional data to standard relational operations.

Multidimensional OLAP (MOLAP) - a model that directly implements multidimensional data and operations.

Top Tier - This tier is the front-end client layer. This layer holds the query tools, reporting tools, analysis tools and data mining tools.
The following diagram explains the three-tier architecture of the data warehouse:

DATA WAREHOUSE MODELS


From the perspective of data warehouse architecture we have the following data
warehouse models:

Virtual Warehouse

Data mart

Enterprise Warehouse
VIRTUAL WAREHOUSE

The view over an operational data warehouse is known as a virtual warehouse.

It is easy to build a virtual warehouse.

Building a virtual warehouse requires excess capacity on operational database servers.

DATA MART

A data mart contains a subset of organisation-wide data.

This subset of data is valuable to a specific group of an organisation.

Note: In other words, we can say that a data mart contains only the data which is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers and sales. Data marts are confined to subjects.
Points to remember about data marts:

Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.

The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years.

The life cycle of a data mart may be complex in the long run if its planning and design are not organisation-wide.

Data marts are small in size.

Data marts are customized by department.

The source of a data mart is a departmentally structured data warehouse.

Data marts are flexible.


ENTERPRISE WAREHOUSE

The enterprise warehouse collects all of the information about all the subjects spanning the entire organization.

It provides enterprise-wide data integration.

The data is integrated from operational systems and external information providers.

This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
LOAD MANAGER

This component performs the operations required for the extract and load process.

The size and complexity of the load manager varies between specific solutions, from one data warehouse to another.

LOAD MANAGER ARCHITECTURE


The load manager performs the following functions:

Extract the data from the source system.

Fast-load the extracted data into a temporary data store.

Perform simple transformations into a structure similar to the one in the data warehouse.

EXTRACT DATA FROM SOURCE

The data is extracted from the operational databases or the external information providers. A gateway is the application program that is used to extract data. It is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server.
FAST LOAD

In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.

Transformations affect the speed of data processing.

It is more effective to load the data into a relational database prior to applying transformations and checks.

Gateway technology proves not to be suitable, since gateways tend not to be performant when large data volumes are involved.

SIMPLE TRANSFORMATIONS
While loading, it may be required to perform simple transformations. After this has been completed, we are in a position to do the complex checks. Suppose we are loading EPOS sales transactions; we need to perform the following checks:

Strip out all the columns that are not required within the warehouse.

Convert all the values to the required data types.
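These two checks can be sketched in a few lines. The EPOS row, the column names and the type rules below are all hypothetical, invented purely to illustrate the idea.

```python
# Hypothetical EPOS source row; values arrive as strings and only
# some columns are needed in the warehouse.
source_row = {"txn_id": "9001", "store": "S1", "qty": "3",
              "price": "4.50", "till_operator": "op7", "debug_flags": "x"}

KEEP = {"txn_id", "store", "qty", "price"}           # strip everything else
TYPES = {"txn_id": int, "store": str, "qty": int, "price": float}

def simple_transform(row):
    """Strip unneeded columns and convert values to the required types."""
    return {k: TYPES[k](v) for k, v in row.items() if k in KEEP}

print(simple_transform(source_row))
# {'txn_id': 9001, 'store': 'S1', 'qty': 3, 'price': 4.5}
```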


WAREHOUSE MANAGER

The warehouse manager is responsible for the warehouse management process.

The warehouse manager consists of third-party system software, C programs and shell scripts.

The size and complexity of the warehouse manager varies between specific solutions.

WAREHOUSE MANAGER ARCHITECTURE


The warehouse manager includes the following:

The controlling process

Stored procedures or C with SQL

Backup/recovery tool

SQL scripts

OPERATIONS PERFORMED BY WAREHOUSE MANAGER


The warehouse manager analyses the data to perform consistency and referential integrity checks.

Creates indexes, business views and partition views against the base data.

Generates new aggregations and updates the existing aggregations.

Generates the normalizations.

Transforms and merges the source data from the temporary store into the published data warehouse.

Backs up the data in the data warehouse.

Archives the data that has reached the end of its captured life.

Note: The warehouse manager also analyses query profiles to determine whether indexes and aggregations are appropriate.

QUERY MANAGER

The query manager is responsible for directing the queries to the suitable tables.

By directing the queries to the appropriate tables, the query request and response process is sped up.

The query manager is responsible for scheduling the execution of the queries posed by the user.

QUERY MANAGER ARCHITECTURE


Query Manager includes the following:

The query redirection via C tool or RDBMS.

Stored procedures.

Query Management tool.

Query Scheduling via C tool or RDBMS.


DETAILED INFORMATION
The following diagram shows the detailed information.

The detailed information is not kept online; rather, it is aggregated to the next level of detail and then archived to tape. The detailed-information part of the data warehouse keeps the detailed information in the starflake schema. The detailed information is loaded into the data warehouse to supplement the aggregated data.
Note: If the detailed information is held offline to minimize disk storage, we should make sure that the data has been extracted, cleaned up, and transformed into the starflake schema before it is archived.
In general, all data warehouse systems have the following layers:
Data Source Layer
Data Extraction Layer
Staging Area
ETL Layer
Data Storage Layer
Data Logic Layer
Data Presentation Layer
Metadata Layer
System Operations Layer

The picture below shows the relationships among the different components of the
data warehouse architecture:

Each component is discussed individually below:


Data Source Layer
This represents the different data sources that feed data into the data warehouse. The
data source can be of any format -- plain text file, relational database, other types of
database, Excel file, etc., can all act as a data source.
Many different types of data can be a data source:

Operations -- such as sales data, HR data, product data, inventory data,


marketing data, systems data.
Web server logs with user browsing data.
Internal market research data.
Third-party data, such as census data, demographics data, or survey data.

All these data sources together form the Data Source Layer.
Data Extraction Layer
Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing, but unlikely any major data transformation.

Staging Area
This is where data sits prior to being scrubbed and transformed into a data
warehouse / data mart. Having one common area makes it easier for subsequent
data processing / integration.
ETL Layer
This is where data gains its "intelligence", as logic is applied to transform the data
from a transactional nature to an analytical nature. This layer is also where data
cleansing happens. The ETL design phase is often the most time-consuming phase in
a data warehousing project, and an ETL tool is often used in this layer.
Data Storage Layer
This is where the transformed and cleansed data sit. Based on scope and
functionality, 3 types of entities can be found here: data warehouse, data mart, and
operational data store (ODS). In any given system, you may have just one of the
three, two of the three, or all three types.
Data Logic Layer
This is where business rules are stored. Business rules stored here do not affect the
underlying data transformation rules, but do affect what the report looks like.
Data Presentation Layer
This refers to the information that reaches the users. This can be in a form of a
tabular / graphical report in a browser, an emailed report that gets automatically
generated and sent every day, or an alert that warns users of exceptions, among
others. Usually a tool and/or a reporting tool are used in this layer.
Metadata Layer
This is where information about the data stored in the data warehouse system is stored. A logical data model would be an example of something that's in the metadata layer. A metadata tool is often used to manage metadata.
System Operations Layer

This layer includes information on how the data warehouse system operates, such as
ETL job status, system performance, and user access history.

OTHER DEFINITIONS
Data Warehouse: A data structure that is optimized for distribution. It collects and
stores integrated sets of historical data from multiple operational systems and feeds
them to one or more data marts. It may also provide end-user access to support
enterprise views of data.
Data Mart: A data structure that is optimized for access. It is designed to facilitate
end-user analysis of data. It typically supports a single, analytic application used by
a distinct set of workers.
Staging Area: Any data store that is designed primarily to receive data into a
warehousing environment.
Operational Data Store: A collection of data that addresses operational needs of
various operational units. It is not a component of a data warehousing architecture,
but a solution to operational needs.
OLAP (On-Line Analytical Processing): A method by which multidimensional
analysis occurs.
Multidimensional Analysis: The ability to manipulate information by a variety of
relevant categories or dimensions to facilitate analysis and understanding of the
underlying data. It is also sometimes referred to as drilling-down, drilling-across
and slicing and dicing
Star Schema: A means of aggregating data based on a set of known dimensions. It
stores data multi-dimensionally in a two dimensional Relational Database
Management System (RDBMS), such as Oracle.
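As a concrete sketch of a star schema in a two-dimensional RDBMS, the snippet below builds one fact table joined to two dimension tables in SQLite (Python's stdlib `sqlite3`) and aggregates over the known dimensions. All table names, column names and figures are invented for illustration.

```python
import sqlite3

# Minimal star schema: one fact table referencing two dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);
""")
db.executemany("INSERT INTO dim_product VALUES (?, ?)",
               [(1, "TV"), (2, "Radio")])
db.executemany("INSERT INTO dim_date VALUES (?, ?)",
               [(10, 2022), (11, 2023)])
db.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
               [(1, 10, 100.0), (1, 11, 120.0), (2, 11, 60.0)])

# Aggregating the facts over the known dimensions:
rows = db.execute("""
    SELECT p.name, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id    = f.date_id
    GROUP BY p.name, d.year
    ORDER BY p.name, d.year
""").fetchall()
print(rows)
# [('Radio', 2023, 60.0), ('TV', 2022, 100.0), ('TV', 2023, 120.0)]
```

The snowflake variant described next would normalize the dimension tables further, e.g. splitting a product dimension into product and category tables.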
Snowflake Schema: An extension of the star schema by means of applying
additional dimensions to the dimensions of a star schema in a relational
environment.
Multidimensional Database: Also known as MDDB or MDDBS. A class of
proprietary, non-relational database management tools that store and manage data

in a multidimensional manner, as opposed to the two dimensions associated with


traditional relational database management systems.
OLAP Tools: A set of software products that attempt to facilitate multidimensional
analysis. Can incorporate data acquisition, data access, data manipulation, or any
combination thereof.

METADATA REPOSITORY
The metadata repository is an integral part of a data warehouse system. The metadata repository contains the following metadata:
Business Metadata - This metadata has the data ownership information, business
definition and changing policies.
Operational Metadata -This metadata includes currency of data and data lineage.
Currency of data means whether data is active, archived or purged. Lineage of data
means history of data migrated and transformation applied on it.
Data for mapping from operational environment to data warehouse -This metadata
includes source databases and their contents, data extraction, data partition,
cleaning, transformation rules, data refresh and purging rules.
The algorithms for summarization - This includes dimension algorithms, data on
granularity, aggregation, summarizing etc.

DATA MART
A data mart is a subject-oriented archive that stores data and uses the retrieved set of information to assist and support the requirements involved within a particular business function or department. Data marts exist within a single organizational data warehouse repository.
A data mart is a repository of data, gathered from operational data and other sources, that is designed to serve a particular community of knowledge workers.

Data marts improve end-user response time by allowing users to have access to the
specific type of data they need to view most often by providing the data in a way
that supports the collective view of a group of users.
Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data. In terms of a data warehouse, we can define metadata as follows:

Metadata is a road map to the data warehouse.

Metadata in a data warehouse defines the warehouse objects.

Metadata acts as a directory. This directory helps the decision support system to locate the contents of the data warehouse.

Note: In a data warehouse we create metadata for the data names and definitions of a given data warehouse. Along with this, additional metadata is also created for time-stamping any extracted data and recording the source of the extracted data.
Categories of Metadata

The metadata can be broadly categorized into three categories:

Business Metadata - This metadata has the data ownership information, business
definition and changing policies.
Technical Metadata - Technical metadata includes database system names, table
and column names and sizes, data types and allowed values. Technical metadata
also includes structural information such as primary and foreign key attributes
and indices.
Operational Metadata - This metadata includes currency of data and data
lineage. Currency of data means whether data is active, archived or purged.
Lineage of data means history of data migrated and transformation applied on it.
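The three categories above can be illustrated as a metadata record for a single warehouse table. Every field name and value below is invented; this is only a sketch of what each category might capture.

```python
# Illustrative metadata record for one (hypothetical) warehouse table.
metadata = {
    "business": {                        # ownership and business definition
        "owner": "Sales department",
        "definition": "Completed point-of-sale transactions",
    },
    "technical": {                       # names, types, structural info
        "table": "fact_sales",
        "columns": {"amount": "DECIMAL(10,2)", "sale_date": "DATE"},
        "primary_key": ["txn_id"],
    },
    "operational": {                     # currency and lineage
        "currency": "active",            # active / archived / purged
        "lineage": ["extracted from EPOS", "cleaned", "loaded 2023-05-01"],
    },
}

def currency(meta):
    """Operational metadata tells us whether the data is still active."""
    return meta["operational"]["currency"]

print(currency(metadata))   # active
```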

ROLE OF METADATA
Metadata plays a very important role in a data warehouse. The role of metadata in the warehouse is different from that of the warehouse data, yet it is equally important. The various roles of metadata are explained below.

The metadata act as a directory.

This directory helps the decision support system to locate the contents of data
warehouse.

Metadata helps in decision support system for mapping of data when data are
transformed from operational environment to data warehouse environment.

Metadata helps in summarization between current detailed data and highly


summarized data.

Metadata also helps in summarization between lightly detailed data and


highly summarized data.

Metadata are also used for query tools.

Metadata are used in reporting tools.

Metadata are used in extraction and cleansing tools.

Metadata are used in transformation tools.

Metadata also plays important role in loading functions.

DIAGRAM TO UNDERSTAND ROLE OF METADATA.

WHY TO CREATE DATA MART

The following are the reasons to create data mart:

To partition data in order to impose access control strategies.

To speed up the queries by reducing the volume of data to be scanned.

To segment data into different hardware platforms.

To structure data in a form suitable for a user access tool.

Note: Do not create a data mart for any other reason, since the operational cost of data marting could be very high. Before data marting, make sure that the data marting strategy is appropriate for your particular solution.
Steps to determine whether a data mart appears to fit the bill
The following steps need to be followed to make data marting cost-effective:

Identify the Functional Splits

Identify User Access Tool Requirements

Identify Access Control Issues


DATA WAREHOUSE V/S DATA MART

DATA WAREHOUSE:

Holds multiple subject areas

Holds very detailed information

Works to integrate all data sources

Does not necessarily use a dimensional model but feeds dimensional models.

DATA MART:

Often holds only one subject area- for example, Finance, or Sales

May hold more summarized data (although many hold full detail)

Concentrates on integrating information from a given subject area or set of


source systems
Is built focused on a dimensional model using a star schema.

REASONS FOR CREATING A DATA MART

Easy access to frequently needed data

Creates collective view by a group of users

Improves end-user response time

Ease of creation

Lower cost than implementing a full data warehouse

Potential users are more clearly defined than in a full data warehouse

Contains only business essential data and is less cluttered.

DECISION SUPPORT SYSTEM (DSS)


Decision support systems are interactive software-based systems intended to help managers in decision making by accessing large volumes of information generated from various related information systems involved in organizational business processes, such as office automation systems, transaction processing systems, etc.

DSS uses the summary information, exceptions, patterns and trends using the
analytical models. Decision Support System helps in decision making but does not
always give a decision itself. The decision makers compile useful information from
raw data, documents, personal knowledge, and/or business models to identify and
solve problems and make decisions.
Programmed and Non-programmed Decisions
There are two types of decisions - programmed and non-programmed decisions.
Programmed decisions are basically automated processes, general routine work,
where:

These decisions have been taken several times

These decisions follow some guidelines or rules


For example, selecting a reorder level for inventories is a programmed decision.
Non-programmed decisions occur in unusual and non-addressed situations, so:

It would be a new decision

There will not be any rules to follow

These decisions are made based on available information

These decisions are based on the manager's discretion, instinct, perception and judgment.
For example, investing in a new technology is a non-programmed decision.
Decision support systems generally involve non-programmed decisions. Therefore,
there will be no exact report, content or format for these systems. Reports are
generated on the fly.
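The contrast can be made concrete with the inventory example above: a programmed decision is a fixed rule that runs automatically whenever stock changes. The thresholds below are invented for illustration.

```python
# Sketch of a programmed decision: the same rule, applied every time.
REORDER_LEVEL = 20          # when stock falls to this level, reorder
REORDER_QUANTITY = 50       # fixed order size

def reorder_decision(stock_on_hand):
    """Programmed decision: a simple, repeatable rule needing no judgment."""
    if stock_on_hand <= REORDER_LEVEL:
        return REORDER_QUANTITY     # place an order of fixed size
    return 0                        # no action needed

print(reorder_decision(15))   # 50
print(reorder_decision(35))   # 0
```

A non-programmed decision, such as whether to invest in a new technology, cannot be reduced to such a rule; a DSS supports it by supplying information rather than by computing the answer.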

ATTRIBUTES OF A DSS

Adaptability and flexibility

High level of Interactivity

Ease of use

Efficiency and effectiveness

Complete control by decision-makers.

Ease of development

Extendibility

Support for modelling and analysis

Support for data access

Standalone, integrated and Web-based


Characteristics of a DSS

Support for decision makers in semi structured and unstructured problems.

Support for managers at various managerial levels, ranging from top


executive to line managers.

Support for individuals and groups. Less structured problems often require the involvement of several individuals from different departments and organization levels.

Support for interdependent or sequential decisions.

Support for intelligence, design, choice, and implementation.

Support for variety of decision processes and styles

DSSs are adaptive over time.

BENEFITS OF DSS

Improves efficiency and speed of decision making activities

Increases the control, competitiveness and capability of futuristic decision


making of the organization

Facilitates interpersonal communication

Encourages learning or training

Since it is mostly used in non-programmed decisions, it reveals new


approaches and sets up new evidences for an unusual decision
Helps automate managerial processes

COMPONENTS OF A DSS
Following are the components of the Decision Support System:

Database Management System (DBMS): To solve a problem, the necessary data may come from internal or external databases. In an organization, internal data are generated by systems such as TPS and MIS. External data come from a variety of sources such as newspapers, online data services and databases (financial, marketing, human resources).

Model Management System: It stores and accesses models that managers use to make decisions. Such models are used for designing a manufacturing facility, analyzing the financial health of an organization, forecasting demand of a product or service, etc.
Support Tools: Support tools like online help, pull-down menus, user interfaces, graphical analysis and error-correction mechanisms facilitate the user's interactions with the system.
Classification of DSS
There are several ways to classify DSS. Holsapple and Whinston classify DSS as follows:

Text-Oriented DSS: It contains textually represented information that could have a bearing on a decision. It allows documents to be electronically created, revised and viewed as needed.

Database-Oriented DSS: The database plays a major role here; it contains organized and highly structured data.

Spreadsheet-Oriented DSS: It contains information in spreadsheets that allows the user to create, view and modify procedural knowledge, and also instruct the system to execute self-contained instructions. The most popular tools are Excel and Lotus 1-2-3.

Solver-Oriented DSS: It is based on a solver, which is an algorithm or procedure written for performing certain calculations and a particular program type.

Rules-Oriented DSS: It follows certain procedures adopted as rules. An expert system is an example.

Compound DSS: It is built by using two or more of the five structures explained above.

TYPES OF DSS
Following are some typical DSSs:

 Status Inquiry System: helps in taking operational-level or middle-level management decisions, for example daily schedules of jobs to machines or machines to operators.

 Data Analysis System: needs comparative analysis and makes use of a formula or an algorithm, for example cash flow analysis, inventory analysis, etc.

 Information Analysis System: In this system data is analyzed and an information report is generated, for example sales analysis, accounts receivable systems, market analysis, etc.

 Accounting System: keeps track of accounting- and finance-related information, for example final accounts, accounts receivable, accounts payable, etc., that cover the major aspects of the business.

 Model Based System: simulation or optimization models used for decision making; used infrequently, they create general guidelines for operation or management.

EXECUTIVE SUPPORT SYSTEM (ESS)


Executive support systems are intended to be used by senior managers directly to provide support for non-programmed decisions in strategic management.
The information involved is often external, unstructured and even uncertain. The exact scope and context of such information is often not known beforehand.

This information is intelligence based:

Market intelligence

Investment intelligence

Technology intelligence
Examples of Intelligent Information

Following are some examples of intelligent information, which is often the source of an ESS:

 External databases

 Technology reports like patent records, etc.

 Technical reports from consultants

 Market reports

 Confidential information about competitors

 Speculative information like market conditions

 Government policies

 Financial reports and information

ADVANTAGES OF ESS:

 Easy for upper-level executives to use

 Ability to analyze trends

 Augmentation of managers' leadership capabilities

 Enhances personal thinking and decision making

 Contribution to strategic control flexibility

 Enhances organizational competitiveness in the marketplace

 Increased executive time horizons

 Better reporting system

 Improved mental model of the business executive

 Helps improve consensus building and communication

 Improved office automation

 Reduced time for finding information

 Better understanding

 Time management

 Increased communication capacity and quality

DISADVANTAGES OF ESS

 Functions are limited

 Hard to quantify benefits

 Executives may encounter information overload

 System may become slow

 Difficult to keep data current

 May lead to less reliable and insecure data

 Excessive cost for a small company

KNOWLEDGE MANAGEMENT SYSTEM (KMS)


All the systems we are discussing here come under the knowledge management category. A knowledge management system is not radically different from these information systems; it extends the already existing systems by assimilating more information.
As we have seen, data is raw facts, information is processed and/or interpreted data, and knowledge is personalized information.
What is knowledge?

personalized information

state of knowing and understanding

an object to be stored and manipulated

a process of applying expertise

a condition of access to information

potential to influence action

Sources of Knowledge of an Organization

Intranet
Data warehouses and knowledge repositories
Decision support tools
Groupware for supporting collaboration
Networks of knowledge workers
Internal expertise

DEFINITION OF KMS

Knowledge management comprises a range of practices used in an organization to identify, create, represent, distribute and enable the adoption of insights and experience. Such insights and experience comprise knowledge, either embodied in individuals or embedded in organizational processes and practices.
PURPOSE OF A KMS

Improved performance

Competitive advantage

Innovation

Sharing of knowledge

Integration

Continuous improvement by:


o Driving strategy
o Starting new lines of business
o Solving problems faster
o Developing professional skills
o Recruiting and retaining talent

ACTIVITIES IN KNOWLEDGE MANAGEMENT

 Start with the business problem and the business value to be delivered first.

 Identify what kind of strategy to pursue to deliver this value and address the KM problem.

 Think about the system required from a people and process point of view.

 Finally, think about what kind of technical infrastructure is required to support the people and processes.

 Implement the system and processes with appropriate change management and iterative staged releases.

LEVEL OF KNOWLEDGE MANAGEMENT

DATA WAREHOUSING - SYSTEM PROCESSES


We have a fixed number of operations to be applied on operational databases, and we have well-defined techniques such as using normalized data, keeping tables small, etc. These techniques are suitable for delivering a solution. But in the case of a decision support system, we do not know what queries and operations will need to be executed in future. Therefore, the techniques applied on operational databases are not suitable for data warehouses.
In this chapter we will focus on designing a data warehousing solution built on top of open-system technologies like UNIX and relational databases.

PROCESS FLOW IN DATA WAREHOUSE


There are four major processes that build a data warehouse. Here is the list of four
processes:

Extract and load data.

Cleaning and transforming the data.

Backup and Archive the data.

Managing queries & directing them to the appropriate data sources.

Extract and Load Process

 The data extraction step takes data from the source systems.

 The data load step takes the extracted data and loads it into the data warehouse.

Note: Before loading the data into the data warehouse, the information extracted from the external sources must be reconstructed.
Points to remember for the extract and load process:

 Controlling the process

 When to initiate the extract

 Loading the data
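The extract-and-load flow described above can be sketched in plain Python. The source rows, field names, and reconstruction rules here are illustrative assumptions, not part of any particular tool:

```python
# Sketch of an extract-and-load step: pull records from a source system,
# reconstruct them into the warehouse's field layout, and load them into
# a temporary staging area. All names and fields are illustrative.

def extract(source_rows):
    """Extraction: take raw rows from the source system as-is."""
    return list(source_rows)

def reconstruct(row):
    """Reconstruct a source row into the warehouse structure
    (rename fields, normalise types) before loading."""
    return {
        "customer_id": int(row["cust_no"]),
        "region": row["region"].strip().upper(),
        "amount": float(row["amt"]),
    }

def load(staging, rows):
    """Load: append reconstructed rows into the temporary data store."""
    staging.extend(reconstruct(r) for r in rows)

source = [
    {"cust_no": "101", "region": " north ", "amt": "250.0"},
    {"cust_no": "102", "region": "south", "amt": "99.5"},
]
staging = []
load(staging, extract(source))
print(staging[0])  # {'customer_id': 101, 'region': 'NORTH', 'amount': 250.0}
```

In a real warehouse the staging area would be a database table rather than a list, but the extract / reconstruct / load separation is the same.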

CONTROLLING THE PROCESS


Controlling the process involves determining when to start data extraction and running consistency checks on the data. The controlling process ensures that the tools, logic modules and programs are executed in the correct sequence and at the correct time.
WHEN TO INITIATE EXTRACT
Data needs to be in a consistent state when it is extracted, i.e. the data warehouse should represent a single, consistent version of the information to the user.

For example, in a customer profiling data warehouse in the telecommunications sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding customers for whom there is no associated subscription.
LOADING THE DATA
After extracting the data, it is loaded into a temporary data store, where it is cleaned up and made consistent.
Note: Consistency checks are executed only when all the data sources have been loaded into the temporary data store.
Clean and Transform Process
Once data is extracted and loaded into the temporary data store, it is time to perform cleaning and transformation. Here is the list of steps involved:

 Clean and transform the loaded data into a structure.

 Partition the data.


CLEAN AND TRANSFORM THE LOADED DATA INTO A STRUCTURE
This will speed up the queries. It can be done in the following ways:

 Make sure data is consistent within itself.

 Make sure data is consistent with other data within the same data source.

 Make sure data is consistent with data in other source systems.

 Make sure data is consistent with data already in the warehouse.

Transforming involves converting the source data into a structure. Structuring the data results in increased query performance and decreased operational cost. The information in the data warehouse must be transformed to support the performance requirements of the business as well as the ongoing operational cost.
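The consistency checks listed above can be sketched as validation passes over the staged records. The record layout and the two example rules are illustrative assumptions:

```python
# Sketch of consistency checks run over the temporary data store after all
# sources have been loaded. Fields, rules and reference data are illustrative.

staging = [
    {"customer_id": 101, "region": "NORTH", "amount": 250.0},
    {"customer_id": 102, "region": "WEST",  "amount": -5.0},  # bad row
]
valid_regions = {"NORTH", "SOUTH", "EAST"}  # e.g. from another source system

def check_internal(row):
    """Consistent within itself: amounts must be non-negative."""
    return row["amount"] >= 0

def check_cross_source(row):
    """Consistent with data in other source systems: region must be known."""
    return row["region"] in valid_regions

errors = [r for r in staging
          if not (check_internal(r) and check_cross_source(r))]
print(len(errors))  # 1 (customer 102 fails both checks)
```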

PARTITION THE DATA


Partitioning optimizes hardware performance and simplifies the management of the data warehouse. In this step we partition each fact table into multiple separate partitions.

AGGREGATION
Aggregation is required to speed up common queries. It relies on the fact that the most common queries will analyse a subset or an aggregation of the detailed data.
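The idea behind aggregation can be sketched in a few lines of Python: pre-compute a summary once at load time so the common query no longer scans the detail rows. The fact rows and the "sales by region" query are illustrative:

```python
from collections import defaultdict

# Sketch of pre-computing an aggregation so that a common query ("total
# sales by region") does not have to scan the detailed fact rows each time.
fact_rows = [
    {"region": "NORTH", "amount": 250.0},
    {"region": "NORTH", "amount": 100.0},
    {"region": "SOUTH", "amount": 99.5},
]

sales_by_region = defaultdict(float)
for row in fact_rows:                     # one pass at load time...
    sales_by_region[row["region"]] += row["amount"]

# ...so the common query becomes a lookup instead of a full scan.
print(sales_by_region["NORTH"])  # 350.0
```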

BACKUP AND ARCHIVE THE DATA


In order to recover the data in the event of data loss, software failure or hardware failure, it is necessary to back up the data on a regular basis. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required.
For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the latest 6 months of data being kept online. In this kind of scenario, there is often a requirement to be able to do month-on-month comparisons for this year and last year. In this case we require some data to be restored from the archive.

QUERY MANAGEMENT PROCESS


This process performs the following functions:

 It manages the queries.

 It speeds up query execution.

 It directs queries to the most effective data sources.

 It should also ensure that all system sources are used in the most effective way.

 It is also required to monitor actual query profiles.

 Information from this process is used by the warehouse management process to determine which aggregations to generate.

 This process does not generally operate during the regular load of information into the data warehouse.
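The routing decision at the heart of query management can be sketched as follows; the table names and the exact-match routing rule are illustrative assumptions, not a description of any specific product:

```python
# Sketch of the routing decision a query manager makes: send a query to a
# pre-built aggregate table when one matches its grouping, otherwise to the
# detail fact table. Table names and the matching rule are illustrative.

aggregations = {("region",): "agg_sales_by_region"}  # available summaries

def route(group_by_columns):
    """Direct a query to the most effective data source."""
    key = tuple(sorted(group_by_columns))
    return aggregations.get(key, "fact_sales_detail")

print(route(["region"]))           # agg_sales_by_region
print(route(["region", "month"]))  # fact_sales_detail (no matching summary)
```

A real query manager would also record the profile of each query so the warehouse manager can decide which new aggregations are worth generating.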

DATA WAREHOUSING - OLAP

INTRODUCTION

Online Analytical Processing (OLAP) servers are based on the multidimensional data model. OLAP allows managers and analysts to gain insight into information through fast, consistent, interactive access to it. In this chapter we will discuss the types of OLAP, the operations on OLAP, and the differences between OLAP and statistical databases and OLTP. The table below compares OLTP and OLAP:
Feature             OLTP                          OLAP
Purpose             Run day-to-day operations     Information retrieval and analysis
Structure           RDBMS                         RDBMS
Data Model          Normalized                    Multidimensional
Access              SQL                           SQL plus data analysis extensions
Type of Data        Data that runs the business   Data to analyse the business
Condition of Data   Changing, incomplete          Historical, descriptive

TYPES OF OLAP SERVERS


There are four types of OLAP servers, listed below:

Relational OLAP(ROLAP)

Multidimensional OLAP (MOLAP)

Hybrid OLAP (HOLAP)

Specialized SQL Servers

RELATIONAL OLAP (ROLAP)


Relational OLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage warehouse data, Relational OLAP uses a relational or extended-relational DBMS.
ROLAP includes the following:

 Implementation of aggregation navigation logic.

 Optimization for each DBMS back end.

 Additional tools and services.

MULTIDIMENSIONAL OLAP (MOLAP)


Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

HYBRID OLAP (HOLAP)


The hybrid OLAP technique is a combination of both ROLAP and MOLAP. It offers both the higher scalability of ROLAP and the faster computation of MOLAP. A HOLAP server allows storing large volumes of detailed data, while the aggregations are stored separately in the MOLAP store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

OLAP Operations
Since the OLAP server is based on the multidimensional view of data, we will discuss the OLAP operations on multidimensional data.

Here is the list of OLAP operations.

Roll-up

Drill-down

Slice and dice

Pivot (rotate)

ROLL-UP
This operation performs aggregation on a data cube in either of the following ways:

 By climbing up a concept hierarchy for a dimension

 By dimension reduction

Consider the following diagram showing the roll-up operation.

 The roll-up operation is performed by climbing up the concept hierarchy for the dimension location.

 Initially the concept hierarchy was "street < city < province < country".

 On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.

 The data is grouped into countries rather than cities.

 When the roll-up operation is performed, one or more dimensions are removed from the data cube.
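The roll-up just described can be sketched in plain Python, aggregating city-level measures one level up the location hierarchy. The cities, countries and sales figures are illustrative:

```python
from collections import defaultdict

# Sketch of a roll-up: aggregate city-level sales up one level of the
# location hierarchy (city -> country). Data and mapping are illustrative.
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}
sales_by_city = {"Toronto": 500, "Vancouver": 300, "Chicago": 400}

sales_by_country = defaultdict(int)
for city, amount in sales_by_city.items():
    sales_by_country[city_to_country[city]] += amount  # climb the hierarchy

print(dict(sales_by_country))  # {'Canada': 800, 'USA': 400}
```

The city level of detail is gone from the result, which is exactly what "dimension levels are collapsed" means in a roll-up.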

DRILL-DOWN
The drill-down operation is the reverse of roll-up. It is performed in either of the following ways:

 By stepping down a concept hierarchy for a dimension.

 By introducing a new dimension.

Consider the following diagram showing the drill-down operation:

 The drill-down operation is performed by stepping down the concept hierarchy for the dimension time.

 Initially the concept hierarchy was "day < month < quarter < year".

 On drilling down, the time dimension is descended from the level of quarter to the level of month.

 When the drill-down operation is performed, one or more dimensions are added to the data cube.

 It navigates from less detailed data to highly detailed data.

SLICE

The slice operation performs a selection on one dimension of a given cube and gives us a new sub-cube. Consider the following diagram showing the slice operation.

 The slice operation is performed for the dimension time using the criterion time = "Q1".

 It forms a new sub-cube by selecting one dimension.

DICE
The dice operation performs a selection on two or more dimensions of a given cube and gives us a new sub-cube. Consider the following diagram showing the dice operation:

The dice operation on the cube is based on the following selection criteria, which involve three dimensions:

 (location = "Toronto" or "Vancouver")

 (time = "Q1" or "Q2")

 (item = "Mobile" or "Modem")
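Slice and dice can both be sketched as filters over a cube stored as a list of cells; the cell layout and sales values are illustrative:

```python
# Sketch of slice and dice over a cube stored as a list of cells.
# The cell layout and values are illustrative.
cells = [
    {"location": "Toronto",   "time": "Q1", "item": "Mobile", "sales": 605},
    {"location": "Vancouver", "time": "Q2", "item": "Modem",  "sales": 512},
    {"location": "New York",  "time": "Q1", "item": "Mobile", "sales": 300},
]

# Slice: select on ONE dimension (time = "Q1").
q1_slice = [c for c in cells if c["time"] == "Q1"]

# Dice: select on TWO OR MORE dimensions, as in the criteria above.
sub_cube = [c for c in cells
            if c["location"] in ("Toronto", "Vancouver")
            and c["time"] in ("Q1", "Q2")
            and c["item"] in ("Mobile", "Modem")]

print(len(q1_slice))  # 2
print(len(sub_cube))  # 2 (the New York cell fails the location criterion)
```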

PIVOT
The pivot operation is also known as rotation. It rotates the data axes in the view in order to provide an alternative presentation of the data. Consider the following diagram showing the pivot operation.

In this example, the item and location axes of a 2-D slice are rotated.
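The rotation of the item and location axes can be sketched with a nested dictionary; the layout and values are an illustrative representation of the 2-D slice:

```python
# Sketch of a pivot (rotate): swap the item and location axes of a 2-D
# slice. The nested-dict layout and values are illustrative.
slice_2d = {"Mobile": {"Toronto": 605, "Vancouver": 512}}  # item -> location

pivoted = {}
for item, by_location in slice_2d.items():
    for location, sales in by_location.items():
        pivoted.setdefault(location, {})[item] = sales     # location -> item

print(pivoted)  # {'Toronto': {'Mobile': 605}, 'Vancouver': {'Mobile': 512}}
```

The values are unchanged; only the axes of the presentation have been swapped.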

DATA WAREHOUSING - FUTURE ASPECTS

Following are the future aspects of data warehousing:

 As we have seen, the size of open databases has approximately doubled in magnitude in the last few years. This change in magnitude is of great significance.

 As the size of databases grows, estimates of what constitutes a very large database continue to grow.

 The hardware and software available today do not allow keeping a large amount of data online. For example, a telco's call records require 10 TB of data to be kept online, which is just the size of one month's records. If records of sales, marketing, customers, employees, etc. are kept as well, the size will be more than 100 TB.

 Records contain not only textual information but also some multimedia data. Multimedia data cannot be manipulated as easily as text data, and searching multimedia data is not an easy task, whereas textual information can be retrieved by the relational software available today.

 Apart from size, planning, building and running ever-larger data warehouse systems is very complex. As the number of users increases, the size of the data warehouse also increases, and these users will also require access to the system.

 With the growth of the internet, there is a requirement for users to access data online.

QUESTION AND ANSWER OF DATA WAREHOUSE


Q: Define Data Warehouse?

A: Data warehouse is Subject Oriented, Integrated, Time-Variant and Non-volatile


collection of data that support management's decision making process.
Q: What does the subject oriented data warehouse signifies?
A: Subject oriented signifies that the data warehouse stores the information around a
particular subject such as product, customer, sales etc.
Q: List any five applications of Data Warehouse?
A: Some applications include financial services, Banking Services, Customer goods,
Retail Sectors, Controlled Manufacturing.
Q: What does OLAP and OLTP stand for?
A: OLAP is an acronym for Online Analytical Processing and OLTP is an acronym for Online Transaction Processing.
Q: What is the very basic difference between data warehouse and Operational
Databases?
A: Data warehouse contains the historical information that is made available for
analysis of the business whereas the Operational database contains the current
information that is required to run the business.
Q: List the Schema that Data Warehouse System implements?
A: Data Warehouse can implement Star Schema, Snowflake Schema or the Fact
Constellation Schema
Q: What is Data Warehousing?
A: Data Warehousing is the process of constructing and using the data warehouse.
Q: List the processes that are involved in Data Warehousing?
A: Data Warehousing involves data cleaning, data integration and data consolidation.
Q: List the functions of data warehouse tools and utilities?
A: The functions performed by data warehouse tools and utilities are Data Extraction, Data Cleaning, Data Transformation, Data Loading and Refreshing.
Q: What do you mean by Data Extraction?
A: Data Extraction means gathering the data from multiple heterogeneous sources.
Q: Define Metadata?
A: Metadata is simply defined as data about data. In other words we can say that
metadata is the summarized data that lead us to the detailed data.
Q: What does the Metadata Repository contain?
A: The metadata repository contains the definition of the data warehouse, business metadata, operational metadata, data for mapping from the operational environment to the data warehouse, and the algorithms for summarization.
Q: How does a Data Cube help?
A: A data cube helps us to represent data in multiple dimensions. A data cube is defined by dimensions and facts.
Q: Define Dimension?

A: The dimensions are the entities with respect to which an enterprise keeps the
records.
Q: Explain Data mart?
A: A data mart contains a subset of organisation-wide data. This subset of data is valuable to a specific group within an organisation. In other words, a data mart contains only data which is specific to a particular group.
Q: What is Virtual Warehouse?
A: The view over an operational data warehouse is known as virtual warehouse.
Q: List the phases involved in Data warehouse delivery Process?
A: The stages are IT Strategy, Education, Business Case Analysis, Technical Blueprint, Building the Version, History Load, Ad hoc Query, Requirement Evolution, Automation, and Extending Scope.
Q: Explain Load Manager?
A: This component performs the operations required for the extract and load process. The size and complexity of the load manager varies between specific solutions, from one data warehouse to another.
Q: Define the function of Load Manager?
A: The load manager extracts the data from the source systems, fast-loads the extracted data into a temporary data store, and performs simple transformations into a structure similar to the one in the data warehouse.
Q: Explain Warehouse Manager?
A: Warehouse manager is responsible for the warehouse management process. The
warehouse manager consists of third party system software, C programs and shell
scripts. The size and complexity of warehouse manager varies between specific
solutions.
Q: Define functions of Warehouse Manager?
A: The warehouse manager performs consistency and referential integrity checks; creates indexes, business views, and partition views against the base data; transforms and merges the source data from the temporary store into the published data warehouse; backs up the data in the data warehouse; and archives the data that has reached the end of its captured life.
Q: What is Summary Information?
A: Summary Information is the area in data warehouse where the predefined
aggregations are kept.
Q: What does the Query Manager responsible for?
A: Query Manager is responsible for directing the queries to the suitable tables.
Q: List the types of OLAP server?
A: There are four types of OLAP Server namely Relational OLAP, Multidimensional
OLAP, Hybrid OLAP, and Specialized SQL Servers
Q: Which one is faster Multidimensional OLAP or Relational OLAP?

A: Multidimensional OLAP is faster than the Relational OLAP


Q: List the functions performed by OLAP?
A: The functions such as roll-up, drill-down, slice, dice, and pivot are performed by
OLAP
Q: How many dimensions are selected in Slice operation?
A: Only one dimension is selected for the slice operation.
Q: How many dimensions are selected in dice operation?
A: For dice operation two or more dimensions are selected for a given cube.
Q: How many fact tables are there in Star Schema?
A: There is only one fact table in Star Schema.
Q: What is Normalization?
A: Normalization splits up the data into additional tables.
Q: Out of Star Schema and Snowflake Schema, in which is the dimension table normalized?
A: The snowflake schema uses the concept of normalization.
Q: What is the benefit of Normalization?
A: Normalization helps to reduce the data redundancy.
Q: Which language is used for defining Schema Definition?
A: Data Mining Query Language (DMQL) is used for Schema Definition.
Q: What language is the base of DMQL?
A: DMQL is based on Structured Query Language (SQL)
Q: What are the reasons for partitioning?
A: Partitioning is done for various reasons such as easy management, to assist
backup recovery, to enhance performance.
Q: What kinds of costs are involved in Data Marting?
A: Data marting involves hardware & software cost, network access cost and time cost.

FACTOR ANALYSIS

WHY USE FACTOR ANALYSIS?


Factor analysis is a useful tool for investigating variable relationships for complex
concepts such as socioeconomic status, dietary patterns, or psychological scales.
It allows researchers to investigate concepts that are not easily measured directly by
collapsing a large number of variables into a few interpretable underlying factors.
WHAT IS A FACTOR?
The key concept of factor analysis is that multiple observed variables have similar patterns of responses because they are all associated with an underlying latent variable (the factor), which cannot easily be measured directly.
For example, people may respond similarly to questions about income, education, and occupation, which are all associated with the latent variable socioeconomic status.
In every factor analysis, there is the same number of factors as there are variables.
Each factor captures a certain amount of the overall variance in the observed
variables, and the factors are always listed in order of how much variation they
explain.
The eigenvalue is a measure of how much of the variance of the observed variables a factor explains. Any factor with an eigenvalue ≥ 1 explains more variance than a single observed variable.
So if the factor for socioeconomic status had an eigenvalue of 2.3, it would explain as much variance as 2.3 of the three variables. This factor, which captures most of the variance in those three variables, could then be used in other analyses.
The factors that explain the least amount of variance are generally discarded. Deciding how many factors are useful to retain will be the subject of another post.
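The eigenvalue claim above amounts to a line of arithmetic: a factor's eigenvalue divided by the number of observed variables gives the proportion of total variance it explains. The figures follow the socioeconomic-status example in the text:

```python
# A factor's eigenvalue divided by the number of observed variables gives
# the proportion of total variance that the factor explains. Values follow
# the socioeconomic-status example (eigenvalue 2.3 across the three
# variables income, education and occupation).
eigenvalue = 2.3
n_variables = 3

proportion_explained = eigenvalue / n_variables
print(round(proportion_explained, 2))  # 0.77
```

This is also why the eigenvalue-greater-than-1 rule of thumb works: a factor with an eigenvalue below 1 explains less variance than one of the original variables would on its own.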
WHAT ARE FACTOR LOADINGS?
The relationship of each variable to the underlying factor is expressed by the so-called factor loading. Here is an example of the output of a simple factor analysis looking at indicators of wealth, with just six variables and two resulting factors.

Variables                                            Factor 1   Factor 2
Income                                               0.65       0.11
Education                                            0.59       0.25
Occupation                                           0.48       0.19
House value                                          0.38       0.60
Number of public parks in neighbourhood              0.13       0.57
Number of violent crimes per year in neighbourhood   0.23       0.55

The variable with the strongest association to the underlying latent variable, Factor 1, is income, with a factor loading of 0.65.
Since factor loadings can be interpreted like standardized regression coefficients, one
could also say that the variable income has a correlation of 0.65 with Factor 1. This
would be considered a strong association for a factor analysis in most research fields.
Two other variables, education and occupation, are also associated with Factor 1.
Based on the variables loading highly onto Factor 1, we could call it Individual
socioeconomic status.
House value, number of public parks, and number of violent crimes per year,
however, have high factor loadings on the other factor, Factor 2. They seem to
indicate the overall wealth within the neighbourhood, so we may want to call Factor
2 Neighbourhood socioeconomic status.
Notice that the variable house value also is marginally important in Factor 1 (loading
= 0.38). This makes sense, since the value of a persons house should be associated
with his or her income.

FEATURES OF FACTOR ANALYSIS

 Data reduction tool
 Removes redundancy or duplication from a set of correlated variables
 Represents correlated variables with a smaller set of derived variables
 Factors are formed that are relatively independent of one another
 Two types of variables:
o Latent variables: factors
o Observed variables
Some Applications of Factor Analysis
1. Identification of Underlying Factors:
 clusters variables into homogeneous sets
 creates new variables (i.e. factors)
 allows us to gain insight into categories
2. Screening of Variables:
 identifies groupings to allow us to select one variable to represent many
 useful in regression (recall collinearity)
3. Summary:
 allows us to describe many variables using a few factors
4. Clustering of Objects:
 helps us to put objects (people) into categories depending on their factor scores
