
DATA WAREHOUSE CONCEPTS

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management's decision-making process.
Data warehousing is a collection of methods, techniques, and tools used to support knowledge workers (senior managers, directors, managers, and analysts) in conducting data analyses that help with performing decision-making processes and improving information resources.
A data warehouse is a collection of data that supports decision-making processes. A data warehouse is a structured repository of historic data. It is developed in an evolutionary process by integrating data from non-integrated legacy systems.
Data warehousing as a technological method aims to provide technical support to companies with their data management needs, an important aspect of each company's success.
Data warehousing is a good investment and asset for the company, especially since it sustains the company's efficiency, productivity, profitability and competitive performance. An organization collects various data from different areas of the company, including inventory needs, sales leads, customer service, etc., and makes them more manageable. These data are then passed through the data management system needed for the company's policy-making measures.
The data warehouse is that portion of an overall Architected Data Environment that
serves as the single integrated source of data for processing information. The data
warehouse has specific characteristics that include the following:
Subject-Oriented: Information is presented according to specific subjects or areas of interest, not simply as computer files. Data is manipulated to provide information about a particular subject. For example, the SRDB is not simply made accessible to end-users, but is given structure and organized according to specific needs.
Integrated: A single source of information for and about understanding multiple
areas of interest. The data warehouse provides one-stop shopping and contains
information about a variety of subjects. Thus the OIRAP data warehouse has
information on students, faculty and staff, instructional workload, and
student outcomes.
Non-Volatile: Stable information that doesn't change each time an operational process is executed. Information is consistent regardless of when the warehouse is accessed.
Time-Variant: Containing a history of the subject, as well as current information.
Historical information is an important component of a data warehouse.
Accessible: The primary purpose of a data warehouse is to provide readily
accessible information to end-users.
Process-Oriented: It is important to view data warehousing as a process for delivery
of information. The maintenance of a data warehouse is ongoing and iterative in
nature.
Note: A data warehouse does not require transaction processing, recovery and concurrency control, because it is physically stored separately from the operational database.

OUR GOALS FOR A DATA WAREHOUSE


Collect Data - Scrub, Integrate & Make It Accessible
Provide Information - For Our Businesses
Start Managing Knowledge
So Our Business Partners Will Gain Wisdom!

UNDERSTANDING DATA WAREHOUSE

The data warehouse is a database that is kept separate from the organization's operational database.
The data warehouse is not updated frequently.
A data warehouse contains consolidated historical data, which helps the organization to analyse its business.
A data warehouse helps executives to organize, understand and use their data to make strategic decisions.
Data warehouse systems help in the integration of a diversity of application systems.
A data warehouse system allows analysis of consolidated historical data.

DATA WAREHOUSE APPLICATIONS


A data warehouse helps business executives to organize, analyse and use their data for decision making. A data warehouse serves as a core part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:

Financial services
Banking Services
Consumer goods
Retail sectors.
Controlled manufacturing

DATA WAREHOUSE TYPES


Information processing, Analytical processing and Data Mining are the three types
of data warehouse applications that are discussed below:

Information processing - A data warehouse allows us to process the information stored in it. The information can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

Analytical processing - A data warehouse supports analytical processing of the information stored in it. The data can be analysed by means of basic OLAP operations, including slice-and-dice, drill-down, drill-up, and pivoting.
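To make these OLAP operations concrete, the sketch below implements slice and roll-up (drill-up) over a toy fact set in plain Python; the data, dimension names and figures are invented for illustration.

```python
from collections import defaultdict

# Toy fact data: each record is (year, region, product, sales_amount).
facts = [
    (2022, "North", "TV",    100),
    (2022, "South", "TV",     80),
    (2022, "North", "Radio",  40),
    (2023, "North", "TV",    120),
    (2023, "South", "Radio",  60),
]

def slice_by(records, year):
    """Slice: fix one dimension (year) and keep the rest."""
    return [r for r in records if r[0] == year]

def roll_up(records):
    """Roll-up (drill-up): aggregate away the product dimension,
    summarising sales per (year, region)."""
    totals = defaultdict(int)
    for year, region, _product, amount in records:
        totals[(year, region)] += amount
    return dict(totals)

print(roll_up(slice_by(facts, 2022)))
# {(2022, 'North'): 140, (2022, 'South'): 80}
```

Drilling down is the reverse direction: starting from the (year, region) totals, the analyst would re-introduce the product dimension to see the detail rows behind each total.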

Data mining - Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. The mining results can be presented using visualization tools.

DATA WAREHOUSE TOOLS AND UTILITIES FUNCTIONS


The following are the functions of Data Warehouse tools and Utilities:
Data Extraction - Data Extraction involves gathering the data from multiple
heterogeneous sources.
Data Cleaning - Data Cleaning involves finding and correcting the errors in data.
Data Transformation - Data Transformation involves converting data from legacy
format to warehouse format.
Data Loading - Data loading involves sorting, summarizing, consolidating, checking
integrity and building indices and partitions.
Refreshing - Refreshing involves updating the data from the sources to the warehouse.
Note: Data Cleaning and Data Transformation are important steps in improving the
quality of data and data mining results.
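The functions above can be sketched as a minimal pipeline. This is an illustrative toy, not a real tool: the source rows, field names and the cleaning/transformation rules are all invented.

```python
# Legacy-format source rows: string values, DD/MM/YYYY dates, one bad row.
raw = [
    {"id": "1", "amount": " 250 ", "date": "01/05/2023"},
    {"id": "2", "amount": "bad",   "date": "02/05/2023"},
    {"id": "3", "amount": " 100 ", "date": "03/05/2023"},
]

def clean(rows):
    """Data cleaning: drop rows whose amount is not numeric."""
    return [r for r in rows if r["amount"].strip().isdigit()]

def transform(rows):
    """Data transformation: convert legacy formats (strings,
    DD/MM/YYYY dates) to the warehouse format (ints, ISO dates)."""
    out = []
    for r in rows:
        d, m, y = r["date"].split("/")
        out.append({"id": int(r["id"]),
                    "amount": int(r["amount"].strip()),
                    "date": f"{y}-{m}-{d}"})
    return out

def load(rows):
    """Data loading: sort and summarise before storage."""
    rows = sorted(rows, key=lambda r: r["date"])
    total = sum(r["amount"] for r in rows)
    return rows, total

rows, total = load(transform(clean(raw)))
print(total)   # 350: the non-numeric row was rejected during cleaning
```

Note how cleaning runs before transformation, matching the point above that both steps improve the quality of the data that finally reaches the warehouse.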

DATA MINING DEFINITION


Data mining is the process of extracting previously unknown but significant
information from large databases and using it to make crucial business decisions.
Data mining transforms the data into information and tends to be bottom-up.

DATA MINING PROCESS


1. The data extraction process extracts useful subsets of data for mining.
2. Aggregation may be done if summary statistics are useful.
3. Initial searches should be carried out on aggregated data to develop a bird's-eye view of the information (extracted information).
4. Focus on the detailed data provides a clearer view (assimilated information).
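Steps 3 and 4 can be sketched as follows: first summarise to get the bird's-eye view, then drill into the detail rows behind whatever the summary flags. The stores and amounts are invented for illustration.

```python
from collections import defaultdict

# Invented transaction detail: (store, amount).
detail = [("A", 10), ("A", 12), ("B", 11), ("B", 95), ("B", 9)]

# Step 3: bird's-eye view - aggregate per store first.
totals = defaultdict(int)
for store, amount in detail:
    totals[store] += amount
print(dict(totals))          # {'A': 22, 'B': 115}

# Step 4: the aggregate flags store B as unusually large;
# drill into its detail rows for the clearer view.
suspect = max(totals, key=totals.get)
print([amt for s, amt in detail if s == suspect])   # [11, 95, 9]
```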

OPERATIONAL VERSUS INFORMATIONAL SYSTEMS

  Operational System                        | Informational System
1 Supports day-to-day decisions             | Supports long-term, strategic decisions
2 Transaction driven                        | Analysis driven
3 Data constantly changes                   | Data rarely changes
4 Repetitive processing                     | Heuristic processing
5 Holds current data                        | Holds historical data
6 Stores detailed data                      | Stores summarized and detailed data
7 Application oriented                      | Subject oriented
8 Predictable pattern of usage              | Unpredictable pattern of usage
9 Serves clerical, transactional community  | Serves managerial community

METADATA
It is data about data. It is used as:

A directory to locate the contents of the data warehouse.

A guide to the mapping of data as the data is transformed from the operational environment to the data warehouse environment.

A guide to the algorithms used for summarization between the current data and the summarized data.

It also contains information about

Structure of the data

Data extraction/transformation history

Data usage statistics

Data warehouse table sizes

Column sizes

Attribute hierarchies and dimensions

Performance metrics

Operational versus Data Warehouse Systems

Feature                | Operational                                   | Data Warehouse
Data content           | current values                                | archival data, summarized data, calculated data
Data organization      | application by application                    | subject areas across enterprise
Nature of data         | dynamic                                       | static until refreshed
Data structure, format | complex; suitable for operational computation | simple; suitable for business analysis
Access probability     | high                                          | moderate to low
Data update            | updated on a field-by-field basis             | accessed and manipulated; no direct update
Usage                  | highly structured, repetitive processing      | highly unstructured, analytical processing
Response time          | sub-second to 2-3 seconds                     | seconds to minutes

DATA WAREHOUSE DESIGN APPROACHES


Data warehouse design is one of the key techniques in building the data warehouse. Choosing the right data warehouse design can save the project time and cost. Basically, two data warehouse design approaches are popular.
BOTTOM-UP DESIGN:
In the bottom-up design approach, the data marts are created first to provide reporting capability. A data mart addresses a single business area such as sales, finance, etc. These data marts are then integrated to build a complete data warehouse. The integration of data marts is implemented using a data warehouse bus architecture. In the bus architecture, a dimension is shared between facts in two or more data marts; such dimensions are called conformed dimensions. The conformed dimensions are integrated across the data marts, and then the data warehouse is built.

ADVANTAGES OF BOTTOM-UP DESIGN ARE:

This model contains consistent data marts, and these data marts can be delivered quickly.

As the data marts are created first, reports can be generated quickly.

The data warehouse can be extended easily to accommodate new business units: it is just a matter of creating new data marts and integrating them with the other data marts.

DISADVANTAGES OF BOTTOM-UP DESIGN ARE:


The positions of the data warehouse and the data marts are reversed in the bottom-up design approach.

TOP-DOWN DESIGN:
In the top-down design approach, the data warehouse is built first. The data marts are then created from the data warehouse.

ADVANTAGES OF TOP-DOWN DESIGN ARE:


Provides consistent dimensional views of data across data marts, as all data
marts are loaded from the data warehouse.

This approach is robust against business changes. Creating a new data mart
from the data warehouse is very easy.

DISADVANTAGES OF TOP-DOWN DESIGN ARE:


This methodology is inflexible to changing departmental needs during the implementation phase.
It represents a very large project, and the cost of implementing the project is significant.

DATA WAREHOUSE ARCHITECTURE


Three-Tier Data Warehouse Architecture
Data warehouses generally adopt a three-tier architecture. Following are the three tiers of the data warehouse architecture.

Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is a relational database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the extract, clean, load, and refresh functions.

Middle Tier - In the middle tier we have the OLAP server. The OLAP server can be implemented in either of the following ways:

Relational OLAP (ROLAP) - an extended relational database management system that maps operations on multidimensional data to standard relational operations.

Multidimensional OLAP (MOLAP) - a model that directly implements multidimensional data and operations.

Top Tier - This tier is the front-end client layer. This layer holds the query tools, reporting tools, analysis tools and data mining tools.
The following diagram explains the three-tier architecture of the data warehouse:

DATA WAREHOUSE MODELS


From the perspective of data warehouse architecture we have the following data
warehouse models:

Virtual Warehouse

Data mart

Enterprise Warehouse
VIRTUAL WAREHOUSE

The view over an operational data warehouse is known as a virtual warehouse.

It is easy to build a virtual warehouse.

Building a virtual warehouse requires excess capacity on operational database servers.

DATA MART

A data mart contains a subset of organisation-wide data.

This subset of data is valuable to a specific group of an organisation.

Note: In other words, we can say that a data mart contains only the data which is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers and sales. Data marts are confined to subjects.
Points to remember about data marts:

Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.

The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years.

The life cycle of a data mart may be complex in the long run if its planning and design are not organisation-wide.

Data marts are small in size.

Data marts are customized by department.

The source of a data mart is a departmentally structured data warehouse.

Data marts are flexible.


ENTERPRISE WAREHOUSE

The enterprise warehouse collects all of the information about all the subjects spanning the entire organization.

It provides enterprise-wide data integration.

The data is integrated from operational systems and external information providers.

This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
LOAD MANAGER

This component performs the operations required for the extract and load process.

The size and complexity of the load manager varies between specific solutions, from one data warehouse to another.

LOAD MANAGER ARCHITECTURE


The load manager performs the following functions:

Extract the data from the source system.

Fast-load the extracted data into a temporary data store.

Perform simple transformations into a structure similar to the one in the data warehouse.

EXTRACT DATA FROM SOURCE

The data is extracted from the operational databases or the external information providers. A gateway is the application program that is used to extract data. It is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server.
FAST LOAD

In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.

Transformations affect the speed of data processing.

It is more effective to load the data into a relational database prior to applying transformations and checks.

Gateway technology proves not to be suitable, since gateways tend not to be performant when large data volumes are involved.

SIMPLE TRANSFORMATIONS
While loading, it may be required to perform simple transformations. After this has been completed, we are in a position to do the complex checks. Suppose we are loading EPOS sales transactions; we need to perform the following checks:

Strip out all the columns that are not required within the warehouse.

Convert all the values to the required data types.
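These two checks can be sketched in a few lines. The EPOS row, the column names and the type rules below are all hypothetical, invented purely to illustrate the idea.

```python
# Hypothetical EPOS source row; values arrive as strings and only
# some columns are needed in the warehouse.
source_row = {"txn_id": "9001", "store": "S1", "qty": "3",
              "price": "4.50", "till_operator": "op7", "debug_flags": "x"}

KEEP = {"txn_id", "store", "qty", "price"}           # strip everything else
TYPES = {"txn_id": int, "store": str, "qty": int, "price": float}

def simple_transform(row):
    """Strip unneeded columns and convert values to the required types."""
    return {k: TYPES[k](v) for k, v in row.items() if k in KEEP}

print(simple_transform(source_row))
# {'txn_id': 9001, 'store': 'S1', 'qty': 3, 'price': 4.5}
```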


WAREHOUSE MANAGER

The warehouse manager is responsible for the warehouse management process.

The warehouse manager consists of third-party system software, C programs and shell scripts.

The size and complexity of the warehouse manager varies between specific solutions.

WAREHOUSE MANAGER ARCHITECTURE


The warehouse manager includes the following:

The controlling process

Stored procedures or C with SQL

Backup/recovery tool

SQL scripts

OPERATIONS PERFORMED BY WAREHOUSE MANAGER


The warehouse manager analyses the data to perform consistency and referential integrity checks.

Creates indexes, business views and partition views against the base data.

Generates new aggregations and updates the existing aggregations.

Generates the normalizations.

Transforms and merges the source data from the temporary store into the published data warehouse.

Backs up the data in the data warehouse.

Archives the data that has reached the end of its captured life.

Note: The warehouse manager also analyses query profiles to determine whether indexes and aggregations are appropriate.

QUERY MANAGER

The query manager is responsible for directing the queries to the suitable tables.

By directing the queries to the appropriate tables, the query request and response process is sped up.

The query manager is responsible for scheduling the execution of the queries posed by the user.

QUERY MANAGER ARCHITECTURE


Query Manager includes the following:

The query redirection via C tool or RDBMS.

Stored procedures.

Query Management tool.

Query Scheduling via C tool or RDBMS.


DETAILED INFORMATION
The following diagram shows the detailed information.

The detailed information is not kept online; rather, it is aggregated to the next level of detail and then archived to tape. The detailed-information part of the data warehouse keeps the detailed information in the starflake schema. The detailed information is loaded into the data warehouse to supplement the aggregated data.
Note: If the detailed information is held offline to minimize disk storage, we should make sure that the data has been extracted, cleaned up, and transformed into the starflake schema before it is archived.
In general, all data warehouse systems have the following layers:
Data Source Layer
Data Extraction Layer
Staging Area
ETL Layer
Data Storage Layer
Data Logic Layer
Data Presentation Layer
Metadata Layer
System Operations Layer

The picture below shows the relationships among the different components of the
data warehouse architecture:

Each component is discussed individually below:


Data Source Layer
This represents the different data sources that feed data into the data warehouse. The
data source can be of any format -- plain text file, relational database, other types of
database, Excel file, etc., can all act as a data source.
Many different types of data can be a data source:

Operations -- such as sales data, HR data, product data, inventory data,


marketing data, systems data.
Web server logs with user browsing data.
Internal market research data.
Third-party data, such as census data, demographics data, or survey data.

All these data sources together form the Data Source Layer.
Data Extraction Layer
Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing, but unlikely any major data transformation.

Staging Area
This is where data sits prior to being scrubbed and transformed into a data
warehouse / data mart. Having one common area makes it easier for subsequent
data processing / integration.
ETL Layer
This is where data gains its "intelligence", as logic is applied to transform the data
from a transactional nature to an analytical nature. This layer is also where data
cleansing happens. The ETL design phase is often the most time-consuming phase in
a data warehousing project, and an ETL tool is often used in this layer.
Data Storage Layer
This is where the transformed and cleansed data sit. Based on scope and
functionality, 3 types of entities can be found here: data warehouse, data mart, and
operational data store (ODS). In any given system, you may have just one of the
three, two of the three, or all three types.
Data Logic Layer
This is where business rules are stored. Business rules stored here do not affect the
underlying data transformation rules, but do affect what the report looks like.
Data Presentation Layer
This refers to the information that reaches the users. This can be in a form of a
tabular / graphical report in a browser, an emailed report that gets automatically
generated and sent every day, or an alert that warns users of exceptions, among
others. Usually a tool and/or a reporting tool are used in this layer.
Metadata Layer
This is where information about the data stored in the data warehouse system is stored. A logical data model would be an example of something that's in the metadata layer. A metadata tool is often used to manage metadata.
System Operations Layer

This layer includes information on how the data warehouse system operates, such as
ETL job status, system performance, and user access history.

OTHER DEFINITIONS
Data Warehouse: A data structure that is optimized for distribution. It collects and
stores integrated sets of historical data from multiple operational systems and feeds
them to one or more data marts. It may also provide end-user access to support
enterprise views of data.
Data Mart: A data structure that is optimized for access. It is designed to facilitate
end-user analysis of data. It typically supports a single, analytic application used by
a distinct set of workers.
Staging Area: Any data store that is designed primarily to receive data into a
warehousing environment.
Operational Data Store: A collection of data that addresses operational needs of
various operational units. It is not a component of a data warehousing architecture,
but a solution to operational needs.
OLAP (On-Line Analytical Processing): A method by which multidimensional
analysis occurs.
Multidimensional Analysis: The ability to manipulate information by a variety of
relevant categories or dimensions to facilitate analysis and understanding of the
underlying data. It is also sometimes referred to as drilling-down, drilling-across
and slicing and dicing
Star Schema: A means of aggregating data based on a set of known dimensions. It
stores data multi-dimensionally in a two dimensional Relational Database
Management System (RDBMS), such as Oracle.
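As a concrete sketch of a star schema in a two-dimensional RDBMS, the snippet below builds one fact table joined to two dimension tables in SQLite (Python's stdlib `sqlite3`) and aggregates over the known dimensions. All table names, column names and figures are invented for illustration.

```python
import sqlite3

# Minimal star schema: one fact table referencing two dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);
""")
db.executemany("INSERT INTO dim_product VALUES (?, ?)",
               [(1, "TV"), (2, "Radio")])
db.executemany("INSERT INTO dim_date VALUES (?, ?)",
               [(10, 2022), (11, 2023)])
db.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
               [(1, 10, 100.0), (1, 11, 120.0), (2, 11, 60.0)])

# Aggregating the facts over the known dimensions:
rows = db.execute("""
    SELECT p.name, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id    = f.date_id
    GROUP BY p.name, d.year
    ORDER BY p.name, d.year
""").fetchall()
print(rows)
# [('Radio', 2023, 60.0), ('TV', 2022, 100.0), ('TV', 2023, 120.0)]
```

The snowflake variant described next would normalize the dimension tables further, e.g. splitting a product dimension into product and category tables.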
Snowflake Schema: An extension of the star schema by means of applying
additional dimensions to the dimensions of a star schema in a relational
environment.
Multidimensional Database: Also known as MDDB or MDDBS. A class of
proprietary, non-relational database management tools that store and manage data

in a multidimensional manner, as opposed to the two dimensions associated with


traditional relational database management systems.
OLAP Tools: A set of software products that attempt to facilitate multidimensional
analysis. Can incorporate data acquisition, data access, data manipulation, or any
combination thereof.

METADATA REPOSITORY
The metadata repository is an integral part of a data warehouse system. The metadata repository contains the following metadata:
Business Metadata - This metadata has the data ownership information, business
definition and changing policies.
Operational Metadata -This metadata includes currency of data and data lineage.
Currency of data means whether data is active, archived or purged. Lineage of data
means history of data migrated and transformation applied on it.
Data for mapping from operational environment to data warehouse -This metadata
includes source databases and their contents, data extraction, data partition,
cleaning, transformation rules, data refresh and purging rules.
The algorithms for summarization - This includes dimension algorithms, data on
granularity, aggregation, summarizing etc.

DATA MART
A data mart is a subject-oriented archive that stores data and uses the retrieved set of information to assist and support the requirements involved within a particular business function or department. Data marts exist within a single organizational data warehouse repository.
A data mart is a repository of data, gathered from operational data and other sources, that is designed to serve a particular community of knowledge workers.

Data marts improve end-user response time by allowing users to have access to the
specific type of data they need to view most often by providing the data in a way
that supports the collective view of a group of users.
Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data. In terms of a data warehouse, we can define metadata as follows:

Metadata is a road map to the data warehouse.

Metadata in a data warehouse defines the warehouse objects.

Metadata acts as a directory. This directory helps the decision support system to locate the contents of the data warehouse.

Note: In a data warehouse we create metadata for the data names and definitions of a given data warehouse. Along with this, additional metadata is also created for time-stamping any extracted data and recording the source of the extracted data.
Categories of Metadata

The metadata can be broadly categorized into three categories:

Business Metadata - This metadata has the data ownership information, business
definition and changing policies.
Technical Metadata - Technical metadata includes database system names, table
and column names and sizes, data types and allowed values. Technical metadata
also includes structural information such as primary and foreign key attributes
and indices.
Operational Metadata - This metadata includes currency of data and data
lineage. Currency of data means whether data is active, archived or purged.
Lineage of data means history of data migrated and transformation applied on it.
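The three categories above can be illustrated as a metadata record for a single warehouse table. Every field name and value below is invented; this is only a sketch of what each category might capture.

```python
# Illustrative metadata record for one (hypothetical) warehouse table.
metadata = {
    "business": {                        # ownership and business definition
        "owner": "Sales department",
        "definition": "Completed point-of-sale transactions",
    },
    "technical": {                       # names, types, structural info
        "table": "fact_sales",
        "columns": {"amount": "DECIMAL(10,2)", "sale_date": "DATE"},
        "primary_key": ["txn_id"],
    },
    "operational": {                     # currency and lineage
        "currency": "active",            # active / archived / purged
        "lineage": ["extracted from EPOS", "cleaned", "loaded 2023-05-01"],
    },
}

def currency(meta):
    """Operational metadata tells us whether the data is still active."""
    return meta["operational"]["currency"]

print(currency(metadata))   # active
```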

ROLE OF METADATA
Metadata plays a very important role in a data warehouse. The role of metadata in the warehouse is different from that of the warehouse data, yet it is equally important. The various roles of metadata are explained below.

The metadata act as a directory.

This directory helps the decision support system to locate the contents of data
warehouse.

Metadata helps in decision support system for mapping of data when data are
transformed from operational environment to data warehouse environment.

Metadata helps in summarization between current detailed data and highly


summarized data.

Metadata also helps in summarization between lightly detailed data and


highly summarized data.

Metadata are also used for query tools.

Metadata are used in reporting tools.

Metadata are used in extraction and cleansing tools.

Metadata are used in transformation tools.

Metadata also plays important role in loading functions.

DIAGRAM TO UNDERSTAND ROLE OF METADATA.

WHY TO CREATE DATA MART

The following are the reasons to create data mart:

To partition data in order to impose access control strategies.

To speed up the queries by reducing the volume of data to be scanned.

To segment data into different hardware platforms.

To structure data in a form suitable for a user access tool.

Note: Do not create a data mart for any other reason, since the operational cost of data marting could be very high. Before data marting, make sure that the data marting strategy is appropriate for your particular solution.
Steps to determine whether a data mart appears to fit the bill
The following steps need to be followed to make data marting cost-effective:

Identify the Functional Splits

Identify User Access Tool Requirements

Identify Access Control Issues


DATA WAREHOUSE V/S DATA MART

DATA WAREHOUSE:

Holds multiple subject areas

Holds very detailed information

Works to integrate all data sources

Does not necessarily use a dimensional model but feeds dimensional models.

DATA MART:

Often holds only one subject area- for example, Finance, or Sales

May hold more summarized data (although many hold full detail)

Concentrates on integrating information from a given subject area or set of


source systems
Is built focused on a dimensional model using a star schema.

REASONS FOR CREATING A DATA MART

Easy access to frequently needed data

Creates collective view by a group of users

Improves end-user response time

Ease of creation

Lower cost than implementing a full data warehouse

Potential users are more clearly defined than in a full data warehouse

Contains only business essential data and is less cluttered.

DECISION SUPPORT SYSTEM (DSS)


Decision support systems are interactive software-based systems intended to help managers in decision making by accessing large volumes of information generated from various related information systems involved in organizational business processes, such as office automation systems, transaction processing systems, etc.

DSS uses the summary information, exceptions, patterns and trends using the
analytical models. Decision Support System helps in decision making but does not
always give a decision itself. The decision makers compile useful information from
raw data, documents, personal knowledge, and/or business models to identify and
solve problems and make decisions.
Programmed and Non-programmed Decisions
There are two types of decisions - programmed and non-programmed decisions.
Programmed decisions are basically automated processes, general routine work,
where:

These decisions have been taken several times

These decisions follow some guidelines or rules


For example, selecting a reorder level for inventories is a programmed decision.
Non-programmed decisions occur in unusual and non-addressed situations, so:

It would be a new decision

There will not be any rules to follow

These decisions are made based on available information

These decisions are based on the manager's discretion, instinct, perception and judgment.
For example, investing in a new technology is a non-programmed decision.
Decision support systems generally involve non-programmed decisions. Therefore,
there will be no exact report, content or format for these systems. Reports are
generated on the fly.
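The contrast can be made concrete with the inventory example above: a programmed decision is a fixed rule that runs automatically whenever stock changes. The thresholds below are invented for illustration.

```python
# Sketch of a programmed decision: the same rule, applied every time.
REORDER_LEVEL = 20          # when stock falls to this level, reorder
REORDER_QUANTITY = 50       # fixed order size

def reorder_decision(stock_on_hand):
    """Programmed decision: a simple, repeatable rule needing no judgment."""
    if stock_on_hand <= REORDER_LEVEL:
        return REORDER_QUANTITY     # place an order of fixed size
    return 0                        # no action needed

print(reorder_decision(15))   # 50
print(reorder_decision(35))   # 0
```

A non-programmed decision, such as whether to invest in a new technology, cannot be reduced to such a rule; a DSS supports it by supplying information rather than by computing the answer.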

ATTRIBUTES OF A DSS

Adaptability and flexibility

High level of Interactivity

Ease of use

Efficiency and effectiveness

Complete control by decision-makers.

Ease of development

Extendibility

Support for modelling and analysis

Support for data access

Standalone, integrated and Web-based


Characteristics of a DSS

Support for decision makers in semi structured and unstructured problems.

Support for managers at various managerial levels, ranging from top


executive to line managers.

Support for individuals and groups. Less structured problems often require the involvement of several individuals from different departments and organization levels.

Support for interdependent or sequential decisions.

Support for intelligence, design, choice, and implementation.

Support for variety of decision processes and styles

DSSs are adaptive over time.

BENEFITS OF DSS

Improves efficiency and speed of decision making activities

Increases the control, competitiveness and capability of futuristic decision


making of the organization

Facilitates interpersonal communication

Encourages learning or training

Since it is mostly used in non-programmed decisions, it reveals new


approaches and sets up new evidences for an unusual decision
Helps automate managerial processes

COMPONENTS OF A DSS
Following are the components of the Decision Support System:

Database Management System (DBMS): To solve a problem, the necessary data may come from internal or external databases. In an organization, internal data are generated by systems such as TPS and MIS. External data come from a variety of sources such as newspapers, online data services and databases (financial, marketing, human resources).

Model Management System: It stores and accesses models that managers use to make decisions. Such models are used for designing a manufacturing facility, analyzing the financial health of an organization, forecasting demand of a product or service, etc.
Support Tools: Support tools like online help, pull-down menus, user interfaces, graphical analysis and error-correction mechanisms facilitate the user's interactions with the system.
Classification of DSS
There are several ways to classify DSS. Holsapple and Whinston classify DSS as follows:

Text-Oriented DSS: It contains textually represented information that could have a bearing on a decision. It allows documents to be electronically created, revised and viewed as needed.

Database-Oriented DSS: The database plays a major role here; it contains organized and highly structured data.

Spreadsheet-Oriented DSS: It contains information in spreadsheets that allows the user to create, view and modify procedural knowledge, and also instruct the system to execute self-contained instructions. The most popular tools are Excel and Lotus 1-2-3.

Solver-Oriented DSS: It is based on a solver, which is an algorithm or procedure written for performing certain calculations and a particular program type.

Rules-Oriented DSS: It follows certain procedures adopted as rules. An expert system is an example.

Compound DSS: It is built by using two or more of the five structures explained above.

TYPES OF DSS
Following are some typical DSSs:

 Status Inquiry System: helps in taking operational-level or middle-level management decisions, for example daily schedules of jobs to machines or machines to operators.

 Data Analysis System: needs comparative analysis and makes use of a formula or an algorithm, for example cash flow analysis, inventory analysis, etc.

 Information Analysis System: In this system data is analyzed and an information report is generated, for example sales analysis, accounts receivable systems, market analysis, etc.

 Accounting System: keeps track of accounting- and finance-related information, for example final accounts, accounts receivable, accounts payable, etc., that cover the major aspects of the business.

 Model Based System: simulation or optimization models used for decision making; used infrequently, they create general guidelines for operation or management.

EXECUTIVE SUPPORT SYSTEM (ESS)


Executive support systems are intended to be used by senior managers directly to provide support for non-programmed decisions in strategic management.
The information involved is often external, unstructured and even uncertain. The exact scope and context of such information is often not known beforehand.

This information is intelligence based:

Market intelligence

Investment intelligence

Technology intelligence
Examples of Intelligent Information

Following are some examples of intelligent information, which is often the source of an ESS:

 External databases

 Technology reports like patent records, etc.

 Technical reports from consultants

 Market reports

 Confidential information about competitors

 Speculative information like market conditions

 Government policies

 Financial reports and information

ADVANTAGES OF ESS:

 Easy for upper-level executives to use

 Ability to analyze trends

 Augmentation of managers' leadership capabilities

 Enhances personal thinking and decision making

 Contribution to strategic control flexibility

 Enhances organizational competitiveness in the marketplace

 Increased executive time horizons

 Better reporting system

 Improved mental model of the business executive

 Helps improve consensus building and communication

 Improved office automation

 Reduced time for finding information

 Better understanding

 Time management

 Increased communication capacity and quality

DISADVANTAGES OF ESS

 Functions are limited

 Hard to quantify benefits

 Executives may encounter information overload

 System may become slow

 Difficult to keep data current

 May lead to less reliable and insecure data

 Excessive cost for a small company

KNOWLEDGE MANAGEMENT SYSTEM (KMS)


All the systems we are discussing here come under the knowledge management category. A knowledge management system is not radically different from these information systems; it extends the already existing systems by assimilating more information.
As we have seen, data is raw facts, information is processed and/or interpreted data, and knowledge is personalized information.
What is knowledge?

personalized information

state of knowing and understanding

an object to be stored and manipulated

a process of applying expertise

a condition of access to information

potential to influence action

Sources of Knowledge of an Organization

Intranet
Data warehouses and knowledge repositories
Decision support tools
Groupware for supporting collaboration
Networks of knowledge workers
Internal expertise

DEFINITION OF KMS

Knowledge management comprises a range of practices used in an organization to identify, create, represent, distribute and enable the adoption of insights and experience. Such insights and experience comprise knowledge, either embodied in individuals or embedded in organizational processes and practices.
PURPOSE OF A KMS

Improved performance

Competitive advantage

Innovation

Sharing of knowledge

Integration

Continuous improvement by:


o Driving strategy
o Starting new lines of business
o Solving problems faster
o Developing professional skills
o Recruiting and retaining talent

ACTIVITIES IN KNOWLEDGE MANAGEMENT

 Start with the business problem and the business value to be delivered first.

 Identify what kind of strategy to pursue to deliver this value and address the KM problem.

 Think about the system required from a people and process point of view.

 Finally, think about what kind of technical infrastructure is required to support the people and processes.

 Implement the system and processes with appropriate change management and iterative staged releases.

LEVEL OF KNOWLEDGE MANAGEMENT

DATA WAREHOUSING - SYSTEM PROCESSES


We have a fixed number of operations to be applied on operational databases, and we have well-defined techniques such as using normalized data, keeping tables small, etc. These techniques are suitable for delivering a solution. But in the case of a decision support system, we do not know what queries and operations will need to be executed in future. Therefore, the techniques applied on operational databases are not suitable for data warehouses.
In this chapter we will focus on designing a data warehousing solution built on top of open-system technologies like UNIX and relational databases.

PROCESS FLOW IN DATA WAREHOUSE


There are four major processes that build a data warehouse. Here is the list of four
processes:

Extract and load data.

Cleaning and transforming the data.

Backup and Archive the data.

Managing queries & directing them to the appropriate data sources.

Extract and Load Process

 The data extraction step takes data from the source systems.

 The data load step takes the extracted data and loads it into the data warehouse.

Note: Before loading the data into the data warehouse, the information extracted from the external sources must be reconstructed.
Points to remember for the extract and load process:

 Controlling the process

 When to initiate the extract

 Loading the data
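The extract-and-load flow described above can be sketched in plain Python. The source rows, field names, and reconstruction rules here are illustrative assumptions, not part of any particular tool:

```python
# Sketch of an extract-and-load step: pull records from a source system,
# reconstruct them into the warehouse's field layout, and load them into
# a temporary staging area. All names and fields are illustrative.

def extract(source_rows):
    """Extraction: take raw rows from the source system as-is."""
    return list(source_rows)

def reconstruct(row):
    """Reconstruct a source row into the warehouse structure
    (rename fields, normalise types) before loading."""
    return {
        "customer_id": int(row["cust_no"]),
        "region": row["region"].strip().upper(),
        "amount": float(row["amt"]),
    }

def load(staging, rows):
    """Load: append reconstructed rows into the temporary data store."""
    staging.extend(reconstruct(r) for r in rows)

source = [
    {"cust_no": "101", "region": " north ", "amt": "250.0"},
    {"cust_no": "102", "region": "south", "amt": "99.5"},
]
staging = []
load(staging, extract(source))
print(staging[0])  # {'customer_id': 101, 'region': 'NORTH', 'amount': 250.0}
```

In a real warehouse the staging area would be a database table rather than a list, but the extract / reconstruct / load separation is the same.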

CONTROLLING THE PROCESS


Controlling the process involves determining when to start data extraction and running consistency checks on the data. The controlling process ensures that the tools, logic modules and programs are executed in the correct sequence and at the correct time.
WHEN TO INITIATE EXTRACT
Data needs to be in a consistent state when it is extracted, i.e. the data warehouse should represent a single, consistent version of the information to the user.

For example, in a customer profiling data warehouse in the telecommunications sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding customers for whom there is no associated subscription.
LOADING THE DATA
After extracting the data, it is loaded into a temporary data store, where it is cleaned up and made consistent.
Note: Consistency checks are executed only when all the data sources have been loaded into the temporary data store.
Clean and Transform Process
Once data is extracted and loaded into the temporary data store, it is time to perform cleaning and transformation. Here is the list of steps involved:

 Clean and transform the loaded data into a structure.

 Partition the data.


CLEAN AND TRANSFORM THE LOADED DATA INTO A STRUCTURE
This will speed up the queries. It can be done in the following ways:

 Make sure data is consistent within itself.

 Make sure data is consistent with other data within the same data source.

 Make sure data is consistent with data in other source systems.

 Make sure data is consistent with data already in the warehouse.

Transforming involves converting the source data into a structure. Structuring the data results in increased query performance and decreased operational cost. The information in the data warehouse must be transformed to support the performance requirements of the business as well as the ongoing operational cost.
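The consistency checks listed above can be sketched as validation passes over the staged records. The record layout and the two example rules are illustrative assumptions:

```python
# Sketch of consistency checks run over the temporary data store after all
# sources have been loaded. Fields, rules and reference data are illustrative.

staging = [
    {"customer_id": 101, "region": "NORTH", "amount": 250.0},
    {"customer_id": 102, "region": "WEST",  "amount": -5.0},  # bad row
]
valid_regions = {"NORTH", "SOUTH", "EAST"}  # e.g. from another source system

def check_internal(row):
    """Consistent within itself: amounts must be non-negative."""
    return row["amount"] >= 0

def check_cross_source(row):
    """Consistent with data in other source systems: region must be known."""
    return row["region"] in valid_regions

errors = [r for r in staging
          if not (check_internal(r) and check_cross_source(r))]
print(len(errors))  # 1 (customer 102 fails both checks)
```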

PARTITION THE DATA


Partitioning optimizes hardware performance and simplifies the management of the data warehouse. In this step we partition each fact table into multiple separate partitions.

AGGREGATION
Aggregation is required to speed up common queries. It relies on the fact that the most common queries will analyse a subset or an aggregation of the detailed data.
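The idea behind aggregation can be sketched in a few lines of Python: pre-compute a summary once at load time so the common query no longer scans the detail rows. The fact rows and the "sales by region" query are illustrative:

```python
from collections import defaultdict

# Sketch of pre-computing an aggregation so that a common query ("total
# sales by region") does not have to scan the detailed fact rows each time.
fact_rows = [
    {"region": "NORTH", "amount": 250.0},
    {"region": "NORTH", "amount": 100.0},
    {"region": "SOUTH", "amount": 99.5},
]

sales_by_region = defaultdict(float)
for row in fact_rows:                     # one pass at load time...
    sales_by_region[row["region"]] += row["amount"]

# ...so the common query becomes a lookup instead of a full scan.
print(sales_by_region["NORTH"])  # 350.0
```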

BACKUP AND ARCHIVE THE DATA


In order to recover the data in the event of data loss, software failure or hardware failure, it is necessary to back up the data on a regular basis. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required.
For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the latest 6 months of data being kept online. In this kind of scenario, there is often a requirement to be able to do month-on-month comparisons for this year and last year. In this case we require some data to be restored from the archive.

QUERY MANAGEMENT PROCESS


This process performs the following functions:

 It manages the queries.

 It speeds up query execution.

 It directs queries to the most effective data sources.

 It should also ensure that all system sources are used in the most effective way.

 It is also required to monitor actual query profiles.

 Information from this process is used by the warehouse management process to determine which aggregations to generate.

 This process does not generally operate during the regular load of information into the data warehouse.
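The routing decision at the heart of query management can be sketched as follows; the table names and the exact-match routing rule are illustrative assumptions, not a description of any specific product:

```python
# Sketch of the routing decision a query manager makes: send a query to a
# pre-built aggregate table when one matches its grouping, otherwise to the
# detail fact table. Table names and the matching rule are illustrative.

aggregations = {("region",): "agg_sales_by_region"}  # available summaries

def route(group_by_columns):
    """Direct a query to the most effective data source."""
    key = tuple(sorted(group_by_columns))
    return aggregations.get(key, "fact_sales_detail")

print(route(["region"]))           # agg_sales_by_region
print(route(["region", "month"]))  # fact_sales_detail (no matching summary)
```

A real query manager would also record the profile of each query so the warehouse manager can decide which new aggregations are worth generating.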

DATA WAREHOUSING - OLAP

INTRODUCTION

Online Analytical Processing (OLAP) servers are based on the multidimensional data model. OLAP allows managers and analysts to gain insight into information through fast, consistent, interactive access to it. In this chapter we will discuss the types of OLAP, the operations on OLAP, and the differences between OLAP and statistical databases and OLTP. The table below compares OLTP and OLAP:
Feature             OLTP                          OLAP
Purpose             Run day-to-day operations     Information retrieval and analysis
Structure           RDBMS                         RDBMS
Data Model          Normalized                    Multidimensional
Access              SQL                           SQL plus data analysis extensions
Type of Data        Data that runs the business   Data to analyse the business
Condition of Data   Changing, incomplete          Historical, descriptive

TYPES OF OLAP SERVERS


There are four types of OLAP servers, listed below:

Relational OLAP(ROLAP)

Multidimensional OLAP (MOLAP)

Hybrid OLAP (HOLAP)

Specialized SQL Servers

RELATIONAL OLAP (ROLAP)


Relational OLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage warehouse data, Relational OLAP uses a relational or extended-relational DBMS.
ROLAP includes the following:

 Implementation of aggregation navigation logic.

 Optimization for each DBMS back end.

 Additional tools and services.

MULTIDIMENSIONAL OLAP (MOLAP)


Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

HYBRID OLAP (HOLAP)


The hybrid OLAP technique is a combination of both ROLAP and MOLAP. It offers both the higher scalability of ROLAP and the faster computation of MOLAP. A HOLAP server allows storing large volumes of detailed data, while the aggregations are stored separately in the MOLAP store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

OLAP Operations
Since the OLAP server is based on the multidimensional view of data, we will discuss the OLAP operations on multidimensional data.

Here is the list of OLAP operations.

Roll-up

Drill-down

Slice and dice

Pivot (rotate)

ROLL-UP
This operation performs aggregation on a data cube in either of the following ways:

 By climbing up a concept hierarchy for a dimension

 By dimension reduction

Consider the following diagram showing the roll-up operation.

 The roll-up operation is performed by climbing up the concept hierarchy for the dimension location.

 Initially the concept hierarchy was "street < city < province < country".

 On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.

 The data is grouped into countries rather than cities.

 When the roll-up operation is performed, one or more dimensions are removed from the data cube.
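The roll-up just described can be sketched in plain Python, aggregating city-level measures one level up the location hierarchy. The cities, countries and sales figures are illustrative:

```python
from collections import defaultdict

# Sketch of a roll-up: aggregate city-level sales up one level of the
# location hierarchy (city -> country). Data and mapping are illustrative.
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}
sales_by_city = {"Toronto": 500, "Vancouver": 300, "Chicago": 400}

sales_by_country = defaultdict(int)
for city, amount in sales_by_city.items():
    sales_by_country[city_to_country[city]] += amount  # climb the hierarchy

print(dict(sales_by_country))  # {'Canada': 800, 'USA': 400}
```

The city level of detail is gone from the result, which is exactly what "dimension levels are collapsed" means in a roll-up.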

DRILL-DOWN
The drill-down operation is the reverse of roll-up. It is performed in either of the following ways:

 By stepping down a concept hierarchy for a dimension.

 By introducing a new dimension.

Consider the following diagram showing the drill-down operation:

 The drill-down operation is performed by stepping down the concept hierarchy for the dimension time.

 Initially the concept hierarchy was "day < month < quarter < year".

 On drilling down, the time dimension is descended from the level of quarter to the level of month.

 When the drill-down operation is performed, one or more dimensions are added to the data cube.

 It navigates from less detailed data to highly detailed data.

SLICE

The slice operation performs a selection on one dimension of a given cube and gives us a new sub-cube. Consider the following diagram showing the slice operation.

 The slice operation is performed for the dimension time using the criterion time = "Q1".

 It forms a new sub-cube by selecting one dimension.

DICE
The dice operation performs a selection on two or more dimensions of a given cube and gives us a new sub-cube. Consider the following diagram showing the dice operation:

The dice operation on the cube is based on the following selection criteria, which involve three dimensions:

 (location = "Toronto" or "Vancouver")

 (time = "Q1" or "Q2")

 (item = "Mobile" or "Modem")
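Slice and dice can both be sketched as filters over a cube stored as a list of cells; the cell layout and sales values are illustrative:

```python
# Sketch of slice and dice over a cube stored as a list of cells.
# The cell layout and values are illustrative.
cells = [
    {"location": "Toronto",   "time": "Q1", "item": "Mobile", "sales": 605},
    {"location": "Vancouver", "time": "Q2", "item": "Modem",  "sales": 512},
    {"location": "New York",  "time": "Q1", "item": "Mobile", "sales": 300},
]

# Slice: select on ONE dimension (time = "Q1").
q1_slice = [c for c in cells if c["time"] == "Q1"]

# Dice: select on TWO OR MORE dimensions, as in the criteria above.
sub_cube = [c for c in cells
            if c["location"] in ("Toronto", "Vancouver")
            and c["time"] in ("Q1", "Q2")
            and c["item"] in ("Mobile", "Modem")]

print(len(q1_slice))  # 2
print(len(sub_cube))  # 2 (the New York cell fails the location criterion)
```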

PIVOT
The pivot operation is also known as rotation. It rotates the data axes in the view in order to provide an alternative presentation of the data. Consider the following diagram showing the pivot operation.

In this example, the item and location axes of a 2-D slice are rotated.
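The rotation of the item and location axes can be sketched with a nested dictionary; the layout and values are an illustrative representation of the 2-D slice:

```python
# Sketch of a pivot (rotate): swap the item and location axes of a 2-D
# slice. The nested-dict layout and values are illustrative.
slice_2d = {"Mobile": {"Toronto": 605, "Vancouver": 512}}  # item -> location

pivoted = {}
for item, by_location in slice_2d.items():
    for location, sales in by_location.items():
        pivoted.setdefault(location, {})[item] = sales     # location -> item

print(pivoted)  # {'Toronto': {'Mobile': 605}, 'Vancouver': {'Mobile': 512}}
```

The values are unchanged; only the axes of the presentation have been swapped.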

DATA WAREHOUSING - FUTURE ASPECTS

Following are the future aspects of data warehousing:

 As we have seen, the size of open databases has approximately doubled in magnitude in the last few years. This change in magnitude is of great significance.

 As the size of databases grows, estimates of what constitutes a very large database continue to grow.

 The hardware and software available today do not allow keeping a large amount of data online. For example, a telco's call records require 10 TB of data to be kept online, which is just the size of one month's records. If records of sales, marketing, customers, employees, etc. are kept as well, the size will be more than 100 TB.

 Records contain not only textual information but also some multimedia data. Multimedia data cannot be manipulated as easily as text data, and searching multimedia data is not an easy task, whereas textual information can be retrieved by the relational software available today.

 Apart from size, planning, building and running ever-larger data warehouse systems is very complex. As the number of users increases, the size of the data warehouse also increases, and these users will also require access to the system.

 With the growth of the internet, there is a requirement for users to access data online.

QUESTION AND ANSWER OF DATA WAREHOUSE


Q: Define Data Warehouse?

A: Data warehouse is Subject Oriented, Integrated, Time-Variant and Non-volatile


collection of data that support management's decision making process.
Q: What does the subject oriented data warehouse signifies?
A: Subject oriented signifies that the data warehouse stores the information around a
particular subject such as product, customer, sales etc.
Q: List any five applications of Data Warehouse?
A: Some applications include financial services, Banking Services, Customer goods,
Retail Sectors, Controlled Manufacturing.
Q: What does OLAP and OLTP stand for?
A: OLAP is an acronym for Online Analytical Processing and OLTP is an acronym for Online Transaction Processing.
Q: What is the very basic difference between data warehouse and Operational
Databases?
A: Data warehouse contains the historical information that is made available for
analysis of the business whereas the Operational database contains the current
information that is required to run the business.
Q: List the Schema that Data Warehouse System implements?
A: Data Warehouse can implement Star Schema, Snowflake Schema or the Fact
Constellation Schema
Q: What is Data Warehousing?
A: Data Warehousing is the process of constructing and using the data warehouse.
Q: List the processes that are involved in Data Warehousing?
A: Data Warehousing involves data cleaning, data integration and data consolidation.
Q: List the functions of data warehouse tools and utilities?
A: The functions performed by data warehouse tools and utilities are Data Extraction, Data Cleaning, Data Transformation, Data Loading and Refreshing.
Q: What do you mean by Data Extraction?
A: Data Extraction means gathering the data from multiple heterogeneous sources.
Q: Define Metadata?
A: Metadata is simply defined as data about data. In other words we can say that
metadata is the summarized data that lead us to the detailed data.
Q: What does the Metadata Repository contain?
A: The metadata repository contains the definition of the data warehouse, business metadata, operational metadata, data for mapping from the operational environment to the data warehouse, and the algorithms for summarization.
Q: How does a Data Cube help?
A: A data cube helps us to represent data in multiple dimensions. A data cube is defined by dimensions and facts.
Q: Define Dimension?

A: The dimensions are the entities with respect to which an enterprise keeps the
records.
Q: Explain Data mart?
A: A data mart contains a subset of organisation-wide data. This subset of data is valuable to a specific group within an organisation. In other words, a data mart contains only data which is specific to a particular group.
Q: What is Virtual Warehouse?
A: The view over an operational data warehouse is known as virtual warehouse.
Q: List the phases involved in Data warehouse delivery Process?
A: The stages are IT Strategy, Education, Business Case Analysis, Technical Blueprint, Building the Version, History Load, Ad hoc Query, Requirement Evolution, Automation, and Extending Scope.
Q: Explain Load Manager?
A: This component performs the operations required for the extract and load process. The size and complexity of the load manager varies between specific solutions, from one data warehouse to another.
Q: Define the function of Load Manager?
A: The load manager extracts the data from the source systems, fast-loads the extracted data into a temporary data store, and performs simple transformations into a structure similar to the one in the data warehouse.
Q: Explain Warehouse Manager?
A: Warehouse manager is responsible for the warehouse management process. The
warehouse manager consists of third party system software, C programs and shell
scripts. The size and complexity of warehouse manager varies between specific
solutions.
Q: Define functions of Warehouse Manager?
A: The warehouse manager performs consistency and referential integrity checks; creates indexes, business views, and partition views against the base data; transforms and merges the source data from the temporary store into the published data warehouse; backs up the data in the data warehouse; and archives the data that has reached the end of its captured life.
Q: What is Summary Information?
A: Summary Information is the area in data warehouse where the predefined
aggregations are kept.
Q: What does the Query Manager responsible for?
A: Query Manager is responsible for directing the queries to the suitable tables.
Q: List the types of OLAP server?
A: There are four types of OLAP Server namely Relational OLAP, Multidimensional
OLAP, Hybrid OLAP, and Specialized SQL Servers
Q: Which one is faster Multidimensional OLAP or Relational OLAP?

A: Multidimensional OLAP is faster than the Relational OLAP


Q: List the functions performed by OLAP?
A: The functions such as roll-up, drill-down, slice, dice, and pivot are performed by
OLAP
Q: How many dimensions are selected in Slice operation?
A: Only one dimension is selected for the slice operation.
Q: How many dimensions are selected in dice operation?
A: For dice operation two or more dimensions are selected for a given cube.
Q: How many fact tables are there in Star Schema?
A: There is only one fact table in Star Schema.
Q: What is Normalization?
A: Normalization splits up the data into additional tables.
Q: Out of Star Schema and Snowflake Schema, in which is the dimension table normalized?
A: The snowflake schema uses the concept of normalization.
Q: What is the benefit of Normalization?
A: Normalization helps to reduce the data redundancy.
Q: Which language is used for defining Schema Definition?
A: Data Mining Query Language (DMQL) is used for Schema Definition.
Q: What language is the base of DMQL?
A: DMQL is based on Structured Query Language (SQL)
Q: What are the reasons for partitioning?
A: Partitioning is done for various reasons such as easy management, to assist
backup recovery, to enhance performance.
Q: What kinds of costs are involved in Data Marting?
A: Data marting involves hardware & software cost, network access cost and time cost.

FACTOR ANALYSIS

WHY USE FACTOR ANALYSIS?


Factor analysis is a useful tool for investigating variable relationships for complex
concepts such as socioeconomic status, dietary patterns, or psychological scales.
It allows researchers to investigate concepts that are not easily measured directly by
collapsing a large number of variables into a few interpretable underlying factors.
WHAT IS A FACTOR?
The key concept of factor analysis is that multiple observed variables have similar patterns of responses because they are all associated with an underlying latent variable (the factor), which cannot easily be measured directly.
For example, people may respond similarly to questions about income, education, and occupation, which are all associated with the latent variable socioeconomic status.
In every factor analysis, there is the same number of factors as there are variables.
Each factor captures a certain amount of the overall variance in the observed
variables, and the factors are always listed in order of how much variation they
explain.
The eigenvalue is a measure of how much of the variance of the observed variables a factor explains. Any factor with an eigenvalue ≥ 1 explains more variance than a single observed variable.
So if the factor for socioeconomic status had an eigenvalue of 2.3, it would explain as much variance as 2.3 of the three variables. This factor, which captures most of the variance in those three variables, could then be used in other analyses.
The factors that explain the least amount of variance are generally discarded. Deciding how many factors are useful to retain will be the subject of another post.
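The eigenvalue claim above amounts to a line of arithmetic: a factor's eigenvalue divided by the number of observed variables gives the proportion of total variance it explains. The figures follow the socioeconomic-status example in the text:

```python
# A factor's eigenvalue divided by the number of observed variables gives
# the proportion of total variance that the factor explains. Values follow
# the socioeconomic-status example (eigenvalue 2.3 across the three
# variables income, education and occupation).
eigenvalue = 2.3
n_variables = 3

proportion_explained = eigenvalue / n_variables
print(round(proportion_explained, 2))  # 0.77
```

This is also why the eigenvalue-greater-than-1 rule of thumb works: a factor with an eigenvalue below 1 explains less variance than one of the original variables would on its own.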
WHAT ARE FACTOR LOADINGS?
The relationship of each variable to the underlying factor is expressed by the so-called factor loading. Here is an example of the output of a simple factor analysis looking at indicators of wealth, with just six variables and two resulting factors.

Variables                                            Factor 1   Factor 2
Income                                               0.65       0.11
Education                                            0.59       0.25
Occupation                                           0.48       0.19
House value                                          0.38       0.60
Number of public parks in neighbourhood              0.13       0.57
Number of violent crimes per year in neighbourhood   0.23       0.55

The variable with the strongest association to the underlying latent variable, Factor 1, is income, with a factor loading of 0.65.
Since factor loadings can be interpreted like standardized regression coefficients, one
could also say that the variable income has a correlation of 0.65 with Factor 1. This
would be considered a strong association for a factor analysis in most research fields.
Two other variables, education and occupation, are also associated with Factor 1.
Based on the variables loading highly onto Factor 1, we could call it Individual
socioeconomic status.
House value, number of public parks, and number of violent crimes per year,
however, have high factor loadings on the other factor, Factor 2. They seem to
indicate the overall wealth within the neighbourhood, so we may want to call Factor
2 Neighbourhood socioeconomic status.
Notice that the variable house value also is marginally important in Factor 1 (loading
= 0.38). This makes sense, since the value of a persons house should be associated
with his or her income.

FEATURES OF FACTOR ANALYSIS

 Data reduction tool
 Removes redundancy or duplication from a set of correlated variables
 Represents correlated variables with a smaller set of derived variables
 Factors are formed that are relatively independent of one another
 Two types of variables:
o Latent variables: factors
o Observed variables
Some Applications of Factor Analysis
1. Identification of Underlying Factors:
 clusters variables into homogeneous sets
 creates new variables (i.e. factors)
 allows us to gain insight into categories
2. Screening of Variables:
 identifies groupings to allow us to select one variable to represent many
 useful in regression (recall collinearity)
3. Summary:
 allows us to describe many variables using a few factors
4. Clustering of Objects:
 helps us to put objects (people) into categories depending on their factor scores
