DATA WAREHOUSING
A Seminar Report
Submitted by
Ayush Barnawal
13BTCSE007
in partial fulfillment for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE & ENGINEERING
At
Department of Computer Science & Information Technology

Shepherd School of Engineering and Technology
Sam Higginbottom Institute of Agriculture, Technology and Sciences,
Allahabad, U. P. 211007, India
November 2015
ABSTRACT
Different people have different definitions for a data warehouse. The most popular definition
came from Bill Inmon, who provided the following:
A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile collection of data in support of management's
decision making process.
Ralph Kimball states that a data warehouse is "a copy of transaction data specifically
structured for query and analysis." A data warehouse is a repository of an organization's
electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
This definition of the data warehouse focuses on data storage. However, the means to retrieve
and analyze data are also considered part of a data warehousing system, and many references to
data warehousing use this broader context. Thus, an expanded definition for data warehousing
includes business intelligence tools, tools to extract, transform, and load data into the
repository, and tools to manage and retrieve metadata.
A data warehouse can be normalized or denormalized. It can be a relational database,
multidimensional database, flat file, hierarchical database, object database, etc. Data
warehouse data often gets changed. And data warehouses often focus on a specific activity or
entity. Of course if you want to define every user as a decision maker and all activities as
decision making processes, then my assertion is false. But in my experience, the
overwhelming uses of data warehouses are for quite mundane, non-decision making purposes
rather than for grist for making decisions with wide ranging effects (so-called "strategic"
decisions). In fact, I would assert that most data warehouses are used for post-decision
monitoring of the effects of decisions or, as some people might say, for "operational" issues.
By the way, this is not saying that using data warehousing in the decision making process is
not a wonderful, potentially high return effort. But my caution is that though the trade press,
vendors, and many industry experts trumpet the role of data warehousing vis-à-vis decision
making, in reality we do not now have nor will we ever have a clear understanding of
decision making.
To tackle the key issues such as multimedia data indexing, similarity measures, search
methods and query processing in retrieval for large multimedia data archives, we extend the
concepts of conventional data warehouse and multimedia database to multimedia data
warehouse for effective data representation and storage. Data mining techniques help the
authorities to view the data in the required form. Data warehousing is a collection of
decision support technologies, aimed at enabling the knowledge worker to make faster and
better decisions.
Data warehousing and data mining are technologies that deliver critical and optimally useful
information to facilitate performance analysis of business organizations. These technologies
are not only an emerging trend in information technology but also a booming market in a
range of industries.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
1. INTRODUCTION
1.1 History
1.2 Definition
1.2.1 Data Warehouse: Subject-Oriented
1.2.2 Data Warehouse: Integrated
1.2.3 Data Warehouse: Time Variant
1.2.4 Data Warehouse: Non-Volatile
2. LITERATURE REVIEW
2.1 EVOLUTION IN ORGANIZATION USE
2.2 DEVELOPMENT STEPS
2.3 ARCHITECTURE
2.4 DATA MART & DATA WAREHOUSE
2.5 DATA SOURCE & DATA EXTRACTION
2.6 DATA MODELING
2.7 STAR SCHEMA
2.8 SNOW FLAKE SCHEMA
2.9 OLAP TOOLS
2.10 DATA WAREHOUSE SECURITY
2.11 SYSTEM ADMINISTRATION & MANAGEMENT
CONCLUSION
APPENDICES
REFERENCES
ACKNOWLEDGEMENT
I would like to take this opportunity to express my gratitude to the following people,
who have directly or indirectly helped me during the Seminar.
I would like to express my most sincere gratitude to my Seminar coordinator Mrs. Mudita
Shrivastava for her invaluable advice and patience throughout the course of this Seminar.
Without her guidance, this seminar would have been an uphill task.
Last but not least, I am also grateful to all my fellow classmates and seniors for their
continuous support and for the invaluable opinions, comments and ideas they offered on this
seminar.
1. INTRODUCTION
1.1 HISTORY
The concept of data warehousing dates back to the late 1980s when IBM
researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In
essence, the data warehousing concept was intended to provide an architectural model for the
flow of data from operational systems to decision support environments.
The concept attempted to address the various problems
associated with this flow, mainly the high costs associated with it. In the absence of a data
warehousing architecture, an enormous amount of redundancy was required to support
multiple decision support environments. In larger corporations it was typical for multiple
decision support environments to operate independently. Though each environment served
different users, they often required much of the same stored data. The process of gathering,
cleaning and integrating data from various sources, usually from long-term existing
operational systems (usually referred to as legacy systems), was typically in part replicated
for each environment. Moreover, the operational systems were frequently reexamined as new
decision support requirements emerged. Often new requirements necessitated gathering,
cleaning and integrating new data from "data marts" that were tailored for ready access by
users.
From this idea, the data warehouse was born as a place where relevant data could be
held for completing strategic reports for management. The key here is the word 'strategic' as
most executives were less concerned with the day to day operations than they were with a
more overall look at the model and business functions. As with all technology, over the
course of the latter half of the 20th century, we saw increased numbers and types of
databases. Many large businesses found themselves with data scattered across multiple
platforms and variations of technology, making it almost impossible for any one individual to
use data from multiple sources. A key idea within data warehousing is to take data from
multiple platforms/technologies (as varied as spreadsheets, DB2 databases, IDMS records,
and VSAM files) and place them in a common location that uses a common querying tool. In
this way operational databases could be held on whatever system was most efficient for the
operational business, while the reporting / strategic information could be held in a common
location using a common language. Data warehouses take this a step further by giving
the data itself commonality by defining what each term means and keeping it standard. (An
example of this would be gender, which can be referred to in many ways, but should be
standardized in a data warehouse with one common way of referring to each sex.) All of this
was designed to make decision support more readily available without affecting day-to-day
operations. One aspect of a data warehouse that should be stressed is that it is NOT a
location for ALL of a business's data, but rather a location for data that is 'interesting'. Data
that is interesting will assist decision makers in making strategic decisions relative to the
organization's overall mission.
1.2 DEFINITION
The data warehouse makes an attempt to figure out "what we need", before we know we need
it.
What is it actually?

A data warehouse stores current and historical data.
This data is taken from various, perhaps incompatible, sources and stored in a uniform
format.
Several tools transform this data into meaningful business information for the
purpose of comparisons, trends and forecasting.
Data in a warehouse is not updated or changed in any way, but is only loaded and
accessed later.
Data is organized according to subject instead of application.
In general, a database is not a Data Warehouse unless it has the following two features:
It collects information from a number of different disparate sources and is the
place where this disparity is reconciled, and
It allows several different applications to make use of the same information.
1.2.1 Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales.
Focusing on the modeling and analysis of data for decision makers, not on
daily operations or transaction processing.
Provide a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process.
1.2.2 Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources:
relational databases, flat files, on-line transaction records.
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures, attribute measures, etc.
among different data sources.
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
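As a minimal sketch of this conversion step, the following Python fragment maps gender codes from two hypothetical source systems onto one warehouse convention and converts hotel prices to a single currency. The source names, code tables and exchange rates are illustrative assumptions, not values from any real system.

```python
# Hypothetical code tables: each source system encodes gender differently.
GENDER_CODES = {
    "crm": {"m": "M", "f": "F"},
    "billing": {"1": "M", "2": "F"},
}

# Assumed, fixed exchange rates to the warehouse currency (USD).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25, "INR": 0.0125}


def integrate(record, source):
    """Convert one source record to the warehouse's common representation."""
    # Unknown codes fall back to "U" instead of failing the whole load.
    gender = GENDER_CODES[source].get(str(record["gender"]).lower(), "U")
    price_usd = round(record["price"] * RATES_TO_USD[record["currency"]], 2)
    return {"gender": gender, "price_usd": price_usd}


crm_row = integrate({"gender": "F", "price": 80.0, "currency": "EUR"}, "crm")
billing_row = integrate({"gender": 1, "price": 4000.0, "currency": "INR"}, "billing")
```

After conversion both rows use the same gender codes and the same currency, so they can be compared or aggregated without knowing which system they came from.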
1.2.3 Data Warehouse: Time Variant
The time horizon for the data warehouse is significantly longer than that of
operational systems.
Operational database: current value data.
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years).
Every key structure in the data warehouse
contains an element of time, explicitly or implicitly.
But the key of operational data may or may not contain a time element.
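The difference between the two kinds of key can be sketched in a few lines of Python. The customer identifier, balances and dates are made-up illustrations; the point is only that the warehouse key carries a time element while the operational key does not.

```python
from datetime import date

# Operational store: one current value per customer, overwritten on change.
operational = {}

# Warehouse store: the key includes a snapshot date, so history accumulates.
warehouse = {}


def record_balance(customer_id, balance, as_of):
    operational[customer_id] = balance          # current value only
    warehouse[(customer_id, as_of)] = balance   # time element in the key


record_balance("C1", 100, date(2015, 1, 31))
record_balance("C1", 250, date(2015, 2, 28))

# The operational store knows only the latest balance; the warehouse keeps both.
history = sorted((key[1], value) for key, value in warehouse.items() if key[0] == "C1")
```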
1.2.4 Data Warehouse: Non-Volatile
A physically separate store of data transformed from the operational
environment.
Operational update of data does not occur in the data warehouse environment.
Does not require transaction processing, recovery, and concurrency
control mechanisms.
Requires only two operations in data accessing:
initial loading of data and access of data.
[Figure: conceptual view of a data warehouse]
2. LITERATURE REVIEW
2.1 EVOLUTION IN ORGANIZATION USE

Organizations generally start off with relatively simple use of data warehousing. Over time,
more sophisticated use of data warehousing evolves. The following general stages of use of
the data warehouse can be distinguished:
Business intelligence has become the vendors' preferred synonym for decision support. This
is because decision support has an academic connotation and, as just mentioned, decision
support systems do not necessarily support decisions. On the other hand, business intelligence
systems do not necessarily make a business more intelligent. By the way, the
consultant-coined term business intelligence goes back to the late 1950s, fell out of use, was
revived by a DEC consultant, fell out of use again, and then was revived by the DW/DSS/BI
world in the late 1990s. Confusingly, business intelligence is also used as a synonym for
competitive intelligence (and is probably a more apt term for that area).
We cannot say that decision support systems or tools necessarily support the making of
decisions.
What's in a name? As far as I know, cognitive researchers do not agree on how decisions
are made. Therefore, saying that these tools support making decisions is not a provable
statement. Nor is it, in my opinion, an insightful way of defining these tools. It seems,
though, that 99% of the definitions of BI say something about better decisions. My wish is
that these definitions would include a cognitive model of how decisions are made and an
explanation of how the tools fit into the model.
These tools do not analyze by themselves; rather, they help a person analyze. In other words,
the tools facilitate analyses rather than perform analyses. Data warehousing and decision
support systems and tools do not necessarily go hand in hand: many data warehouses are not
used as decision support systems. And decision support systems or tools do not necessarily
require the use of a data warehouse as a source for data. I assert that, by far, the most used
decision support tools are spreadsheets not connected in any automated way with a data
warehouse. Analyzing data, no matter what tool is being used, is difficult. Whatever the
vendors do, it will remain difficult. But it is an activity that, when done well, can be quite
beneficial.
Data warehousing arises from an organization's need for reliable, consolidated,
unique and integrated reporting and analysis of its data, at different levels of aggregation.
The practical reality of most organizations is that their data infrastructure is made up of a
collection of heterogeneous systems. For example, an organization might have one system
that handles customer relationships, a system that handles employees, systems that handle
sales data or production data, yet another system for finance and budgeting data, etc. In
practice, these systems are often poorly or not at all integrated, and simple questions like
"How much time did salesperson A spend on customer C, how much did we sell to
customer C, was customer C happy with the provided service, did customer C pay his bills?"
can be very hard to answer, even though the information is available "somewhere" in the
different data systems.
Another problem is that ERP systems are designed to support relevant operations. For
example, a finance system might keep track of every single stamp bought: when it was
ordered, when it was delivered and when it was paid for. The system might also apply
accounting principles (like double-entry bookkeeping) that further complicate the data model.
Such information is great for the person in charge of buying "stamps" or the accountant trying
to sort out an irregularity, but the CEO is definitely not interested in such detailed
information. The CEO wants to know things like "What's the cost?", "What's the revenue?" and
"Did our latest initiative reduce costs?".
Yet another problem might be that the organization is, internally, in disagreement about
which data is correct. For example, the sales department might have one view of its costs,
while the finance department has another view of that cost. In such cases the organization can
spend unlimited time discussing who has the correct view of the data. It is partly the purpose
of data warehousing to bridge such problems. It is important to note that in data warehousing
the source data systems are considered as given. Suppose the CRM system identifies a person by
initials, the Employee-Time-Management system by full name and the ERP system by social
security number, and a person can change his name. It is not the task of the data warehousing
consultant to conclude that things therefore do not work and that the organization should
invest in and implement one or two new systems to handle CRM, ERP etc. in a more consistent
manner.
Rather, the data warehousing consultant is charged with making the data appear consistent,
integrated and consolidated despite the problems in the underlying source systems. The data
warehousing consultant achieves this by employing different data warehousing
techniques, creating one or more new data repositories (i.e. the data warehouse) whose
data model(s) support the needed reporting and analysis.
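One such technique can be sketched as a cross-reference table that maps each source system's identifier for a person onto a single warehouse key. The system names, identifiers and figures below are hypothetical and serve only to show how rows from differently keyed sources end up in one consolidated record.

```python
# Hypothetical cross-reference: (source system, source identifier) -> warehouse key.
PERSON_XREF = {
    ("crm", "JS"): 1,                  # CRM identifies people by initials
    ("time_mgmt", "John Smith"): 1,    # time management uses the full name
    ("erp", "000-00-0000"): 1,         # ERP uses a social security number
}

crm_visits = [{"person": "JS", "customer": "C", "hours": 3}]
erp_sales = [{"person": "000-00-0000", "customer": "C", "amount": 500}]


def to_warehouse_key(source, identifier):
    """Translate a source-specific identifier into the common warehouse key."""
    return PERSON_XREF[(source, identifier)]


consolidated = {}
for row in crm_visits:
    key = (to_warehouse_key("crm", row["person"]), row["customer"])
    consolidated.setdefault(key, {"hours": 0, "amount": 0})["hours"] += row["hours"]
for row in erp_sales:
    key = (to_warehouse_key("erp", row["person"]), row["customer"])
    consolidated.setdefault(key, {"hours": 0, "amount": 0})["amount"] += row["amount"]
```

With the identifiers reconciled, a question that spans both systems (time spent on a customer and sales to that customer) becomes a single lookup, while the source systems themselves stay unchanged.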
With the increase in business competition, there is a need to obtain and analyze business
data faster. A lot of important business data is in operational database systems. However,
these systems are not designed for business data analysis, for the following reasons:
The data model is normalized for speed and not for data analysis.
The data model is not grouped into subject areas for analysis.
The data model is not dimensional.
An operational database can't afford the resources to perform computation-intensive
queries during analysis.
There is no cross-reference information between data from the different operational
databases, e.g. between the financial and operational databases.
Historical data may not be found in the operational database for trend analysis.
A data warehouse has data description and data browsing facilities.
Operational data changes over time.
A data warehouse can provide easy information access for business people to increase revenue,
profit, customer satisfaction, savings, and market share. The system can be used by different
departments in the organization.
2.2 DEVELOPMENT STEPS
The development steps for a data warehouse project are similar to those of other information
systems. The following outlines some key steps during development. The outline is divided
into three sections: planning & design, building & testing, and roll out & maintenance.
Planning and Design
Business drivers
Objectives
User needs
User and sponsor expectation
Application orientation
Data sources
Data quality
Whether to build a data warehouse or a data mart
Project risk
Budget plan
Time frame
Cost benefit analysis
Project team composition (DA, DBA, OLAP development, GUI
development, query development, report development, user training, network
management, system integration)
Logical and physical data model (depends on access & usage)
The logical and physical data model design depends on the data access and usage. At the
planning & design phase, the data model is only a preliminary design.
Building and Testing
HW, SW, transformation SW, middleware, OLAP SW, system management SW
Network infrastructure and management
Connect to source databases (flat file, ongoing connection, direct access)
Summarize or aggregate data
Prototyping
Data mining to find out data patterns.
Some data extraction & transformation software is very useful for developing data
transformation routines. These tools are useful in both the construction and the
maintenance phase. In addition, system management software can control the data extraction
processes that move data from other database systems to the data warehouse.
Roll out & Maintenance
System growth
Performance management
System maintenance
Security
Backup, recovery
Update data
Risk management is important to the success of a data warehouse project. Some of the project
risks are:
Technology risk: e.g. technology that is new to the marketplace, technology that is new to
the organization, coexistence of technologies, etc.
Complexity risk: e.g. complex data model and database process, business process
change, mission critical requirement, large number of installations, distributed system,
data re-modeling required for legacy system, etc.
Integration risk: e.g. integration with other information systems, real time requirements
for the interfaces, etc.
Project team risk: e.g. team member experience, business user involvement, etc.
2.3 ARCHITECTURE
Architecture, in the context of an organization's data warehousing efforts, is a
conceptualization of how the data warehouse is built. There is no right or wrong architecture,
but rather there are multiple architectures that exist to support various environments and
situations. The worthiness of the architecture can be judged from how the conceptualization
aids in the building, maintenance, and usage of the data warehouse.
One possible simple conceptualization of a data warehouse architecture consists of the
following interconnected layers:
Operational Database layer
The source data for the data warehouse. An organization's Enterprise
Resource Planning systems fall into this layer.
Data Access layer
The interface between the operational and informational access layers. Tools
to extract, transform and load data into the warehouse fall into this layer.
Metadata layer
The data directory. This is usually more detailed than an operational system
data directory. There are dictionaries for the entire warehouse and sometimes
dictionaries for the data that can be accessed by a particular reporting and analysis
tool.
Informational Access layer
The data accessed for reporting and analyzing, and the tools for reporting and
analyzing data. Business intelligence tools fall into this layer. The Inmon-Kimball
differences about design methodology also have to do with this layer.
A data warehouse reads source data from different database systems in the organization. The
source databases are usually operational databases. The following is one possible logical
architecture for a data warehouse:
Figure 1: Data Warehouse Architecture
A data warehouse reads data from multiple operational databases. The data is cleaned,
transformed, or aggregated. The data is either updated or inserted into the data warehouse
depending on the trend analysis requirement. In addition, cross-reference data is generated
based on the new data; for example, accounting data from the accounting database needs to
be cross-referenced with the facility data from the facility database.
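The flow described here (read, clean, aggregate, cross-reference, load) can be sketched with Python's built-in sqlite3 module. The table names, columns and sample values are assumptions made for illustration, and two in-memory databases stand in for the operational systems and the warehouse.

```python
import sqlite3

# One in-memory database stands in for the separate operational databases.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE accounting (facility_id INTEGER, amount TEXT);
    INSERT INTO accounting VALUES (1, ' 100.50 '), (1, '49.50'), (2, 'n/a');
    CREATE TABLE facility (facility_id INTEGER, name TEXT);
    INSERT INTO facility VALUES (1, 'Plant A'), (2, 'Plant B');
""")

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE facility_cost (facility_name TEXT PRIMARY KEY, total REAL)")

# Extract and clean: skip amounts that are not numeric, then aggregate.
totals = {}
for facility_id, raw in src.execute("SELECT facility_id, amount FROM accounting"):
    try:
        amount = float(raw.strip())
    except ValueError:
        continue  # rejected row; a real process would log it for correction
    totals[facility_id] = totals.get(facility_id, 0.0) + amount

# Cross-reference accounting data with facility data, then load.
names = dict(src.execute("SELECT facility_id, name FROM facility"))
dw.executemany(
    "INSERT INTO facility_cost VALUES (?, ?)",
    [(names[fid], total) for fid, total in totals.items()],
)
loaded = dw.execute("SELECT facility_name, total FROM facility_cost").fetchall()
```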
Depending on the data model, the amount of data, and the particular query, performance can
be a problem for a data warehouse system. In the data warehouse, some tables can contain
millions of entries. Query operations on these tables can take a long time, for example when
the query performs aggregate operations to summarize the data. If the query needs many join
operations or sub-queries, the performance will be even slower. These long queries can be
run overnight in order to minimize the performance impact on the end users. Some of
these long queries can be sped up by modifying the data model or tuning the database.
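One way of modifying the data model for this purpose, offered here as an assumption rather than something the text prescribes, is a precomputed summary table: the expensive aggregate runs once in a batch step, and daytime queries read the small result. The sketch below uses sqlite3 with made-up table names and data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 10.0), ("North", 15.0), ("South", 7.5)],
)

# Overnight batch step: aggregate the large detail table once.
con.execute(
    "CREATE TABLE sales_summary AS "
    "SELECT region, SUM(amount) AS total, COUNT(*) AS n FROM sales GROUP BY region"
)

# Daytime query: reads the small summary table instead of scanning detail rows.
summary = con.execute(
    "SELECT region, total, n FROM sales_summary ORDER BY region"
).fetchall()
```

The trade-off is freshness: the summary reflects the data as of the last batch run, which is usually acceptable for trend analysis.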
Scalability is an important consideration in choosing software, hardware, and system
architecture for the data warehouse. Both the database size and the number of users for the
data warehouse can increase substantially over time. The software and hardware must be
scalable to support the new requirements.
There are different types of database management system, such as relational, object
oriented and hierarchical database systems. A relational database is usually the choice for
implementing a data warehouse, for the following reasons:
The relational database is the most commonly used database system in the commercial
environment. Many developers already have experience with relational database
products. This reduces the learning curve for the developers.
Most operational database systems are built on relational databases. If the
data warehouse is also built on a relational database, the data conversion
process between the operational database and the data warehouse can be simpler.
Also, there are many database products that enable direct data transfer between
relational database systems.
The relational database is more mature than other types of database system in terms of
scalability, stability, and efficiency.
The relational database has fewer proprietary functions than other database systems. This
increases the degree of platform independence.
An object oriented database is sometimes used in database applications because it has a
richer set of constructs to represent the data model. For example, a hierarchical data
structure can be represented better than in a relational database. An object oriented database
also provides better integration between data and functions. Therefore an object oriented
database is good for applications that have both complex data structures and functions (e.g.
CAD applications, simulation applications).
2.4 DATA MART & DATA WAREHOUSE

A data mart has functions similar to those of a data warehouse, except that a data mart is a
lot smaller in size and has a smaller group of users. For example, a department can design a
data mart that is tailored to the department's specific needs. The data mart can contain
additional domain specific information for the department. A data mart costs less time and
money to build and the design can be more flexible.
Some software products can merge multiple data marts into a data warehouse so that the data
can be shared by the entire organization. The software product provides data management
capabilities that extract a subset of the data from the data marts to form the data warehouse.
Some suggest that this approach to data warehouse development is more realistic because it is
a step-by-step methodology for building a data warehouse.
2.5 DATA SOURCE & DATA EXTRACTION

A data warehouse reads data from multiple data sources. These data sources are usually
operational databases such as an accounting information database, financial information
database, facility information database, ERP (Enterprise Resource Planning, e.g. SAP),
operational information database, research & engineering database, GIS (Geographical
Information System), etc.
Other external data sources can be industry data, economic data, credit data, commodity (raw
material) data, meteorological data, competitor related data, demographic data, etc.
Depending on the business requirements and the types of data, the data loading frequency can
be just once, once a day, once a week, or once a month. Once a day is the most common. There
will be ongoing data and system administrative work required to maintain the data
warehouse.
There are different ways to implement data extraction processes. The choice depends on the
requirements and the technical environment. Some implementations require more maintenance
effort than others. The following lists some implementation methods for the data
extraction process.
The data can be extracted into an ASCII report file. The file can be in fixed-width or
CSV format. The ASCII report file is generated through a standard report function on
the operational database system. In some situations, a custom report function is
developed.
The data can be extracted directly from the source database system. The source
database can create a single database view that contains all the necessary information.
With this database view, the data transformation process can directly request
information and load the data into the data warehouse.
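Both extraction methods can be sketched with the Python standard library. The report layout, table names and the view name dw_extract are hypothetical; the first part parses a CSV report file, the second reads from a single view defined on the source database.

```python
import csv
import io
import sqlite3

# Method 1: a CSV report file produced by the operational system.
report = io.StringIO(
    "invoice_id,customer,amount\n"
    "1001,Acme,250.00\n"
    "1002,Globex,99.90\n"
)
rows = [
    {
        "invoice_id": int(r["invoice_id"]),
        "customer": r["customer"],
        "amount": float(r["amount"]),
    }
    for r in csv.DictReader(report)
]

# Method 2: one view on the source database gathers everything the load needs.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoice (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO invoice VALUES (1001, 1, 250.0);
    CREATE VIEW dw_extract AS
        SELECT i.id AS invoice_id, c.name AS customer, i.amount
        FROM invoice i JOIN customer c ON c.id = i.customer_id;
""")
view_rows = source.execute(
    "SELECT invoice_id, customer, amount FROM dw_extract"
).fetchall()
```

The file-based method keeps the two systems decoupled at the cost of an extra export step, while the view-based method needs a live connection to the source database.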
The data loading process can encounter errors. The problems can be referential integrity
errors, data format errors, data range errors, or other data quality errors. In these
situations, the source data has to be modified before it can be loaded into the data warehouse.
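A load step can check for these error classes before inserting a row. The following sketch is illustrative only: the field names, the set of known customer keys and the allowed quantity range are assumptions.

```python
from datetime import datetime

KNOWN_CUSTOMERS = {1, 2, 3}  # assumed keys already present in a dimension table


def validate(row):
    """Return a list of data quality problems found in one source row."""
    problems = []
    if row["customer_id"] not in KNOWN_CUSTOMERS:
        problems.append("referential integrity")
    try:
        datetime.strptime(row["order_date"], "%Y-%m-%d")
    except ValueError:
        problems.append("format")
    if not 0 <= row["quantity"] <= 10_000:
        problems.append("range")
    return problems


rows = [
    {"customer_id": 1, "order_date": "2015-11-02", "quantity": 5},
    {"customer_id": 9, "order_date": "02/11/2015", "quantity": -1},
]
accepted = [r for r in rows if not validate(r)]
rejected = [(r, validate(r)) for r in rows if validate(r)]
```

Rejected rows are kept together with the reasons, so that the source data can be corrected and the load repeated.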
Depending on the data source, the source database system may need to be re-modeled in
order to produce the required data for the data warehouse. This data re-modeling work can
take a lot of time.
2.6 DATA MODELING

Data modeling is one of the most important steps in building a data warehouse. A data
warehouse uses dimensional modeling in a relational database environment. There are two
types of table: dimension tables and fact tables. A dimension table contains
information that is relatively static over time. A fact table contains transactional
information that changes over time. A fact table contains multiple foreign keys to dimension
tables and has some attributes of its own.
In comparison, entity relationship modeling has data tables, primary tables, lookup tables,
characteristic tables, virtual tables, and summarized tables.
Data modeling is a creative process and there can be different modeling solutions for the same
set of data. The purpose of data modeling is to organize data to meet business objectives and
to provide good performance for database operations.
Metadata is important information in data modeling. It is the information about the data
model. For example, take "$5.64 sales amount": without metadata, the data shows as "5.64" and
we don't know what it means. Metadata captures business rules for data such as data name,
description, value range, data version, data source, and referential integrity information. The
organization of metadata can be separated into a technical level and a business level. The
following table describes the information to be stored in the metadata repository.
Technical Level
Business Level
Data physical location
Mapping data source & target
Data access method
Valid user entries
Program and script name
Frequency of update and usage
Dependencies
Data update responsibilities
Data transformation logic
Data security
Data refresh rules
Other business rules
Rules to resolve data inconsistencies
Data ownership
Rules for data derivation (i.e. aggregation)
Table size estimates
Data access, drill down, and roll-up
Predefined queries and reports
Metadata can be used as a semantic layer for users to navigate through the data warehouse
without having to understand the complex physical data structure. Some metadata can be
extracted from the database management system or the data modeling CASE tool.
2.7 STAR SCHEMA

A star schema is a relational data model. Each schema has one fact table associated with
multiple dimension tables. A data warehouse can have many star schemas. A star schema
organizes data for the purpose of end-user analysis and is easy for the end user to
understand. Also, there are many OLAP tools that support star schema analysis. Figure 3 is an
example of a star schema.
Figure 3: Star Schema
In the above figure, the dimension tables are the Student table, Instructor table, Course
table, and Semester table. Information in these dimension tables is relatively static over
time. Each star schema can have multiple dimension tables. There is only one fact table for
each star schema. The fact table in figure 3 is the Attendance table. The fact table contains
multiple foreign keys to the four dimension tables. The fact table primary key is the
composite of the four foreign keys. Since the fact table contains transaction type information
and the dimension tables contain relatively static information, the amount of data in the fact
table is a lot more than the amount of data in the dimension tables.
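The star schema just described can be written down as a small sqlite3 sketch. The table names follow the figure's description; the column names, the grade attribute and the sample rows are assumptions, since the figure itself does not list them here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension tables: relatively static descriptive data.
    CREATE TABLE student    (student_id    INTEGER PRIMARY KEY, name  TEXT);
    CREATE TABLE instructor (instructor_id INTEGER PRIMARY KEY, name  TEXT);
    CREATE TABLE course     (course_id     INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE semester   (semester_id   INTEGER PRIMARY KEY, label TEXT);

    -- Fact table: one foreign key per dimension; the four together form the key.
    CREATE TABLE attendance (
        student_id    INTEGER REFERENCES student(student_id),
        instructor_id INTEGER REFERENCES instructor(instructor_id),
        course_id     INTEGER REFERENCES course(course_id),
        semester_id   INTEGER REFERENCES semester(semester_id),
        grade         TEXT,  -- assumed attribute belonging to the fact itself
        PRIMARY KEY (student_id, instructor_id, course_id, semester_id)
    );

    INSERT INTO student    VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO instructor VALUES (1, 'Dr. Rao');
    INSERT INTO course     VALUES (1, 'Databases');
    INSERT INTO semester   VALUES (1, '2015 Autumn');
    INSERT INTO attendance VALUES (1, 1, 1, 1, 'A'), (2, 1, 1, 1, 'B');
""")

# Number of students in a course: join the fact table to one dimension.
counts = con.execute("""
    SELECT c.title, COUNT(DISTINCT a.student_id)
    FROM attendance a JOIN course c ON c.course_id = a.course_id
    GROUP BY c.title
""").fetchall()
```

Every analysis query has the same shape: start from the fact table, join only the dimensions that are needed, and group by their attributes.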
The above data model can, for example, provide the following query results:
List of students in a course, a major, or a minor
Instructors for a course
Courses taught by an instructor
List of instructors in a faculty
List of students taught by an instructor
List of instructors that teach a student
Summary of a student's grade
Total credit obtained by a student
List of courses taken by a student in a semester, or a year
Number of students in a course
Number of openings in a course
2.8 SNOW FLAKE SCHEMA

A snow flake schema is similar to a star schema. It normalizes the dimension tables to save
data storage space. It can be used to represent hierarchies of information.
Figure 5: Snow Flake Schema
The Student table is normalized to contain foreign keys to the Major and Minor tables. The
relationship between the Student table and the Major table is many-to-one. In other
situations, if the relationship is many-to-many, this will create a chain of tables for the
dimension table. This makes the data model more difficult for the end user to use and
understand. Therefore, the use of a snow flake schema can decrease the browsing performance.
In addition, the storage space saving in the dimension tables is not significant in comparison
to the size of the fact table. The fact table is usually many times larger than the dimension
tables.
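A minimal sqlite3 sketch of the normalized Student dimension follows. Column names and sample rows are assumptions; the point is that an attribute such as the major's name now lives in its own table, so reading it requires one more join than in a star schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE major (major_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE minor (minor_id INTEGER PRIMARY KEY, name TEXT);

    -- Normalized dimension: Student stores keys instead of repeating names.
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT,
        major_id   INTEGER REFERENCES major(major_id),
        minor_id   INTEGER REFERENCES minor(minor_id)
    );

    INSERT INTO major VALUES (1, 'Computer Science');
    INSERT INTO minor VALUES (1, 'Mathematics');
    INSERT INTO student VALUES (1, 'Asha', 1, 1), (2, 'Ravi', 1, NULL);
""")

# Listing students by major needs an extra join through the Major table.
by_major = con.execute("""
    SELECT m.name, s.name
    FROM student s JOIN major m ON m.major_id = s.major_id
    ORDER BY s.name
""").fetchall()
```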
2.9 OLAP TOOLS

An Online Analytical Processing (OLAP) tool is used for data analysis, especially with a
dimensional data model. The tool provides a front-end user interface for the user to access
the data warehouse. Through the tool, the user can perform data analysis and design custom
reports or queries. The user can perform joins, aggregations, sorts, roll-up and roll-down
on the data.
Roll-down is done by adding row headers from the dimension tables. Roll-up is done by
removing row headers.
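How the choice of row headers changes the level of detail can be sketched in plain Python. The fact rows, attribute names and the summarize helper are hypothetical; the same totals are simply regrouped under a shorter or longer list of dimension attributes.

```python
from collections import defaultdict

# Hypothetical fact rows already joined to their dimension attributes.
facts = [
    {"year": 2015, "semester": "Spring", "course": "Databases", "students": 30},
    {"year": 2015, "semester": "Autumn", "course": "Databases", "students": 25},
    {"year": 2015, "semester": "Autumn", "course": "Networks", "students": 20},
]


def summarize(rows, headers, measure="students"):
    """Total the measure for each combination of the chosen row headers."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[h] for h in headers)] += row[measure]
    return dict(totals)


coarse = summarize(facts, ["year"])            # one row header: yearly totals
fine = summarize(facts, ["year", "semester"])  # two row headers: more detail
```

With fewer row headers the result has fewer, more aggregated rows; each additional header splits those rows into finer groups whose totals still add up to the same figure.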
Security features can be implemented with database view. View is a logical table derived
from the physical tables in the database. View provides a logical layer for the user to access
the database physical tables. For example, Employee is a physical table with the following
attributes:
Employee
View A
View B
First Name
Last Name
Department
Position
Phone Number
Age
Salary
Two views, View A and View B, are defined on the Employee table. View A exposes all
attributes of the Employee table. View B exposes all attributes except Age and Salary. View A
is used by managers in the company; View B is used by all other users.
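The two views can be sketched with Python's built-in sqlite3 module as follows. The column
names and the sample row are assumptions. SQLite has no user accounts, so the step of
granting each group of users access to its view only is not shown; in a multi-user database
system this would be done through that system's own privilege mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (
        first_name TEXT, last_name TEXT, department TEXT, position TEXT,
        phone_number TEXT, age INTEGER, salary REAL
    );
    INSERT INTO employee VALUES
        ('Anita', 'Rao', 'Finance', 'Analyst', '555-0100', 34, 52000.0);
    -- View A: every attribute (intended for managers).
    CREATE VIEW view_a AS SELECT * FROM employee;
    -- View B: every attribute except age and salary (intended for all other users).
    CREATE VIEW view_b AS
        SELECT first_name, last_name, department, position, phone_number
        FROM employee;
""")

def columns_of(connection, relation):
    """Return the column names exposed by a table or view (relation name is trusted)."""
    cursor = connection.execute(f"SELECT * FROM {relation} LIMIT 0")
    return [column[0] for column in cursor.description]

print(columns_of(conn, "view_a"))
print(columns_of(conn, "view_b"))
```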
2.10 DATA WAREHOUSE SECURITY

A data warehouse is an integrated repository derived from multiple source (operational
and legacy) databases. It is created either by replicating the source data or by
transforming it into a new representation. This process involves reading, cleaning,
aggregating and storing the data in the warehouse model. Software tools then access the
warehouse for strategic analysis, decision making and marketing applications. For example,
a warehouse can be used for inventory control of shelf stock in department stores.
Medical and human genome researchers can create research data that can be either
marketed or used by a wide range of users. The information and access privileges in a data
warehouse should mirror the constraints of the source data. A recent trend is to create
web-based data warehouses in which multiple users create components of the warehouse and
the environment is kept open to third-party access and tools. Given the opportunity, users
ask for large amounts of data in great detail. Since source data can be expensive, its
privacy and security must be assured. The idea of adaptive querying can be used to limit
access after some data has been offered to the user. Based on the user profile, access to
warehouse data can be restricted or modified.
Replication control
Replication can be viewed in a slightly different manner from the one found in the
traditional literature.
For example, an old copy can be considered a replica of the current copy of the data.
Slightly out-of-date data can be a good substitute for some users. The basic idea is that
the warehouse either keeps different replicas of the same items or creates them
dynamically. Legitimate users get the most consistent and complete copy of the data, while
casual users get a weak replica. Such a replica may be enough to satisfy the user's need
without providing information that can be used maliciously or to breach privacy. We have
formally defined the equivalence of replicas, and this notion can be used to create replicas
for different users. The replicas may be kept at one central site or distributed to proxies
that can serve the users efficiently. In some cases the user may be given the weak replica
first and an upgraded replica later, if the user is willing to pay for it or is entitled to it.
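As a minimal sketch of this idea, an older snapshot of a data item can serve as the weak
replica. The snapshot list, the stock values and the function name replica_for are
hypothetical; the sketch does not implement the formal notion of replica equivalence
mentioned above.

```python
from datetime import date

# Hypothetical snapshots of one data item, oldest first: (as_of, value).
SNAPSHOTS = [
    (date(2015, 9, 30), {"stock": 410}),
    (date(2015, 10, 31), {"stock": 385}),
    (date(2015, 11, 15), {"stock": 362}),
]

def replica_for(user_is_legitimate, snapshots, staleness=1):
    """Return the latest snapshot for legitimate users and an older one otherwise."""
    if user_is_legitimate:
        return snapshots[-1]
    # Weak replica: step back `staleness` versions, but never past the oldest one.
    return snapshots[max(0, len(snapshots) - 1 - staleness)]

print(replica_for(True, SNAPSHOTS))
print(replica_for(False, SNAPSHOTS))
```

Raising the staleness parameter gives casual users an older copy; upgrading a user amounts
to serving the latest snapshot instead.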
Aggregation and Generalization
The concept of a warehouse is based on the idea of using summaries and consolidators.
This implies that the source data is not available in raw form, which leads to ideas that
can be used for security. Some users can be given aggregates only over a large number of
records, whereas others can be given aggregates over small sets of records. The granularity
of aggregation can be made finer for genuine users. Generalization can be used to give users
high-level information at first and the lower-level details only after the security
constraints are satisfied. For example, the user may initially be given an approximate
answer based on some generalization over the domains of the database. Inheritance is another
notion that allows the access capability of users to grow: users can inherit access to
related data after having been given access to some data item.
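One way to read the aggregation idea is as a minimum group size: an aggregate is released
only when it covers enough records. The following Python sketch works under that assumption;
the sales records, the two levels and the thresholds are hypothetical.

```python
from statistics import mean

# Hypothetical sales records: (region, city, amount).
SALES = [
    ("North", "Delhi", 120.0),
    ("North", "Delhi", 80.0),
    ("North", "Lucknow", 100.0),
    ("South", "Chennai", 90.0),
]

def average_by(records, level, min_group_size):
    """Average amount per group, suppressing groups smaller than min_group_size.

    level is "region" (coarse) or "city" (fine).
    """
    index = {"region": 0, "city": 1}[level]
    groups = {}
    for record in records:
        groups.setdefault(record[index], []).append(record[2])
    return {
        key: mean(values)
        for key, values in groups.items()
        if len(values) >= min_group_size
    }

casual = average_by(SALES, "region", min_group_size=3)   # large aggregates only
trusted = average_by(SALES, "city", min_group_size=1)    # finer granularity
print(casual)
print(trusted)
```

Here the casual user sees only the North average, because the South group contains a single
record, while the trusted user sees an average for every city.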
Exaggeration and Misleading

These concepts can be used to deliberately distort the data. A view may be available to
support a particular query, but the values in the view may be overstated. For security
reasons, the quality of a view may depend on the user involved, and a user can be given an
exaggerated view of the data.
For example, instead of giving specific sales figures, a view may scale the figures up and
give only exaggerated data. In certain situations the warehouse can give misleading
information, that is, information which is partially incorrect or whose correctness is
difficult to verify. For example, a view of a company's annual report may contain a net
profit figure that includes the profit from sales of properties (not only the actual sales
of products).
Anonymity
Anonymity provides privacy for both the user and the warehouse data. A user does not know
the source warehouse for his query, and the warehouse does not know who the user is or which
particular view the user is accessing (a view may be constructed from many source databases
for that warehouse).
Note that a user must belong to the group of registered users and, similarly, a user
must get data only from legitimate warehouses. In such cases, encryption is used to secure
the connection between the users and the warehouse so that no outside user (a user who has
not registered with the warehouse) can access the warehouse.
2.11 SYSTEM ADMINISTRATION & MANAGEMENT
Some data are manually maintained in the data warehouse. These can be system-related data
that the data warehouse needs in order to operate, and they are usually maintained by the
system administrator. For example, the data warehouse holds information about all the data
loading processes. A scheduling program can use this information to execute the data loading
processes, and the execution status can be stored in the data warehouse for process tracking.
The system administrator can also maintain information about user accounts and access
privileges. A user interface can be developed for the administrator to maintain this
information.
Some lookup data and grouping data are also manually maintained. These data are used for
data analysis purposes.
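The interaction between the load-process information and a scheduling program can be
sketched in Python as follows. The field names, process names and dates are hypothetical,
and the load itself is omitted; only the selection of due processes and the recording of
the execution status are shown.

```python
from datetime import datetime, timedelta

# Hypothetical metadata rows describing the data loading processes.
load_processes = [
    {"name": "load_students", "interval": timedelta(days=1),
     "last_run": datetime(2015, 11, 1, 2, 0), "status": "success"},
    {"name": "load_courses", "interval": timedelta(days=7),
     "last_run": datetime(2015, 10, 30, 2, 0), "status": "success"},
]

def due_processes(processes, now):
    """Return the names of processes whose next scheduled run is at or before now."""
    return [p["name"] for p in processes if p["last_run"] + p["interval"] <= now]

def record_run(process, now, succeeded):
    """Store the execution status so that the run can be tracked later."""
    process["last_run"] = now
    process["status"] = "success" if succeeded else "failed"

now = datetime(2015, 11, 2, 3, 0)
due = due_processes(load_processes, now)
for process in load_processes:
    if process["name"] in due:
        record_run(process, now, succeeded=True)  # the actual load is omitted here
print(due)
```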
CONCLUSION
Since the primary task of management is effective decision making, the primary task of
research, and subsequently of data warehouses, is to generate accurate information for use
in that decision making.
It is imperative that an organization's data warehousing strategies reflect
changes in the internal and external business environment as well as the direction in
which the business is traveling. A data warehouse is a good solution for storing and
analyzing large amounts of data. It reads data from multiple operational databases on an
ongoing basis, and cross-reference information is generated between the data from the
different databases. The data model is designed to provide good browsing performance to the
end user. A data warehouse can be seen as a centralized data repository that provides both
current and historical data to the end user.
Playing an integral role in the growth, development and success of an organization, data
warehouses facilitate meaningful research, which in turn supports effective management.
APPENDICES
OLAP: Online Analytical Processing
DA: Data Extraction
CEO: Chief Executive Officer
DW: Data Warehouse
HW: Hardware
SW: Software
GUI: Graphical User Interface
REFERENCES
DCI (1997). The Roadmap for Data Warehouse Implementation.
Kimball, R. (1996). The Data Warehouse Toolkit. John Wiley & Sons, Inc.
Inmon, W.H. Tech Topic: What is a Data Warehouse? Prism Solutions. Volume 1.
Wikipedia
Tutorialspoint.com
