A Seminar Report
Submitted by
Ayush Barnawal
13BTCSE007
in partial fulfillment for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE & ENGINEERING
At
ABSTRACT
Different people have different definitions for a data warehouse. The most popular definition
came from Bill Inmon, who provided the following:
"A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile collection of data in support of management's
decision making process."
Ralph Kimball states that a data warehouse is "a copy of transaction data specifically
structured for query and analysis." A data warehouse is a repository of an organization's
electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
This definition of the data warehouse focuses on data storage. However, the means to retrieve
and analyze data are also part of a data warehousing system, and many references to data
warehousing use this broader context. Thus, an expanded definition
for data warehousing includes business intelligence tools, tools to extract, transform, and load
data into the repository, and tools to manage and retrieve metadata.
A data warehouse can be normalized or denormalized. It can be a relational database,
multidimensional database, flat file, hierarchical database, object database, etc. Data
warehouse data often gets changed. And data warehouses often focus on a specific activity or
entity. Of course, if you want to define every user as a decision maker and all activities as
decision making processes, then my assertion is false. But in my experience, the
overwhelming uses of data warehouses are for quite mundane, non-decision making purposes
rather than for grist for making decisions with wide ranging effects (so-called "strategic"
decisions). In fact, I would assert that most data warehouses are used for post-decision
monitoring of the effects of decisions or, as some people might say, for "operational" issues.
By the way, this is not saying that using data warehousing in the decision making process is
not a wonderful, potentially high return effort. But my caution is that though the trade press,
vendors, and many industry experts trumpet the role of data warehousing vis-à-vis decision
making, in reality we do not now have nor will we ever have a clear understanding of
decision making.
To tackle key issues such as multimedia data indexing, similarity measures, search
methods and query processing in retrieval from large multimedia data archives, we extend the
concepts of the conventional data warehouse and the multimedia database to a multimedia data
warehouse for effective data representation and storage. Data mining techniques help the
authorities to view the data in the required form. Data warehousing is a collection of
decision support technologies aimed at enabling the knowledge worker to make faster and
better decisions.
Data warehousing and data mining are technologies that deliver critical and optimally useful
information to facilitate performance analysis of business organizations. These technologies
are not only an emerging trend in information technology but also a booming market in a
range of industries.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
1. INTRODUCTION
1.1 History
1.2 Definition
2. LITERATURE REVIEW
2.1 EVOLUTION IN ORGANIZATION USE
2.3 ARCHITECTURE
CONCLUSION
APPENDICES
REFERENCES
ACKNOWLEDGEMENT
I would like to take this opportunity to express my gratitude to the following people
who have directly or indirectly helped me during the Seminar.
I would like to express my most sincere gratitude to my Seminar coordinator Mrs. Mudita
Shrivastava for her invaluable advice and patience throughout the course of this Seminar.
Without her guidance, this seminar would have been an uphill task.
Last but not least, I am also grateful to all my fellow classmates and seniors for their
continuous support and for the invaluable opinions, comments and ideas they contributed to this
seminar.
1. INTRODUCTION
1.1 HISTORY
The concept of data warehousing dates back to the late 1980s when IBM
researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In
essence, the data warehousing concept was intended to provide an architectural model for the
flow of data from operational systems to decision support environments.
The concept attempted to address the various problems
associated with this flow, mainly the high costs associated with it. In the absence of a data
warehousing architecture, an enormous amount of redundancy was required to support
multiple decision support environments. In larger corporations it was typical for multiple
decision support environments to operate independently. Though each environment served
different users, they often required much of the same stored data. The process of gathering,
cleaning and integrating data from various sources, usually from long-term existing
operational systems (usually referred to as legacy systems), was typically in part replicated
for each environment. Moreover, the operational systems were frequently reexamined as new
decision support requirements emerged. Often new requirements necessitated gathering,
cleaning and integrating new data from "data marts" that were tailored for ready access by
users.
From this idea, the data warehouse was born as a place where relevant data could be
held for completing strategic reports for management. The key here is the word 'strategic', as
most executives were less concerned with day-to-day operations than they were with a
broader view of the business model and business functions. As with all technology, over the
course of the latter half of the 20th century, the number and types of databases increased.
Many large businesses found themselves with data scattered across multiple
platforms and variations of technology, making it almost impossible for any one individual to
use data from multiple sources. A key idea within data warehousing is to take data from
multiple platforms and technologies (as varied as spreadsheets, DB2 databases, IDMS records,
and VSAM files) and place them in a common location that uses a common querying tool. In
this way operational databases could be held on whatever system was most efficient for the
operational business, while the reporting and strategic information could be held in a common
location using a common language. Data warehouses take this a step further by giving
the data itself commonality, defining what each term means and keeping it standard. (An
example of this would be gender, which can be referred to in many ways but should be
standardized in a data warehouse with one common way of referring to each sex.) All of this
was designed to make decision support more readily available without affecting day-to-day
operations. One aspect of a data warehouse that should be stressed is that it is NOT a
location for ALL of a business's data, but rather a location for data that is 'interesting'. Data
that is interesting will assist decision makers in making strategic decisions relative to the
organization's overall mission.
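The gender example above can be illustrated with a minimal sketch. The source spellings, the warehouse codes "M", "F" and "U", and the function name standardize_gender are assumptions chosen for illustration, not part of any particular product.

```python
# Hypothetical mapping from source-system spellings to one warehouse code.
GENDER_CODES = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "2": "F",
}


def standardize_gender(raw):
    """Return the warehouse code for a raw source value, or "U" if unknown."""
    if raw is None:
        return "U"
    return GENDER_CODES.get(str(raw).strip().lower(), "U")


# Values as they might arrive from different operational systems.
source_values = ["Male", "F", " female ", 1, None, "n/a"]
standardized = [standardize_gender(value) for value in source_values]
print(standardized)
```

Unknown values are mapped to a single "U" code here rather than rejected; a real project would decide this per its own data quality rules.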
1.2 DEFINITION
The data warehouse makes an attempt to figure out "what we need", before we know we need
it. Following the terms of Inmon's definition:
Subject-oriented: it focuses on the modeling and analysis of data for decision makers, not on
daily operations or transaction processing.
Time-variant: warehouse data carries a time element, whereas the key of operational data may
or may not contain one.
Non-volatile: operational update of data does not occur in the data warehouse environment.
2. LITERATURE REVIEW
sales data or production data, yet another system for finance and budgeting data etc. In
practice, these systems are often poorly or not at all integrated and simple questions like:
"How much time did salesperson A spend on customer C? How much did we sell to
customer C? Was customer C happy with the provided service? Did customer C pay his bills?"
can be very hard to answer, even though the information is available "somewhere" in the
different data systems.
Another problem is that ERP systems are designed to support operations. For
example, a finance system might keep track of every single stamp bought: when it was
ordered, when it was delivered and when it was paid for. The system might also apply accounting
principles (like double-entry bookkeeping) that further complicate the data model. Such
information is great for the person in charge of buying stamps or the accountant trying to
sort out an irregularity, but the CEO is not interested in such detailed information.
The CEO wants to know things like "What's the cost?", "What's the revenue?" and "Did our latest
initiative reduce costs?".
Yet another problem might be that the organization is internally in disagreement about
which data is correct. For example, the sales department might have one view of its costs,
while the finance department has another view of those costs. In such cases the organization can
spend a great deal of time discussing who has the correct view of the data. It is partly the purpose
of data warehousing to bridge such problems. It is important to note that in data warehousing
the source data systems are considered as given. Suppose the CRM system identifies a person by
initials, the Employee-Time-Management system identifies a person by full name, and
the ERP system identifies a person by social security number, and a person can change
his name, so that records do not match up. It is not the task of the data warehousing consultant
to conclude that the organization should invest in and implement one or two
new systems to handle CRM, ERP etc. in a more consistent manner.
Rather, the data warehousing consultant is charged with making the data appear consistent,
integrated and consolidated despite the problems in the underlying source systems. The data
warehousing consultant achieves this by employing different data warehousing
techniques, creating one or more new data repositories (i.e. the data warehouse) whose
data model(s) support the needed reporting and analysis.
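One common way to make such data appear integrated is a cross-reference table that links each source system's identifier to a single warehouse key. The sketch below assumes a hand-maintained table; the field names, the sample people and the placeholder ERP numbers are all hypothetical.

```python
# Hypothetical cross-reference table kept in the warehouse: each row links the
# identifiers that the three source systems use for the same person.
PERSON_XREF = [
    {"person_key": 1, "crm_initials": "JD", "time_full_name": "Jane Doe", "erp_number": "000-00-0001"},
    {"person_key": 2, "crm_initials": "RS", "time_full_name": "Raj Singh", "erp_number": "000-00-0002"},
]


def build_lookup(xref, source_field):
    """Map one source system's identifier to the shared warehouse key."""
    return {row[source_field]: row["person_key"] for row in xref}


crm_lookup = build_lookup(PERSON_XREF, "crm_initials")
time_lookup = build_lookup(PERSON_XREF, "time_full_name")

# Sample extracts: CRM visits keyed by initials, time entries keyed by full name.
crm_visits = [("JD", "Customer C"), ("RS", "Customer C"), ("JD", "Customer D")]
time_entries = [("Jane Doe", 3.5), ("Raj Singh", 2.0), ("Jane Doe", 1.5)]

hours_by_person = {}
for full_name, hours in time_entries:
    key = time_lookup[full_name]
    hours_by_person[key] = hours_by_person.get(key, 0.0) + hours

customers_by_person = {}
for initials, customer in crm_visits:
    key = crm_lookup[initials]
    customers_by_person.setdefault(key, []).append(customer)

print(hours_by_person)
print(customers_by_person)
```

Once both extracts share person_key, questions that span systems can be answered without changing the source systems themselves.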
With the increase in business competition, there is a need to obtain and analyze business
data faster. A lot of important business data is in operational database systems. However,
these systems are not designed for business data analysis for the following reasons:
The data model is normalized for transaction speed and not for data analysis.
There is no cross-reference information between data from the different operational
databases, e.g. between the financial and operational databases.
Historical data needed for trend analysis may not be found in the operational database.
A data warehouse can provide easy information access for business people to increase revenue,
profit, customer satisfaction, savings, and market share. The system can be used by different
departments in the organization.
The development steps for a data warehouse project are similar to those for other information
systems. The following outlines some key steps during development. The outline is divided into
three sections: planning & design, building & testing, and roll out & maintenance.
Planning and Design
Business drivers
Objectives
User needs
Application orientation
Data sources
Data quality
Project risk
Budget plan
Time frame
Cost benefit analysis
Project team composition (DA, DBA, OLAP development, GUI
development, query development, report development, user training, network
management, system integration)
The logical and physical data model design depends on the data access and usage. At the
planning & design phase, the data model is just in preliminary design.
Building and Testing
HW, SW, transformation SW, middle ware, OLAP SW, system management SW
Prototyping
Some data extraction & transformation software is very useful for developing data
transformation routines. These tools are useful in both the construction and the
maintenance phases. In addition, system management software can control the data extraction
processes that move data from other database systems to the data warehouse.
Roll out & Maintenance
System growth
Performance management
System maintenance
Security
Backup, recovery
Update data
Risk management is important to the success of a data warehouse project. Some of the project
risks are:
Technology risk: e.g. technology new to the marketplace, technology new to the
organization, coexistence of multiple technologies, etc.
Complexity risk: e.g. complex data model and database processes, business process
change, mission critical requirements, large number of installations, distributed system,
data re-modeling required for legacy systems, etc.
Integration risk: e.g. integration with other information systems, real time requirements
for the interfaces, etc.
Project team risk: e.g. team member experience, business user involvement, etc.
2.3 ARCHITECTURE
Architecture, in the context of an organization's data warehousing efforts, is a
conceptualization of how the data warehouse is built. There is no right or wrong architecture,
but rather there are multiple architectures that exist to support various environments and
situations. The worthiness of the architecture can be judged from how the conceptualization
aids in the building, maintenance, and usage of the data warehouse.
One possible simple conceptualization of a data warehouse architecture consists of the
following interconnected layers:
Operational Database layer
The source data for the data warehouse. An organization's Enterprise
Resource Planning systems fall into this layer.
Data Access layer
The interface between the operational and informational access layers. Tools
to extract, transform and load data into the warehouse fall into this layer.
Metadata layer
The data directory. This is usually more detailed than an operational system
data directory. There are dictionaries for the entire warehouse and sometimes
dictionaries for the data that can be accessed by a particular reporting and analysis
tool.
Informational Access layer
The data accessed for reporting and analysis, and the tools for reporting and
analyzing data. Business intelligence tools fall into this layer. The Inmon-Kimball
differences about design methodology also have to do with this layer.
A data warehouse reads source data from different database systems in the organization. The
source databases are usually operational databases. The following is one possible data
warehouse logical architecture:
The data warehouse reads data from multiple operational databases. The data is cleaned,
transformed, or aggregated. The data is then either updated or inserted into the data warehouse,
depending on the trend analysis requirement. In addition, cross-reference data is generated
based on the new data; for example, accounting data from the accounting database needs to
be cross-referenced with the facility data from the facility database.
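The update-or-insert choice described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names cost_current and cost_history, the sample records and the keep_history flag are hypothetical. Rows are inserted when history is needed for trend analysis and updated in place otherwise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cost_current (facility_id TEXT PRIMARY KEY, cost REAL);
    CREATE TABLE cost_history (facility_id TEXT, load_date TEXT, cost REAL);
""")


def clean(record):
    """Clean one source record: trim and upper-case the key, parse the cost."""
    return record["facility_id"].strip().upper(), float(record["cost"])


def load(records, load_date, keep_history):
    for record in records:
        facility_id, cost = clean(record)
        if keep_history:
            # Trend analysis needs every snapshot, so always insert a new row.
            conn.execute(
                "INSERT INTO cost_history VALUES (?, ?, ?)",
                (facility_id, load_date, cost),
            )
        else:
            # Only the latest value matters: update, or insert if not present.
            updated = conn.execute(
                "UPDATE cost_current SET cost = ? WHERE facility_id = ?",
                (cost, facility_id),
            ).rowcount
            if updated == 0:
                conn.execute(
                    "INSERT INTO cost_current VALUES (?, ?)", (facility_id, cost)
                )


january = [{"facility_id": " f1 ", "cost": "100.0"}]
february = [{"facility_id": "F1", "cost": "120.0"}]
for batch, date in ((january, "2024-01-31"), (february, "2024-02-29")):
    load(batch, date, keep_history=True)
    load(batch, date, keep_history=False)

history_rows = conn.execute(
    "SELECT facility_id, load_date, cost FROM cost_history ORDER BY load_date"
).fetchall()
current_rows = conn.execute("SELECT facility_id, cost FROM cost_current").fetchall()
print(history_rows)
print(current_rows)
```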
Depending on the data model, the amount of data, and the particular query, performance can
be a problem for a data warehouse system. In the data warehouse, some tables can contain
millions of entries. Query operations on these tables can take a long time, for example when a
query performs aggregate operations to summarize the data. Also, if the query needs many join
operations or sub-queries, the performance will be even slower. These long queries can be
run overnight in order to minimize the performance impact on end users. Some of
these long queries can be sped up by modifying the data model or tuning the database.
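One way to move an expensive aggregate query out of working hours is to precompute its result into a summary table. The sketch below uses sqlite3 with a hypothetical sales table; the table names, columns and the rebuild_summary function are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "2024-01-05", 10.0),
        ("North", "2024-01-06", 15.0),
        ("South", "2024-01-05", 7.5),
    ],
)


def rebuild_summary(connection):
    """Overnight job: recompute per-region totals into a small summary table."""
    connection.execute("DROP TABLE IF EXISTS sales_by_region")
    connection.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total_amount, COUNT(*) AS sale_count "
        "FROM sales GROUP BY region"
    )


rebuild_summary(conn)

# Daytime queries read the small summary table instead of scanning all sales.
summary = conn.execute(
    "SELECT region, total_amount, sale_count FROM sales_by_region ORDER BY region"
).fetchall()
print(summary)
```

The trade-off is freshness: the summary reflects the data as of the last rebuild, which is usually acceptable for reporting.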
Scalability is an important consideration in choosing software, hardware, and system
architecture for the data warehouse. Both the database size and the number of users for the
data warehouse can increase substantially over time. The software and hardware must be
scalable to support the new requirements.
There are different types of database management systems, such as relational database systems,
object-oriented database systems, hierarchical database systems, etc. A relational database is
usually the choice for implementing a data warehouse for the following reasons:
Relational databases are the most commonly used database systems in the commercial
environment. Many developers already have experience with relational database
products. This reduces the learning curve for the developers.
Most operational database systems are built on relational databases. If the
data warehouse is also built on a relational database, the data conversion
process between the operational database and the data warehouse can be simpler.
Also, there are many database products that enable direct data transfer between
relational database systems.
Relational databases are more mature than other types of database system in terms of
scalability, stability, and efficiency.
Relational databases have fewer proprietary functions than other database systems. This
increases the degree of platform independence.
Object-oriented databases are sometimes used in database applications because they have a richer
set of constructs to represent the data model. For example, a hierarchical data structure can be
represented better than in a relational database. Object-oriented databases provide better
integration between data and functions. Therefore an object-oriented database is good for
applications that have both complex data structures and functions (e.g. CAD applications,
simulation applications).
The data can be extracted into an ASCII report file. The file can be in fixed-width or
CSV format. The ASCII report file is generated through a standard report function on
the operational database system. In some situations, a custom report function is
developed.
The data can be extracted directly from the source database system. The source
database can create a single database view that contains all the necessary information.
With this database view, the data transformation process can directly request the
information and load the data into the data warehouse.
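As a small sketch of the first option, the snippet below reads the same employee extract from a CSV report and from a fixed-width report. The report contents and the column layout in FIXED_LAYOUT are made-up assumptions; a real layout would come from the report definition on the operational system.

```python
import csv
import io

# Hypothetical report contents; real files would be read from disk.
employees = [("101", "Asha", "52000"), ("102", "Ben", "48000")]
csv_report = "emp_id,name,salary\n" + "\n".join(",".join(e) for e in employees) + "\n"
fixed_report = "\n".join(
    f"{emp_id:<3}{name:<10}{salary:>5}" for emp_id, name, salary in employees
)

# Assumed fixed-width layout: (field name, start column, end column).
FIXED_LAYOUT = [("emp_id", 0, 3), ("name", 3, 13), ("salary", 13, 18)]


def read_csv_report(text):
    """Parse a CSV report whose first line holds the column names."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_fixed_report(text, layout):
    """Slice each line by column position and strip the padding."""
    return [
        {name: line[start:end].strip() for name, start, end in layout}
        for line in text.splitlines()
    ]


csv_rows = read_csv_report(csv_report)
fixed_rows = read_fixed_report(fixed_report, FIXED_LAYOUT)
print(csv_rows)
print(fixed_rows)
```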
Data loading process can have errors. The problems can be data referential integrity error,
data format error, data range error, or other data quality errors. In these situations, the source
data has to be modified before it can be loaded into the data warehouse.
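The error types listed above can be checked before loading. The following is a minimal sketch, assuming hypothetical field names (dept_id, hire_date, salary), a made-up set of known department keys and an arbitrary salary range.

```python
from datetime import datetime

# Hypothetical department keys that already exist in the warehouse.
KNOWN_DEPARTMENTS = {"D10", "D20"}


def validate(row):
    """Return a list of problems found in one source row (empty if clean)."""
    problems = []
    if row["dept_id"] not in KNOWN_DEPARTMENTS:
        problems.append("referential integrity: unknown dept_id")
    try:
        datetime.strptime(row["hire_date"], "%Y-%m-%d")
    except ValueError:
        problems.append("format: hire_date is not YYYY-MM-DD")
    if not 0 <= row["salary"] <= 1_000_000:
        problems.append("range: salary out of bounds")
    return problems


rows = [
    {"dept_id": "D10", "hire_date": "2020-03-01", "salary": 50000},
    {"dept_id": "D99", "hire_date": "01/03/2020", "salary": -5},
]

# Keep only rows with problems, keyed by their position in the extract.
rejected = {i: validate(row) for i, row in enumerate(rows) if validate(row)}
print(rejected)
```

Rejected rows would then be corrected at the source or in a staging area before another load attempt.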
Depending on the data source, the source database system may need to be re-modeled in
order to produce the required data for the data warehouse. This data re-modeling work can
take a lot of time.
Data modeling is a creative process and there can be different modeling solutions for the same
set of data. The purpose of data modeling is to organize data to meet business objectives and
to provide good performance for database operations.
Metadata is important information in data modeling. It is the information about the data
model. Take "$5.64 sales amount" as an example: without metadata, the data shows as "5.64" and we
don't know what it means. Metadata captures business rules for data such as data name,
description, value range, data version, data source, and referential integrity information. The
organization of metadata can be separated into a technical level and a business level. The
following table describes the information to be stored in the metadata repository.
Technical Level
Business Level
Dependencies
Data security
Data ownership
Metadata can be used as a semantic layer for users to navigate through the data warehouse
without having to understand the complex physical data structure. Some metadata can be
extracted from the database management system or the data modeling CASE tool.
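The "$5.64 sales amount" example can be sketched as a metadata record attached to a column. The ColumnMetadata class, its fields and the sample values are illustrative assumptions rather than the layout of any real metadata repository.

```python
from dataclasses import dataclass


@dataclass
class ColumnMetadata:
    """Hypothetical metadata entry for one warehouse column."""
    name: str
    description: str
    unit_prefix: str
    min_value: float
    max_value: float
    source: str


sales_amount = ColumnMetadata(
    name="sales_amount",
    description="sales amount",
    unit_prefix="$",
    min_value=0.0,
    max_value=1_000_000.0,
    source="orders database (hypothetical)",
)


def describe(value, meta):
    """Turn a bare number into a readable value using its metadata."""
    if not meta.min_value <= value <= meta.max_value:
        raise ValueError(f"{meta.name} is outside its documented range")
    return f"{meta.unit_prefix}{value:.2f} {meta.description}"


label = describe(5.64, sales_amount)
print(label)
```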
In the above figure, the dimension tables are the Student table, Instructor table, Course table,
and Semester table. Information in these dimension tables is relatively static over time. Each star
schema can have multiple dimension tables. There is only one fact table for each star schema.
The fact table in figure 3 is the Attendance table. It contains foreign keys
to the four dimension tables, and its primary key is the composite of those four foreign
keys. Since the fact table contains transaction-type information and the dimension tables
contain relatively static information, the amount of data in the fact table is far greater than
the amount of data in the dimension tables.
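The star schema described above can be sketched in SQL through Python's sqlite3 module. The table names follow the text; the column names, the days_attended measure and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Four dimension tables with relatively static descriptive data.
    CREATE TABLE student    (student_id    INTEGER PRIMARY KEY, name  TEXT);
    CREATE TABLE instructor (instructor_id INTEGER PRIMARY KEY, name  TEXT);
    CREATE TABLE course     (course_id     INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE semester   (semester_id   INTEGER PRIMARY KEY, label TEXT);

    -- One fact table whose primary key is the composite of the foreign keys.
    CREATE TABLE attendance (
        student_id    INTEGER REFERENCES student(student_id),
        instructor_id INTEGER REFERENCES instructor(instructor_id),
        course_id     INTEGER REFERENCES course(course_id),
        semester_id   INTEGER REFERENCES semester(semester_id),
        days_attended INTEGER,  -- hypothetical measure column
        PRIMARY KEY (student_id, instructor_id, course_id, semester_id)
    );

    INSERT INTO student    VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO instructor VALUES (1, 'Dr. Rao');
    INSERT INTO course     VALUES (1, 'Databases');
    INSERT INTO semester   VALUES (1, 'Fall'), (2, 'Spring');
    INSERT INTO attendance VALUES (1, 1, 1, 1, 40), (2, 1, 1, 1, 35), (1, 1, 1, 2, 38);
""")

# A typical star query: join the fact table to dimensions and aggregate.
attendance_by_semester = conn.execute("""
    SELECT c.title, s.label, SUM(a.days_attended)
    FROM attendance a
    JOIN course c   ON c.course_id = a.course_id
    JOIN semester s ON s.semester_id = a.semester_id
    GROUP BY c.title, s.label
    ORDER BY s.label
""").fetchall()
print(attendance_by_semester)
```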
The above data model can, for example, provide the following query result:
The Student table is normalized to contain foreign keys to the Major and Minor tables. The
relationship between the Student table and the Major table is many-to-one. In other situations,
if the relationship is many-to-many, this will create a chain of tables for the dimension table. This
makes the data model more difficult for the end user to use and understand. Therefore, the use
of a snowflake schema can decrease browsing performance.
In addition, the storage space saved in the dimension tables is not significant in comparison to
the size of the fact table. The fact table is usually many times larger than the dimension tables.
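The extra join that a snowflake schema introduces can be seen in a small sqlite3 sketch. The Student and Major tables follow the text; the column names, the simplified attendance fact table and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The Student dimension is normalized: major details live in their own table.
    CREATE TABLE major (major_id INTEGER PRIMARY KEY, major_name TEXT);
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT,
        major_id   INTEGER REFERENCES major(major_id)
    );
    -- Simplified fact table (hypothetical columns).
    CREATE TABLE attendance (student_id INTEGER, days_attended INTEGER);

    INSERT INTO major      VALUES (1, 'Computer Science'), (2, 'Physics');
    INSERT INTO student    VALUES (1, 'Asha', 1), (2, 'Ben', 1), (3, 'Chen', 2);
    INSERT INTO attendance VALUES (1, 40), (2, 35), (3, 30);
""")

# Reporting by major needs two joins: fact -> student -> major.
attendance_by_major = conn.execute("""
    SELECT m.major_name, SUM(a.days_attended)
    FROM attendance a
    JOIN student s ON s.student_id = a.student_id
    JOIN major m   ON m.major_id = s.major_id
    GROUP BY m.major_name
    ORDER BY m.major_name
""").fetchall()
print(attendance_by_major)
```

In a star schema the major name would sit directly in the Student table, so the same report would need only one join.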
Employee attribute   View A   View B
First Name           Yes      Yes
Last Name            Yes      Yes
Department           Yes      Yes
Position             Yes      Yes
Phone Number         Yes      Yes
Age                  Yes      No
Salary               Yes      No
The Employee table has both View A and View B. View A can access all attributes in the
Employee table. View B can access all attributes except Age and Salary. View A
is used by managers in the company. View B is used by all other users.
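The two views can be sketched as database views over one table. The example uses sqlite3 with assumed column names and a made-up employee row. SQLite has no user accounts, so the sketch only shows which columns each view exposes; granting each view to the right group of users would be done in the database system actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (
        first_name TEXT, last_name TEXT, department TEXT, position TEXT,
        phone_number TEXT, age INTEGER, salary REAL
    );
    INSERT INTO employee
    VALUES ('Asha', 'Verma', 'Sales', 'Analyst', '555-0100', 30, 52000);

    -- View A: every attribute, intended for managers.
    CREATE VIEW view_a AS SELECT * FROM employee;

    -- View B: everything except age and salary, intended for all other users.
    CREATE VIEW view_b AS
        SELECT first_name, last_name, department, position, phone_number
        FROM employee;
""")


def column_names(view):
    """List the columns a view exposes (view name is trusted, not user input)."""
    cursor = conn.execute(f"SELECT * FROM {view}")
    return [column[0] for column in cursor.description]


view_a_columns = column_names("view_a")
view_b_columns = column_names("view_b")
print(view_a_columns)
print(view_b_columns)
```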
CONCLUSION
Since the primary task of management is effective decision making, the primary task of
research, and subsequently of data warehouses, is to generate accurate information for use in
that decision making.
It is imperative that an organization's data warehousing strategies reflect
changes in the internal and external business environment in addition to the direction in
which the business is traveling. A data warehouse is a good solution for storing and analyzing
large amounts of data. It reads data from multiple operational databases on an ongoing basis.
Cross-reference information is generated between the data from the different databases. The
data model is designed to provide good browsing performance to the end user. A data
warehouse can be seen as a centralized data repository that provides both current and historical
data to the end user.
Playing an integral role in the growth, development and success of an organization, data
warehouses facilitate meaningful research, which in turn facilitates effective management.
APPENDICES
OLAP: Online Analytical Processing
DA: Data Extraction
CEO: Chief Executive Officer
DW: Data Warehouse
HW: Hardware
SW: Software
GUI: Graphical User Interface
REFERENCES
DCI (1997). The Roadmap for Data Warehouse Implementation.
Kimball, R. (1996). The Data Warehouse Toolkit. John Wiley & Sons, Inc.
Inmon, W.H. Tech Topic: What is a Data Warehouse? Prism Solutions, Volume 1.
Wikipedia
Tutorialspoint.com