
Data Warehousing and Data Mining

Bijith Varghese Skariah

School of Management Studies
CUSAT, Kochi – 22
E-mail: bijith.vs@gmail.com

Abstract: This paper presents the basic concepts of data warehousing and
data mining, emphasizing both technical and managerial issues and the
implications of these emerging technologies for those issues.

Keywords: Data Warehouse, Data Mart, Data Warehouse Architecture, Metadata

1.0 INTRODUCTION

The never-ending cycle of data generation followed by the increased need to store
that data has been a challenge faced by information systems professionals for
decades. In addition, we are becoming increasingly aware of the hidden treasure
trove of new knowledge quietly residing in our data and face considerable
frustration when we attempt to get at it. This constant cycle of data generation,
storage, and difficulty in retrieval and analysis has resulted in the development of
new and powerful tools to assist us in meeting this challenge.

These tools are the data warehouse and data mining.

2.0 THE MODERN DATA WAREHOUSE

A data warehouse is a copy of transaction data specifically structured for querying,
analysis, and reporting.

This definition tells us exactly what a data warehouse is. First, it is a
database containing a copy of transaction data. Because the data warehouse
contains a copy of transaction data rather than the actual record generated by the
original transaction, the data is not subject to update or change
once it is committed to the data warehouse. The data in a data
warehouse is therefore static: once entered, it remains unaltered.
Hence data warehouses generally do not get smaller, but rather grow to enormous
proportions.

Second, the copy of transaction data is specifically structured. Thus data committed
to the data warehouse is transformed to conform to a specific structure such that
transactional data from a variety of sources can reside in the data warehouse using
this specified structure.

Finally, our definition tells us what the purpose of the data warehouse is – querying,
analysis, and reporting. The data warehouse becomes a central repository for all
organizational data deemed useful for the exploration of new relationships, trends
and hidden values.
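
As a rough illustration of the querying and reporting purpose, the sketch below builds a tiny in-memory table with Python's standard sqlite3 module and runs one reporting query against it. The table name sales_fact, its columns, and the rows are hypothetical and chosen only for illustration; a real data warehouse would hold far more data in a dedicated system.

```python
import sqlite3

# Hypothetical warehouse table; names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_date TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        ("2009-01-15", "North", 120.0),
        ("2009-01-20", "South", 80.0),
        ("2009-02-03", "North", 200.0),
        ("2009-02-11", "South", 50.0),
    ],
)

# A reporting query: total sales per region, largest total first.
report = conn.execute(
    "SELECT region, SUM(amount) FROM sales_fact "
    "GROUP BY region ORDER BY SUM(amount) DESC"
).fetchall()
conn.close()
```

The query only reads the stored rows, which matches the idea that warehouse data is static once loaded.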

2.1 Primary Functions of Data Warehouse

1. It is a direct reflection of the various business rules of the enterprise.
2. It is the collection point for the integrated, subject-oriented strategic
information.
3. It is the historical store of strategic information.
4. It is the source of information that is subsequently delivered to the data marts.
5. It is the source of stable data.

2.2 Advantages of Data Warehouse

1. Immediate information delivery.
2. Data integration from across and even outside the organization.
3. Future vision from historical trends.
4. It provides its users with tools for looking at and manipulating data in different
ways.
5. It is more user-friendly than other databases.

2.3 Disadvantages of Data Warehouse

1. It cannot create data of its own.
2. It can identify where data problems exist but cannot correct them. The
correction should be made at the source of the data.

3.0 DATA MARTS

The data mart is nothing more than a smaller, more focused data warehouse. In
many cases, organizations find it useful to create data marts for specific business
units that have equally specific data analysis needs. Although the larger data
warehouse could support those needs, the enormous bulk of data contained within
a typical data warehouse could reduce the efficiency of a consistently focused data
analysis effort.

3.1 Primary Functions of Data Mart

1. It is a reflection of the business rules of a specific function or business unit.
2. It obtains its data from the organizational data warehouse.
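
The second function can be pictured as a filter over warehouse rows. The following is a minimal sketch, assuming the warehouse is a list of dictionaries with a hypothetical unit field; the helper build_data_mart and all row values are invented for illustration.

```python
# Hypothetical warehouse rows; the field names are assumptions for illustration.
warehouse = [
    {"unit": "marketing", "metric": "campaign_cost", "value": 500},
    {"unit": "finance", "metric": "invoice_total", "value": 1200},
    {"unit": "marketing", "metric": "leads", "value": 40},
]


def build_data_mart(rows, unit):
    """Return copies of the warehouse rows that belong to one business unit."""
    return [dict(row) for row in rows if row["unit"] == unit]


marketing_mart = build_data_mart(warehouse, "marketing")
```

Copying the rows keeps the mart separate from the warehouse, so focused analysis on the smaller set does not touch the larger store.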

[Figure: Data Acquisition, Data Warehouse, and Data Delivery, with Enterprise Data Management]

Fig. 1: Position of the Data Warehouse Within the Organization

Fig. 2: Data Warehouse

4.0 DATA WAREHOUSE ARCHITECTURE

The major components of a data warehouse are:

• Summarized data
• Operational systems of record
• Integration / Transformation programs
• Current detail
• Data warehouse architecture or metadata
• Archives

[Figure: Metadata, Summarized Data, Current Detail, Archives, Integration/transformation programs, Operational systems of record]

Fig. 3: Components of Data Warehouse


4.1 Summarized Data

Summarized data falls into two classes: lightly summarized and highly
summarized. Lightly summarized data are the hallmark of a data warehouse. Not all
enterprise elements (department, region, function, etc.) have the same
information requirements, so effective data warehouse design provides
customized, lightly summarized data for every enterprise element. Highly
summarized data are primarily for enterprise executives. They can come from either
lightly summarized data or current detail.
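
The two levels of summarization can be sketched as two aggregation passes. This is only an illustration under assumed data: the rows, the (region, department, amount) layout, and the helper summarize are hypothetical.

```python
from collections import defaultdict

# Hypothetical current-detail rows: (region, department, amount).
detail = [
    ("North", "Sales", 100),
    ("North", "Sales", 150),
    ("North", "Service", 70),
    ("South", "Sales", 90),
]


def summarize(rows, key):
    """Sum the amount column, grouped by whatever the key function returns."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)


# Lightly summarized: one total per region and department.
light = summarize(detail, key=lambda r: (r[0], r[1]))

# Highly summarized: one enterprise-wide total, derived from the light summary.
high = sum(light.values())
```

Here the highly summarized figure is derived from the lightly summarized data; it could equally be computed from the detail rows directly.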

4.2 Current Detail

The heart of a data warehouse is its current detail, where the bulk of data
resides. Current detail comes directly from operational systems and may be
stored as raw data or as aggregations of raw data. Current detail, organized by
subject area, represents the entire enterprise, rather than a given application.
Current detail is the lowest level of data granularity (size of unit represented by
data) in the data warehouse. Every data entity in current detail is a snapshot, at
a moment in time, representing the instant at which the data are accurate.
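
One way to picture a snapshot is as an immutable record that carries its own timestamp. The sketch below assumes a hypothetical Snapshot class and an append-only list standing in for current detail; none of these names come from an actual product.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    """One data entity as it stood at a given moment (hypothetical layout)."""
    subject: str
    key: str
    value: float
    as_of: datetime


current_detail = []  # append-only: new snapshots are added, old ones never edited


def record(subject, key, value, as_of):
    """Store a new snapshot instead of overwriting an earlier one."""
    snap = Snapshot(subject, key, value, as_of)
    current_detail.append(snap)
    return snap


first = record("customer", "C-17", 250.0, datetime(2009, 9, 1))
second = record("customer", "C-17", 310.0, datetime(2009, 9, 14))
```

Because the dataclass is frozen, an attempt such as first.value = 0.0 raises FrozenInstanceError, so a changed value is captured as a second snapshot rather than an update.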

4.3 System of Record

A system of record is the source of the data that feed the data warehouse. Data
in a data warehouse differ from operational systems data in that they can only
be read, not modified. Thus, it is necessary that a data warehouse be populated
with the highest quality data available, i.e., data that are most timely, complete,
accurate, and have the best structural conformance to the data warehouse.
Often these data are closest to the source of entry.

4.4 Integration and Transformation Programs

Even the highest quality operational data usually cannot be copied as is into
a data warehouse. Raw operational data are virtually unintelligible to most
information users, and they seldom conform to the logical, subject-oriented
structure of a data warehouse. Further, different operational systems
represent data differently and use different codes for the same thing. Most
operational data may also be stored and managed redundantly and may even reside
in many different physical sources. Hence operational data must be
cleaned up, edited, and reformatted before being loaded into a data warehouse.
As operational data items pass from their systems of record to a data
warehouse, integration and transformation programs convert them from
application-specific data into enterprise data.
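
A small sketch of such a transformation step follows. It assumes two hypothetical source systems (billing and crm) that encode the same attribute with different codes; the mapping table, field names, and records are invented for illustration.

```python
# Hypothetical mapping: different source systems use different codes for the same thing.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "2": "F"}


def transform(record, source):
    """Convert one application-specific record into a common enterprise format."""
    raw = str(record["gender"]).strip().lower()
    return {
        "customer_id": str(record["id"]).strip(),     # clean up stray whitespace
        "gender": GENDER_CODES.get(raw, "UNKNOWN"),   # unify the codes
        "source_system": source,                      # keep track of the origin
    }


billing = [{"id": " 101 ", "gender": "male"}]
crm = [{"id": 102, "gender": 2}, {"id": 103, "gender": "x"}]

enterprise_rows = [transform(r, "billing") for r in billing]
enterprise_rows += [transform(r, "crm") for r in crm]
```

Values that cannot be mapped are marked UNKNOWN rather than guessed, which leaves the correction to the source system.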

4.5 Archives

Data warehouse archives contain old data of significant, continuing interest and
value to the enterprise. There is usually a massive amount of data stored in the
data warehouse archives, with low incidence of access. Archive data are most
often used for forecasting and trend analysis. Although archive data may be
stored with the same level of granularity as current detail, it is more likely that
archive data are aggregated as they are archived. Archives include not only old
data; they also include the metadata that describes the old data’s
characteristics.
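
Aggregating data as it is archived might look like the sketch below. It assumes rows of (date, amount), a hypothetical cutoff date, and monthly totals as the archive granularity; all of these are illustrative choices rather than a prescribed design.

```python
from collections import defaultdict
from datetime import date


def archive(rows, cutoff):
    """Keep recent rows as they are; roll older rows up into monthly totals."""
    keep, monthly = [], defaultdict(float)
    for day, amount in rows:
        if day < cutoff:
            monthly[(day.year, day.month)] += amount
        else:
            keep.append((day, amount))
    return keep, dict(monthly)


# Hypothetical detail rows: (transaction date, amount).
rows = [
    (date(2008, 3, 2), 10.0),
    (date(2008, 3, 20), 15.0),
    (date(2008, 4, 5), 7.5),
    (date(2009, 9, 1), 30.0),
]
current, archived = archive(rows, cutoff=date(2009, 1, 1))
```

The archived totals take less space than the original rows, at the cost of losing the day-level detail.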

4.6 Metadata

A separate data definition language must be implemented to provide a
meaningful description of the information contents. This description is usually
known as metadata, literally data about data.

4.7 Leading Data Warehouse Vendors and Products

1. HP – Intelligent Warehouse
2. IBM – DB2 Database Server, Enterprise Copy Manager
3. Microsoft – Microsoft SQL Server
4. NCR – Teradata
5. Oracle – Oracle8, Discoverer/2000
6. Siemens-Pyramid – Smart Warehouse

5.0 DATA MINING

Data mining is a collection of powerful data analysis techniques intended to
assist in analyzing extremely large datasets. Properly applied, data mining can
reveal hidden relationships and information buried within the organizational data
warehouse. There is no one data mining approach, but rather a set of
techniques that often can be used in combination with each other to extract the
most insight from a set of data.

In the early 1960s, data mining was referred to as ‘statistical analysis’. During
this period, the pioneers of statistical analysis were SAS, SPSS, and IBM.
Originally statistical analysis consisted of classical statistical routines such as
correlation, regression, chi-square, and cross tabulation.

By the late 1980s, classical statistical analysis was augmented by a more powerful
set of techniques with names such as fuzzy logic, heuristic reasoning, and
neural networks. This was the heyday of artificial intelligence (AI).

6.0 GENERAL APPROACH IN DATA MINING

Although all data mining endeavors are unique to their respective analyst and
problem under study, each possesses a set of commonalities with regard to the
process steps necessary to achieve a successful outcome. The steps are as
follows:

1. Infrastructure Preparation
2. Exploration
3. Analysis
4. Interpretation
5. Exploitation

6.1 Infrastructure Preparation

The first step in data mining is the identification and preparation of the
infrastructure. It is in the infrastructure that the actual data mining activity will
occur. The infrastructure contains at least:

• A hardware platform
• A database management system (DBMS) platform
• One or more tools for data mining

The hardware platform is separate from the one that originally
contained the data. For proper data mining, data must be removed from its host
environment and prepared through a thorough cleaning, or ‘scrubbing’.
The data must also be integrated, since it comes from multiple
sources. In addition, metadata is constructed. Metadata is simply data
about data: where it came from, how old it is, how it was captured, what unit it
represents, etc.
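
The metadata items listed above can be sketched as a small record type. The class ColumnMetadata, its field names, and the sample values are hypothetical and serve only to show the idea.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ColumnMetadata:
    """Data about one data item (hypothetical fields mirroring the list above)."""
    name: str
    source_system: str    # where it came from
    captured_on: date     # used to work out how old it is
    capture_method: str   # how it was captured
    unit: str             # what unit it represents

    def age_in_days(self, today):
        """Number of days between capture and the given date."""
        return (today - self.captured_on).days


meta = ColumnMetadata(
    name="order_amount",
    source_system="billing",
    captured_on=date(2009, 9, 1),
    capture_method="nightly batch extract",
    unit="INR",
)
age = meta.age_in_days(date(2009, 9, 14))
```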

6.2 Exploration

Next, data exploration commences. Some of the approaches to the
discovery of important relationships include:

• Analyzing summary data and “sniffing out” unusual occurrences and
patterns
• Sampling data and analyzing the samples to discover patterns that have
not been detected before
• Conducting simple random accesses to the data
• Heuristically searching for patterns
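
The first approach, spotting unusual occurrences in summary data, can be sketched with a simple z-score rule. This is one possible technique chosen for illustration, not the method prescribed by the text; the threshold of 2.0 and the weekly totals are assumptions.

```python
from statistics import mean, pstdev


def unusual(values, threshold=2.0):
    """Return the positions of values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical weekly sales totals taken from summary data.
weekly_totals = [100, 102, 98, 101, 99, 100, 180, 97]
flagged = unusual(weekly_totals)
```

A flagged position is only a starting point for exploration; whether it reflects a real pattern is a question for the analysis step.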

6.3 Analysis

Once a set of patterns has been discovered, each pattern must be analyzed.
Some patterns will not be statistically strong, whereas others may display high
statistical strength (i.e., significance). Even a pattern that is not strong
today may be of great interest if its strength increases over time.
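
One simple way to track the strength of a pattern over time is to compute a correlation coefficient per period and compare the results. The sketch below assumes the Pearson coefficient as the measure of strength and uses invented observations for two periods; other measures would work equally well.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical observations of the same two variables in two periods.
earlier = pearson([1, 2, 3, 4], [2, 1, 4, 3])
later = pearson([1, 2, 3, 4], [1, 2, 3, 5])
strengthening = later > earlier
```

With only four points per period the coefficients are far too unstable for real conclusions; the example shows the comparison, not a recommended sample size.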

A second consideration is whether the pattern
potentially represents a false positive, where a correlation between two or more
variables is statistically significant but arises purely by chance and is meaningless.
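
A short calculation shows why false positives are likely when many candidate patterns are tested. The sketch assumes independent tests, each on variables with no real relationship, at a significance level of 0.05; the function names are hypothetical. The Bonferroni threshold is one common, conservative correction.

```python
def expected_false_positives(num_tests, alpha):
    """Average number of chance 'discoveries' among unrelated variables."""
    return num_tests * alpha


def chance_of_any_false_positive(num_tests, alpha):
    """Probability that at least one independent test passes by chance."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha):
    """Stricter per-test level that keeps the overall error rate near alpha."""
    return alpha / num_tests


tests, alpha = 100, 0.05
expected = expected_false_positives(tests, alpha)
any_false = chance_of_any_false_positive(tests, alpha)
stricter = bonferroni_threshold(tests, alpha)
```

Under these assumptions, testing 100 unrelated pairs yields about five apparently significant correlations on average, so a significant result alone does not establish a meaningful pattern.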

A third consideration is whether a valid correlation of variables has any business
significance. A valid correlation between two variables that is not a false positive
does not necessarily have real business significance. It may be
irrelevant in the current business environment yet become relevant in the
future.

6.4 Interpretation

The next step is to interpret the patterns; without interpretation, the discovered
patterns would be useless. Interpreting the patterns requires a combination of
technical and business expertise. Some considerations in interpreting the
patterns include:

• The larger business cycles of the business
• The seasonality of the business
• The population to which the pattern is applicable
• The strength of the pattern and the ability to use the pattern as a basis for
predicting future behavior
• The size of the population the pattern applies to

6.5 Exploitation

Exploitation of a discovered pattern is both a business and a technical activity.
The easiest way to exploit a discovered pattern is to use the pattern
as a predictor of future behavior. Once the behavior pattern is determined for a
segment of the population served by a company, the pattern can be used as a
basis for prediction. Once the population is identified and the conditions under
which the behavior will predictably occur are defined, the business is in a
position to exploit the information. Some of the ways in which an organization can
exploit new patterns in the data are:

• Specific sales offers
• Packaging products to appeal to the predicted audience
• Introducing new products
• Pricing products in an unusual way
• Advertising to appeal to the predicted audience
• Delivering services and/or products creatively
• Presenting products and services to cater to the predicted audience
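
Using a pattern as a predictor can be sketched as a rule applied to the customer population. The pattern below (customers aged 25 to 34 who purchased within the last 30 days) is entirely hypothetical, as are the field names and customer records; it only shows how an identified segment and its conditions translate into a target list for a specific sales offer.

```python
def matches_pattern(customer):
    """Hypothetical discovered pattern: a segment and the conditions under which it responds."""
    in_segment = 25 <= customer["age"] <= 34
    recently_active = customer["days_since_purchase"] <= 30
    return in_segment and recently_active


# Hypothetical customer records.
customers = [
    {"id": "C-1", "age": 28, "days_since_purchase": 12},
    {"id": "C-2", "age": 45, "days_since_purchase": 5},
    {"id": "C-3", "age": 31, "days_since_purchase": 90},
    {"id": "C-4", "age": 25, "days_since_purchase": 30},
]

# Customers predicted to respond, and therefore worth a specific offer.
targets = [c["id"] for c in customers if matches_pattern(c)]
```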

