Abstract: This material presents the basic concepts of data warehousing and
data mining, emphasizing both the technical and managerial issues and the
implications of these modern emerging technologies for those issues.
The never-ending cycle of data generation followed by the increased need to store
that data has been a challenge faced by information systems professionals for
decades. In addition, we are becoming increasingly aware of the hidden treasure
trove of new knowledge quietly residing in our data and face considerable
frustration when we attempt to get at it. This constant cycle of data generation,
storage, and difficulty in retrieval and analysis has resulted in the development of
new and powerful tools to assist us in meeting this challenge.
These new and powerful tools are the data warehouse and data mining.
From this definition, we can determine exactly what a data warehouse is. First, it is
a database containing a copy of transaction data. Because the data warehouse
contains a copy of transaction data rather than the actual record generated by the
original transaction, the data in the warehouse are not subject to update or change
once they are committed to the warehouse. This means that the data in a data
warehouse are static and, once entered, remain unaltered in any shape or form.
Hence data warehouses generally do not get smaller, but rather grow to enormous
proportions.
Second, the copy of transaction data is specifically structured. Thus data committed
to the data warehouse is transformed to conform to a specific structure such that
transactional data from a variety of sources can reside in the data warehouse using
this specified structure.
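As a minimal sketch of this idea, the snippet below maps transaction records from two hypothetical source systems (a point-of-sale system and a web store, with invented field names) onto one common warehouse structure, appending rows without ever updating them:

```python
# A minimal sketch of structuring transaction data for a warehouse.
# The source record layouts and the target schema are hypothetical.

def to_warehouse_row(record, source):
    """Map a source-specific transaction record onto one common structure."""
    if source == "pos":          # point-of-sale system uses short keys
        return {"txn_id": record["id"],
                "amount": record["amt"],
                "txn_date": record["dt"]}
    if source == "web":          # web store uses verbose keys
        return {"txn_id": record["order_number"],
                "amount": record["total"],
                "txn_date": record["order_date"]}
    raise ValueError(f"unknown source: {source}")

warehouse = []                   # rows are appended, never updated
warehouse.append(to_warehouse_row({"id": 1, "amt": 9.5, "dt": "2024-01-02"}, "pos"))
warehouse.append(to_warehouse_row({"order_number": "A7", "total": 12.0,
                                   "order_date": "2024-01-03"}, "web"))
```

Whatever the source layout, every row that lands in the warehouse shares the same structure, which is what makes later querying and analysis across sources possible.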
Finally, our definition tells us what the purpose of the data warehouse is – querying,
analysis, and reporting. The data warehouse becomes a central repository for all
organizational data deemed useful for the exploration of new relationships, trends
and hidden values.
The data mart is nothing more than a smaller, more focused data warehouse. In
many cases, organizations find it useful to create data marts for specific business
units that have equally specific data analysis needs. Although the larger data
warehouse could support those needs, the sheer bulk of data contained within
a typical data warehouse can reduce the efficiency of a tightly focused data
analysis effort.
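The relationship between the two can be sketched very simply: a data mart is a subset of warehouse rows selected for one business unit. The rows and the `unit` field below are hypothetical:

```python
# Hypothetical sketch: a data mart as a focused subset of the warehouse.
warehouse = [
    {"txn_id": 1, "unit": "sales",   "amount": 9.5},
    {"txn_id": 2, "unit": "service", "amount": 4.0},
    {"txn_id": 3, "unit": "sales",   "amount": 12.0},
]

def build_data_mart(rows, business_unit):
    """Extract only the rows a specific business unit analyzes."""
    return [r for r in rows if r["unit"] == business_unit]

sales_mart = build_data_mart(warehouse, "sales")
```

Analysts in the sales unit then query the much smaller `sales_mart` rather than scanning the full warehouse.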
[Figure: Data warehouse architecture – operational systems of record feed data
acquisition through integration/transformation programs into an enterprise data
management layer holding metadata, summarized data, current detail, and
archives, from which data delivery serves users.]
The heart of a data warehouse is its current detail, where the bulk of data
resides. Current detail comes directly from operational systems and may be
stored as raw data or as aggregations of raw data. Current detail, organized by
subject area, represents the entire enterprise, rather than a given application.
Current detail is the lowest level of data granularity (size of unit represented by
data) in the data warehouse. Every data entity in current detail is a snapshot, at
a moment in time, representing the instance when the data are accurate.
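The snapshot idea can be illustrated with a short sketch, assuming a hypothetical customer-balance entity: a change in the operational system produces a new timestamped snapshot in current detail rather than an update to the old one.

```python
# Sketch: every entity in current detail is a timestamped snapshot.
# The entity and its fields are illustrative only.
from datetime import date

current_detail = []

def snapshot(customer_id, balance, as_of):
    """Record the state of an entity at one moment in time."""
    current_detail.append({"customer_id": customer_id,
                           "balance": balance,
                           "as_of": as_of})

snapshot(42, 100.0, date(2024, 1, 31))
snapshot(42, 85.0, date(2024, 2, 29))   # a new snapshot, not an update
```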
A system of record is the source of the data that feed the data warehouse. Data
in a data warehouse differ from operational systems data in that they can only
be read, not modified. Thus, it is necessary that a data warehouse be populated
with the highest quality data available, i.e., data that are most timely, complete,
accurate, and have the best structural conformance to the data warehouse.
Often these data are closest to the source of entry.
Even the highest quality operational data cannot usually be copied as-is into
a data warehouse. Raw operational data are virtually unintelligible to most
information users. Operational data seldom conform to the logical, subject-
oriented structure of the data warehouse. Further, different operational systems
represent data differently and use different codes for the same thing. Moreover,
operational data may be stored and managed redundantly and may even reside
in many different physical sources. Hence these operational data must be
cleaned, edited, and reformatted before being loaded into a data warehouse.
As operational data items pass from their systems of record to a data
warehouse, integration and transformation programs convert them from
application-specific data into enterprise data.
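A minimal sketch of such a transformation step, with invented systems and code tables: two operational systems encode the same customer status differently, and the program maps both onto one enterprise code.

```python
# Sketch of an integration/transformation step: two operational systems
# code the same attribute differently; the program maps both onto one
# enterprise code. The systems and code tables are hypothetical.

STATUS_CODES = {
    "billing":  {"A": "active", "I": "inactive"},
    "shipping": {"1": "active", "0": "inactive"},
}

def transform(record, system):
    """Convert an application-specific record into enterprise form."""
    out = dict(record)
    out["status"] = STATUS_CODES[system][record["status"]]
    return out

row_a = transform({"cust": 7, "status": "A"}, "billing")
row_b = transform({"cust": 7, "status": "1"}, "shipping")
```

After transformation, both records carry the same enterprise status code, so they can sit side by side in the warehouse.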
4.5 Archives
Data warehouse archives contain old data of significant, continuing interest and
value to the enterprise. There is usually a massive amount of data stored in the
data warehouse archives, with low incidence of access. Archive data are most
often used for forecasting and trend analysis. Although archive data may be
stored with the same level of granularity as current detail, it is more likely that
archive data are aggregated as they are archived. Archives include not only old
data; they also include the metadata that describes the old data’s
characteristics.
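The aggregation step described above can be sketched as follows, assuming hypothetical daily detail rows rolled up to monthly totals on their way into the archive:

```python
# Sketch: aggregating current detail as it is archived (daily rows
# rolled up to monthly totals); the fields are illustrative.
from collections import defaultdict

daily_detail = [
    {"month": "2024-01", "amount": 9.5},
    {"month": "2024-01", "amount": 12.0},
    {"month": "2024-02", "amount": 4.0},
]

def archive(rows):
    """Roll daily rows up to one coarser-grained row per month,
    recording the new granularity as metadata alongside the data."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["month"]] += r["amount"]
    return {"granularity": "monthly", "totals": dict(totals)}

archived = archive(daily_detail)
```

Note that the archive carries its own descriptive metadata (here, just the granularity) along with the aggregated data, as the text describes.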
4.6 Metadata
1. HP – Intelligent Warehouse
2. IBM – DB2 Database Server, Enterprise Copy Manager
3. Microsoft – Microsoft SQL Server
4. NCR – Teradata
5. Oracle – Oracle8, Discoverer/2000
6. Siemens-Pyramid – Smart Warehouse
In the early 1960s, data mining was referred to as ‘statistical analysis’. During
this period, the pioneers of statistical analysis were SAS, SPSS, and IBM.
Originally, statistical analysis consisted of classical statistical routines such as
correlation, regression, chi-square, and cross tabulation.
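Two of these classical routines can be sketched with nothing but the standard library: Pearson correlation between two numeric series, and a cross tabulation counting category pairs. The sample data are made up.

```python
# Sketch of two classical routines: Pearson correlation and cross
# tabulation, using only the standard library. Sample data are invented.
from collections import Counter

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def crosstab(pairs):
    """Count occurrences of each (row_category, col_category) pair."""
    return Counter(pairs)

r = correlation([1, 2, 3, 4], [2, 4, 6, 8])        # perfectly linear series
table = crosstab([("M", "yes"), ("M", "no"), ("F", "yes"), ("M", "yes")])
```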
Although all data mining endeavors are unique to their respective analyst and
problem under study, each possesses a set of commonalities with regard to the
process steps necessary to achieve a successful outcome. The steps are as
follows:
1. Infrastructure Preparation
2. Exploration
3. Analysis
4. Interpretation
5. Exploitation
6.1 Infrastructure Preparation
The first step in data mining is the identification and preparation of the
infrastructure. It is in the infrastructure that the actual data mining activity will
occur. The infrastructure contains at least:
1. A hardware platform
2. A database management system (DBMS) platform
3. One or more tools for data mining
6.2 Exploration
6.3 Analysis
Once a set of patterns has been discovered, each pattern must be analyzed.
Some patterns will not be statistically strong, whereas others may display high
statistical strength (i.e., significance). On the other hand, a pattern that is not
strong today but whose strength increases over time may be of great interest.
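One way to capture this idea in code is a simple rule that flags a pattern as interesting if it is strong now, or weak but consistently gaining strength across successive mining runs. The thresholds and the strength scores below are hypothetical:

```python
# Sketch: flagging patterns by strength and by trend across mining runs.
# The thresholds and strength values are hypothetical.

def is_interesting(strengths, strong=0.8, rising=0.1):
    """Interesting if strong now, or weak but consistently rising
    with a meaningful total gain over successive runs."""
    current = strengths[-1]
    if current >= strong:
        return True
    gains = [b - a for a, b in zip(strengths, strengths[1:])]
    return bool(gains) and all(g > 0 for g in gains) \
        and current - strengths[0] >= rising

steady = is_interesting([0.85])               # strong today
trending = is_interesting([0.2, 0.35, 0.5])   # weak but rising
flat = is_interesting([0.3, 0.3, 0.3])        # weak and flat
```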
6.4 Interpretation
The next step is to interpret the patterns; without interpretation, the discovered
patterns would be useless. Interpreting the patterns requires a combination of
technical and business expertise. Some considerations in interpreting the
patterns include:
6.5 Exploitation
7.0 REFERENCES