A data warehouse gathers some raw inputs (data from transaction processing systems and other places), operates on those inputs (to cleanse, integrate, and store the data), and then distributes the results to users.
Extract data from a variety of sources. Most warehouses gather data from a
number of source systems.
Integrate data into a common repository. Once the data is extracted from its source, it can't just be thrown into database tables. To be useful, the data has to be cleansed, and the relationships between data elements must be validated and enforced.
Put data into a format that users can use. The data warehouse must deliver its
product in a standard, user-friendly format.
Provide users with query tools to access the warehouse. To support query needs, the information utility must supply tools that allow users to plug into the warehouse.
Data Stores
i. The source systems: the transaction processing systems that will provide data to the warehouse.
ii. The warehouse or integration layer.
iii. The data mart or high performance query structure (HPQS), and
iv. The data on the report or analysis in the end user's hands.
Data Flows
i. From the warehouse sources into the integration layer, where data is cleansed and integrated.
ii. From the integration layer to the HPQS.
iii. From the HPQS to the end user via reporting applications.
These two architectures can be quite similar or remarkably different. For example, while refresh loads usually look at online data, history loads frequently have to dredge up a lot of history from offline storage.
In addition, while refresh loads must determine which source records have changed so that just those are extracted, history loads frequently just bring everything over.
Finally, with regard to table design, we are firm believers that all records put into the warehouse should contain a number of time stamps.
How Will I Determine What Records to Extract?
The art of determining which records to extract from the source system is frequently called change data capture (CDC).
The point of change data capture is to recognize which source records have changed, and how, so that just the changed records are moved to the warehouse.
Techniques used to recognize changes to source database tables are:
Timestamps
Triggers
Application Integration Software (AIS)
File Compares
Timestamps
Some source systems timestamp each record whenever it is inserted or updated. In these situations, change data capture is reduced to an exercise of searching through tables to determine which records have changed.
In addition, ideally the source system doesn't actually allow deletes, but instead marks records as deleted and timestamps the delete time without actually removing them from the database.
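A minimal sketch of timestamp-based CDC, using an in-memory SQLite database (the table name, columns, and timestamp values are hypothetical, chosen just for illustration):

```python
import sqlite3

# Hypothetical source table with a last_update timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        cust_num TEXT PRIMARY KEY,
        cust_name TEXT,
        last_update TEXT,               -- ISO timestamp maintained by the source system
        deleted_flag INTEGER DEFAULT 0  -- soft delete: rows are marked, not removed
    )""")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, 0)",
    [("480BC", "Themistocles", "2024-01-01T09:00:00"),
     ("550BC", "Cyrus",        "2024-01-05T14:30:00")])

def extract_changes(conn, last_load_time):
    """Return every record touched since the previous warehouse load."""
    cur = conn.execute(
        "SELECT cust_num, cust_name, last_update, deleted_flag "
        "FROM customer WHERE last_update > ?", (last_load_time,))
    return cur.fetchall()

# Only the record updated after the January 3rd load is extracted.
changed = extract_changes(conn, "2024-01-03T00:00:00")
```

The warehouse load then records the high-water-mark timestamp it used, so the next run extracts only records changed after that.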
Triggers
A great technique for capturing changes in source records is to put triggers on the source
tables.
Every time a record is inserted into, updated in, or deleted from a source table, the
triggers write a corresponding message in a log file. The warehouse uses the information
in these log files to determine how to update itself.
In practice, though, it is unusual to see this method implemented, because it requires you to put triggers on, or modify, your source systems, a step that many organizations simply will not allow. Their (sometimes valid) concern is that the addition of triggers will jeopardize the performance of those source systems, in comparison to other warehouse loading techniques that touch the sources in batch mode during off-peak times.
Also, flat files do not support triggers.
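The trigger technique can be sketched with SQLite, which supports triggers (the table and log-table names are hypothetical; here the "log file" is modeled as a change_log table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_num TEXT PRIMARY KEY, cust_name TEXT);

    -- The change log that the warehouse load will later read.
    CREATE TABLE change_log (op TEXT, cust_num TEXT, cust_name TEXT);

    CREATE TRIGGER customer_ins AFTER INSERT ON customer
    BEGIN
        INSERT INTO change_log VALUES ('I', NEW.cust_num, NEW.cust_name);
    END;

    CREATE TRIGGER customer_upd AFTER UPDATE ON customer
    BEGIN
        INSERT INTO change_log VALUES ('U', NEW.cust_num, NEW.cust_name);
    END;

    CREATE TRIGGER customer_del AFTER DELETE ON customer
    BEGIN
        INSERT INTO change_log VALUES ('D', OLD.cust_num, OLD.cust_name);
    END;
""")

# Every insert, update, or delete leaves a corresponding log entry.
conn.execute("INSERT INTO customer VALUES ('480BC', 'Themistocles')")
conn.execute("UPDATE customer SET cust_name = 'Themistokles' WHERE cust_num = '480BC'")
log = conn.execute("SELECT op, cust_num FROM change_log").fetchall()
```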
Application Integration Software (AIS)
AIS tools are used to pass information between applications. Tools in this field have names like MQSeries, Mercator, and Tibco.
For example, say your company uses several Oracle applications. AIS can be used to link these applications, such that when a transaction occurs in one, the AIS transmits it to all the others. AIS provides an additional benefit as a data feed: the warehouse can subscribe to the same messages.
File Compares
Probably the least desirable technique for identifying changes in your source data is to compare the file as it appears today with a copy of how it appeared when you last loaded the warehouse.
Not only is this technique difficult to implement, but it's also less accurate than the other methods. How so? Well, this technique compares periodic snapshots. Thus, if you load your warehouse weekly, you will only see the new state of the database each week, not every change that occurred during the week.
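A sketch of the file-compare technique, assuming each snapshot has been parsed into a list of keyed records (all field names and values here are hypothetical):

```python
def diff_snapshots(old_rows, new_rows, key="cust_num"):
    """Compare two keyed snapshots; return the inserts, updates, and deletes."""
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    inserts = [new[k] for k in new.keys() - old.keys()]   # only in the new snapshot
    deletes = [old[k] for k in old.keys() - new.keys()]   # vanished since last load
    updates = [new[k] for k in new.keys() & old.keys() if new[k] != old[k]]
    return inserts, updates, deletes

last_week = [{"cust_num": "480BC", "city": "Athens"},
             {"cust_num": "550BC", "city": "Anshan"}]
this_week = [{"cust_num": "480BC", "city": "Argos"},    # changed record
             {"cust_num": "600BC", "city": "Sardis"}]   # new record

inserts, updates, deletes = diff_snapshots(last_week, this_week)
```

Note the limitation described above: a record that changed twice between snapshots shows up as a single update.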
How Will I Format The Extracted Records?
Once you've extracted the records, you should store them in a way that will allow you to recognize what each one means. Carry enough reference data to tie each record back to its source.
For example, make sure you keep information with each record that indicates which source system generated it, when the record was obtained, and the key of the record in that source system. This information is invaluable when testing your load routines, as well as when you need to investigate detail data.
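One way to carry that reference data is to wrap every extracted record with provenance fields; a minimal sketch (the field names are assumptions, not a standard):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ExtractedRecord:
    """One extracted record, with enough reference data to trace it back."""
    source_system: str   # which system generated the record
    source_key: str      # the record's key in that source system
    extracted_at: str    # when the record was obtained
    payload: dict        # the record contents themselves

def wrap(source_system, source_key, payload):
    return ExtractedRecord(
        source_system=source_system,
        source_key=source_key,
        extracted_at=datetime.now(timezone.utc).isoformat(),
        payload=payload)

rec = wrap("billing", "480BC", {"cust_name": "Themistocles", "city": "Athens"})
```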
What Will I Do With The Extracted Records?
Generally, extracted records are stored in flat files. Data loading programs periodically read these files and load the data into the warehouse as appropriate.
In general, we believe in building loosely coupled warehousing architectures. By loose coupling, we mean maintaining a separation between data extraction programs and data loading programs. This separation makes your warehouse more flexible and maintainable.
One nice benefit of loose coupling is that it eases the addition of new data sources. So long as new source systems submit their data in the standard file format, your standard load routines should be able to read that new data.
Dirty Data
Data received from source systems can be dirty in a number of ways.
Format violations
Referential integrity violations
Cross-system matching violations, and
Internal consistency violations.
Format Violations
Data types are potentially wrong. For example, you might find letters in supposedly
numeric fields, incorrectly formatted phone numbers, and other similar examples.
Referential Integrity Violations
Data is not referentially sound. The sales system, for instance, might record sales to customers who aren't listed in the customer file.
Cross-system Matching Violations
The same data elements appear in multiple systems but cannot be easily matched to each
other. This happens when a customer appears as J Smith in the sales system and John
Smith in the accounts receivable database.
Internal Consistency Violations
The same records are repeated in a single table, typically with minor differences such as different spellings of names and other fields. For example, the customer IBM might appear multiple times in your customer database: once as IBM, another time as International Business Machines, and so on.
The problem with dirty data is that it makes your warehouse unreliable. This sort of data
should be cleansed whenever the data is loaded into the warehouse.
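The first two kinds of dirtiness can be detected mechanically at load time; a minimal sketch (the field names and rules are hypothetical):

```python
def classify_violations(record, known_customers):
    """Flag format and referential integrity problems in one sales record."""
    problems = []
    # Format violation: the amount field should be numeric.
    try:
        float(record["amount"])
    except (ValueError, TypeError):
        problems.append("format: non-numeric amount")
    # Referential integrity violation: a sale to an unknown customer.
    if record["cust_num"] not in known_customers:
        problems.append("referential: unknown customer")
    return problems

known = {"480BC", "550BC"}
bad = {"cust_num": "999XX", "amount": "12O.50"}   # letter O in the amount, unknown customer
problems = classify_violations(bad, known)
```

Cross-system matching and internal consistency violations are harder: they require fuzzy matching across records rather than checks on a single record.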
Data Store 2: The Integration Layer
The integration layer, or warehouse, is a normalized database that unites the feeds from
all your sources in a single place.
As in all databases, it is strongly recommended that referential integrity constraints be
enabled in your warehouse/integration layer.
Why Build an Integration Layer?
Without an integration layer, every data mart must extract, cleanse, and integrate data from every source system it needs, duplicating that work mart by mart. If, on the other hand, you build an integration layer, every mart has to read only one source: the integration layer, which already contains integrated, clean data.
In the above table, the monthly bill information is a repeating group. To put our data into first normal form (sometimes referred to as 1NF), we must create a new table as follows.

First Normal Form
CUSTOMER (cust_num, cust_name, cust_addr, cust_phone, Substation_id, Substation_name)
CUSTOMER_MONTH (cust_num, Month_id, Month_name, Month_bill_amount)

Second Normal Form
To be in second normal form, all non-key attributes of a table must rely on the entire key of the table. Month_name depends only on Month_id, part of the key of CUSTOMER_MONTH, so it moves to its own table:
CUSTOMER (cust_num, cust_name, cust_addr, cust_phone, Substation_id, Substation_name)
CUSTOMER_MONTH (cust_num, Month_id, Month_bill_amount)
MONTH (Month_id, Month_name)

Third Normal Form
Substation_name depends on Substation_id, a non-key attribute of CUSTOMER, so it too moves to its own table:
CUSTOMER (cust_num, cust_name, cust_addr, cust_phone, Substation_id)
CUSTOMER_MONTH (cust_num, Month_id, Month_bill_amount)
MONTH (Month_id, Month_name)
SUBSTATION (Substation_id, Substation_name)
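The normalized design above can be expressed as DDL; here is a minimal SQLite sketch (the sample data values are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Third-normal-form version of the billing example.
    CREATE TABLE substation (
        substation_id   TEXT PRIMARY KEY,
        substation_name TEXT);
    CREATE TABLE month (
        month_id   INTEGER PRIMARY KEY,
        month_name TEXT);
    CREATE TABLE customer (
        cust_num      TEXT PRIMARY KEY,
        cust_name     TEXT,
        cust_addr     TEXT,
        cust_phone    TEXT,
        substation_id TEXT REFERENCES substation(substation_id));
    CREATE TABLE customer_month (
        cust_num          TEXT REFERENCES customer(cust_num),
        month_id          INTEGER REFERENCES month(month_id),
        month_bill_amount REAL,
        PRIMARY KEY (cust_num, month_id));
""")
conn.execute("INSERT INTO substation VALUES ('S1', 'North')")
conn.execute("INSERT INTO month VALUES (1, 'January')")
conn.execute("INSERT INTO customer VALUES ('480BC', 'Themistocles', '46 Maple Lane', '555-0100', 'S1')")
conn.execute("INSERT INTO customer_month VALUES ('480BC', 1, 42.50)")

# A join reassembles the original denormalized view of the data.
row = conn.execute("""
    SELECT c.cust_name, m.month_name, cm.month_bill_amount
    FROM customer_month cm
    JOIN customer c ON c.cust_num = cm.cust_num
    JOIN month m    ON m.month_id = cm.month_id
""").fetchone()
```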
Source system records (keyed by Custnum):

Custnum  Lname         Fname   Street         City    State    Country  Planet
480BC    Themistocles  Bob     46 Maple Lane  Athens  Georgia  USA      Earth
550BC    Cyrus         Hubert  74 Tech Way    Anshan           Persia   Earth
The same records after they have been placed into a warehouse table, on adding the surrogate key Custkey:

Custkey  Custnum  Lname         Fname   Street         City    State    Country  Planet
1        480BC    Themistocles  Bob     46 Maple Lane  Athens  Georgia  USA      Earth
2        550BC    Cyrus         Hubert  74 Tech Way    Anshan           Persia   Earth
The following is the warehouse table after Themistocles moved from Athens to Argos:

Custkey  Custnum  Lname         Fname   Street          City    State    Country  Planet
1        480BC    Themistocles  Bob     46 Maple Lane   Athens  Georgia  USA      Earth
2        550BC    Cyrus         Hubert  74 Tech Way     Anshan           Persia   Earth
3        480BC    Themistocles  Bob     Ostracon Place  Argos            Greece   Earth
Additional fields we have found useful on warehouse records include: insert date, last update date, status flag (for example, is this the current record, or has it been superseded by some other?), and the source system responsible for the record.
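These fields support the supersede-on-change behavior shown in the tables above; a minimal sketch, with the warehouse modeled as an in-memory list (a real implementation would of course use database tables):

```python
from datetime import date

warehouse = []      # the warehouse table, as a list of record dicts
_next_key = [1]     # surrogate key generator

def load_customer(cust_num, lname, city):
    """Insert a new version of a record, superseding any current one."""
    today = date.today().isoformat()
    for rec in warehouse:
        if rec["custnum"] == cust_num and rec["current"]:
            rec["current"] = False        # status flag: superseded
            rec["last_update"] = today
    warehouse.append({
        "custkey": _next_key[0],          # surrogate key assigned by the warehouse
        "custnum": cust_num,              # key from the source system
        "lname": lname,
        "city": city,
        "insert_date": today,
        "last_update": today,
        "current": True})
    _next_key[0] += 1

load_customer("480BC", "Themistocles", "Athens")
load_customer("550BC", "Cyrus", "Anshan")
load_customer("480BC", "Themistocles", "Argos")   # the move: new row, old row superseded
```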
Another Note About Dirty Data
Techniques for handling bad records include:
Ignoring them.
Rejecting bad records, but saving them in a separate file for manual review.
Loading as much of each bad record as possible and flagging the errors for later review.
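The second technique, rejecting bad records into a separate pile for manual review, might be sketched like this (the validation rule and field names are hypothetical):

```python
def load_batch(records, known_customers):
    """Route each incoming record: load it, or reject it for manual review."""
    loaded, rejects = [], []
    for rec in records:
        if rec.get("cust_num") in known_customers:
            loaded.append(rec)
        else:
            # Save rejects with a reason, rather than silently dropping them.
            rejects.append({"record": rec, "reason": "unknown customer"})
    return loaded, rejects

known = {"480BC", "550BC"}
batch = [{"cust_num": "480BC", "amount": 10.0},
         {"cust_num": "999XX", "amount": 5.0}]
loaded, rejects = load_batch(batch, known)
```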
Data Flow 2: From the Integration Layer to the High Performance Query Structure
Data warehousing is an end user concept. End users should query data only out of their
high performance query structures (HPQS), or data marts.
In this flow, data is extracted from the integration layer, or warehouse, and inserted into the data marts. Thus, you must build another set of extract, transformation, and load (ETL) jobs to populate the marts. Once again, you should generally try to incrementally refresh your data marts rather than completely refreshing them with every load.
One way in which your data marts will differ from the data warehouse is in their use of summary tables. Because they can greatly improve end-user query performance, data marts almost always contain some summaries of their atomic-level details. Thus, your load programs may be called upon to load not only the atomic-level data but also these summary tables.
Summary management is greatly aided by Oracle 8i's materialized view functionality: Oracle 8i allows you to create summary tables called materialized views.
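A materialized view essentially precomputes and stores the result of an aggregate query. What it holds can be sketched as a plain group-by summary over atomic-level fact rows (the fact data here is made up for illustration):

```python
from collections import defaultdict

# Atomic-level fact rows: (cust_num, month_id, bill_amount)
facts = [("480BC", 1, 40.0),
         ("480BC", 2, 35.0),
         ("550BC", 1, 60.0)]

def summarize_by_month(rows):
    """Precompute the per-month totals a summary table would hold."""
    totals = defaultdict(float)
    for cust_num, month_id, amount in rows:
        totals[month_id] += amount
    return dict(totals)

monthly_totals = summarize_by_month(facts)   # one total per month_id
```

The benefit is that end-user queries against the summary touch a few precomputed rows instead of scanning every atomic-level fact.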
Data Store 3: The High Performance Query Structure (HPQS)
HPQSs, or data marts, are databases and data structures set up specifically to support end-user queries.
These databases are most frequently managed by either:
Relational database engines (for example, Oracle 8i), or
Multidimensional database engines (for example, Express or Essbase).
It's important to note that data marts and HPQSs are logical, not physical, concepts. Frequently, an organization's data warehouse and its data marts will share the same computer. Sometimes they even share the same Oracle instance and schema. Still, they have different purposes and physically have very different table designs.
In the end, your data mart is a set of data structures that contains data in formats that
make it easier and speedier for end users to access than it would be in traditional,
normalized database formats.
Data Flow 3: From HPQSs to the End User Reporting Applications
The query tools get data out of the data marts and into the hands of end users. These tools generally issue SQL calls to relational databases, and other appropriate calls to databases using other technologies. Data is returned to the tools, which then format the results.
Data Store 4: Data in the End User's Hands
End users' reports really do constitute a data store. If your sponsors can't be made comfortable with that concept, you may have a problem on your hands.
Alternate Warehousing Architectures
No Warehouse: Users query directly off the OLTP sources. This is possible when the transaction processing systems are sufficiently strong, and end-user queries sufficiently limited, that there are no concerns about response time. This approach, obviously, won't help in situations where there is a need to integrate data across multiple systems.
Normalized Design: The architecture here is to build just the warehouse/integration layer. All data is integrated there, but rather than querying denormalized or multidimensional data marts, users query directly out of the integration layer. This architecture will provide you with the integration benefits of our preferred architecture. What it won't give you are the usability and query performance benefits associated with HPQSs, or data marts.
Just Data Marts: This approach is best used for limited-scope, point solutions that don't need data integrated from multiple systems. For example, if a department needs data out of a single OLTP system, and this type of data has very little application in the rest of the company, perhaps it doesn't make sense to bring this data through the warehouse.