
Anwendersoftware (AS)

Data Warehouse, Data Mining, and OLAP Technologies

Data Warehouse Architecture


Data Warehouse Architecture Anwendersoftware

Overview

• Data Warehouse Architecture
• Data Sources and Data Quality
• Data Mart
• Federated Information Systems
• Operational Data Store
• Metadata
  ◦ Metadata Repository
  ◦ Metadata in Data Warehousing


Architecture
[Figure: Reference architecture of a data warehouse system. Components shown: data sources, monitor, extraction, data staging area, transformation, load, data warehouse, end user data access, data warehouse manager, metadata manager, and metadata repository. The legend distinguishes data flow from control flow.]

(A. Bauer, H. Günzel: Data Warehouse Systeme, 2001)



Data Sources

• Characteristics of source systems:
  ◦ narrow, "account-based" queries
  ◦ not queried in the broad and unexpected ways typical of a data warehouse
  ◦ maintain little historical data
  ◦ no conformed dimensions (product, customer, geography, …) with other legacy systems
  ◦ use keys (production keys) to make certain things unique (product, customer, …)

• Important issues in selecting data sources:
  ◦ Purpose of the data warehouse
  ◦ Quality of data sources (consistency, correctness, completeness, exactness, reliability, understandability, relevance)
  ◦ Availability of data sources (organizational prerequisites, technical prerequisites)
  ◦ Costs (internal data, external data)


Data Quality
consistency       • Are there contradictions in data and/or metadata?
correctness       • Do data and metadata provide an exact picture of reality?
completeness      • Are there missing attributes or values?
exactness         • Are exact numeric values available?
                  • Are different objects identifiable? Homonyms?
reliability       • Is there a Standard Operating Procedure (SOP) that describes the provision of source data?
understandability • Does a description for the data and coded values exist?
relevance         • Does the data contribute to the purpose of the data warehouse?
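
Two of the quality criteria, completeness and consistency, can be checked mechanically on extracted rows. The following is a minimal sketch, not a method taken from the slides: the function names, the sample `customers` rows, and the choice of "same key, different value" as the contradiction test are all assumptions made for illustration.

```python
from typing import Any


def completeness(rows: list[dict[str, Any]], required: list[str]) -> float:
    """Share of rows in which every required attribute is present and non-empty."""
    if not rows:
        return 1.0
    complete = sum(
        all(row.get(attr) not in (None, "") for attr in required) for row in rows
    )
    return complete / len(rows)


def contradictions(rows: list[dict[str, Any]], key: str, attr: str) -> set:
    """Keys that occur with more than one distinct value for the given attribute."""
    seen: dict[Any, set] = {}
    for row in rows:
        seen.setdefault(row[key], set()).add(row[attr])
    return {k for k, values in seen.items() if len(values) > 1}


# Hypothetical sample data: customer 2 has no name, customer 1 has two countries.
customers = [
    {"id": 1, "name": "Ada", "country": "DE"},
    {"id": 2, "name": "", "country": "FR"},
    {"id": 1, "name": "Ada", "country": "AT"},
]

print(completeness(customers, ["id", "name", "country"]))
print(contradictions(customers, "id", "country"))
```

Criteria such as relevance or understandability depend on the purpose of the warehouse and on documentation, so they are not covered by checks of this kind.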


Dimensions of Data Sources


origin          • internal vs. external data
time            • current vs. historic data
usage           • data vs. metadata
type            • number, string, time, graphic, audio, video, …
                • numeric, alphanumeric, boolean, binary, …
character set   • ASCII, EBCDIC, UNICODE, …
orientation     • left to right, right to left, top-down
confidentiality • strictly confidential, confidential, public, …


Monitoring

• Goal: Discover changes in data source incrementally
• Approaches:

Approach    Based on …                           Changes identified by …
Trigger     triggers defined in source DBMS      trigger writes a copy of changed data to files
Replica     replication support of source DBMS   replication provides changed rows in a separate table
Timestamp   timestamp assigned to each row       use timestamp to identify changes (supported by temporal DBMS)
Log         log of source DBMS                   read log
Snapshot    periodic snapshot of data source     compare snapshots
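
The snapshot approach can be illustrated with a small comparison routine. This is a sketch under stated assumptions: each snapshot is held in memory as a dictionary from primary key to row, and the name `diff_snapshots` and the sample rows are hypothetical. Real snapshots are usually too large for this and are compared with sort-merge or hash-based techniques instead.

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Compare two snapshots keyed by primary key and classify the changes."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return {"inserted": inserted, "deleted": deleted, "updated": updated}


# Hypothetical snapshots of a customer table: key -> (name, city)
yesterday = {1: ("Ada", "Berlin"), 2: ("Bob", "Paris"), 3: ("Cy", "Rome")}
today = {1: ("Ada", "Berlin"), 2: ("Bob", "Lyon"), 4: ("Di", "Oslo")}

changes = diff_snapshots(yesterday, today)
print(changes)
```

Unlike triggers, replication, or log reading, this approach needs no support from the source DBMS, at the price of reading the full source on every run.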


Data Staging Area (DSA)

Data Staging Area:
A storage area and a set of processes that clean, transform, combine, deduplicate, household, archive, and prepare source data for use in the data warehouse.
(R. Kimball et al.: The Data Warehouse Lifecycle Toolkit, 1998)

• Data is temporarily stored in the data staging area before it is loaded into the data warehouse.
• All transformations are performed in the DSA.
  ◦ Preprocessing does not influence data sources or data warehouse
• DSA is the central repository for ETL (Extraction - Transformation - Load) processing.

[Figure: Detail of the architecture diagram showing the data staging area together with the monitor, extraction, transformation, and load components.]


Extraction

• Transfer data from the data sources into the data staging area.
• The extracted subset of the data sources and the schedule of the extraction depend on the kind of analysis that should be supported.
• The method depends on the monitoring strategy used:
  ◦ Read data from a file written by triggers.
  ◦ Read data from replication tables.
  ◦ Select data based on the timestamp.
  ◦ Read data from the log.
  ◦ Read the output of the snapshot comparison.
• Multiple extract types:
  ◦ periodic
  ◦ started by the admin/user
  ◦ event-driven
  ◦ immediately after changes in data sources
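
A timestamp-based extraction can be sketched as follows. The example uses Python's built-in `sqlite3` module as a stand-in for a source DBMS. The `orders` table, its `modified_at` column, and the function `extract_changed` are assumptions made for illustration, and the timestamps are stored as ISO 8601 strings so that string comparison matches chronological order.

```python
import sqlite3


def extract_changed(conn: sqlite3.Connection, last_extract: str) -> list[tuple]:
    """Return all rows modified after the time of the previous extraction."""
    cursor = conn.execute(
        "SELECT id, amount, modified_at FROM orders "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_extract,),
    )
    return cursor.fetchall()


# Hypothetical source table with a modification timestamp on each row
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, modified_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-01-01T08:00:00"),
        (2, 25.5, "2024-01-02T09:30:00"),
        (3, 7.0, "2024-01-03T12:00:00"),
    ],
)

# Only rows changed since the last run are moved to the staging area.
staged = extract_changed(conn, "2024-01-01T23:59:59")
print(staged)
```

A periodic extract would call such a function on a schedule and remember the highest timestamp it has seen. Deleted rows are invisible to this method unless the source marks them instead of removing them.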

Transformation

• Convert the data into something presentable to the users and valuable to the business.
  ◦ Transformation of structure and content
• Typical transformations:
  ◦ denormalization, normalization
  ◦ data type conversion
  ◦ calculation, aggregation
  ◦ standardization of strings and date values
  ◦ conversion of measures
  ◦ cleansing (missing, wrong, and inconsistent values)
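
Several of the typical transformations can be shown on a single record. This is a minimal sketch. The field names, the three accepted source date formats, and the pound-to-kilogram conversion are assumptions chosen for illustration, not requirements from the slides.

```python
from datetime import datetime


def standardize_date(text: str) -> str:
    """Convert a few assumed source date formats to ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def transform(record: dict) -> dict:
    """Apply string and date standardization, type conversion, and unit conversion."""
    return {
        # standardization of strings: collapse whitespace, unify capitalization
        "customer": " ".join(record["customer"].split()).title(),
        # standardization of date values
        "order_date": standardize_date(record["order_date"]),
        # data type conversion: text to integer
        "quantity": int(record["quantity"]),
        # conversion of measures: pounds to kilograms
        "weight_kg": round(float(record["weight_lb"]) * 0.45359237, 3),
    }


raw = {
    "customer": "  ada   LOVELACE ",
    "order_date": "24.12.2023",
    "quantity": "3",
    "weight_lb": "10",
}
clean = transform(raw)
print(clean)
```

A record with an unparseable date raises an error here. A production process would more likely route such records to a reject table for cleansing.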


Load

• Transfer data from the data staging area into the data warehouse.
• Data in the warehouse is rarely replaced. The history of values/changes is stored instead.
• Mainly based on bulk load tools of the DBMS.
• Offline vs. online load.
• Parallel load may be required.
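
The point that warehouse data is rarely replaced can be illustrated by a load step that appends a new version of a row instead of overwriting the old one. This is a sketch only: it uses Python's `sqlite3` module and row-by-row inserts for readability, whereas a real load would normally rely on the bulk load tools of the DBMS. The table `customer_history` and the function `load` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_history ("
    "customer_id INTEGER, city TEXT, valid_from TEXT, "
    "PRIMARY KEY (customer_id, valid_from))"
)


def load(conn: sqlite3.Connection, rows: list[tuple], load_date: str) -> None:
    """Append a new version for every customer that is new or whose city changed."""
    for customer_id, city in rows:
        current = conn.execute(
            "SELECT city FROM customer_history WHERE customer_id = ? "
            "ORDER BY valid_from DESC LIMIT 1",
            (customer_id,),
        ).fetchone()
        if current is None or current[0] != city:
            # Existing rows are never updated; the change is recorded as a new row.
            conn.execute(
                "INSERT INTO customer_history VALUES (?, ?, ?)",
                (customer_id, city, load_date),
            )
    conn.commit()


load(conn, [(1, "Berlin"), (2, "Paris")], "2024-01-01")
load(conn, [(1, "Berlin"), (2, "Lyon")], "2024-02-01")

history = conn.execute(
    "SELECT customer_id, city, valid_from FROM customer_history "
    "ORDER BY customer_id, valid_from"
).fetchall()
print(history)
```

After the second load, customer 2 has two rows, so an analysis can still see that the customer was in Paris in January.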


Data Warehouse Manager

• Controls all components of the data warehouse system:
  ◦ Monitor: Discover changes in data sources
  ◦ Extraction: Select and transfer data from data sources to the data staging area
  ◦ Transformation: Consolidate data
  ◦ Load: Transfer data from data staging area to the data warehouse
  ◦ End User Data Access: Analysis of data in the data warehouse


Basic Elements of the Data Warehouse

[Figure: Basic elements of the data warehouse, shown as four stages from left to right.]

Source Systems
  (extract → Data Staging Area)

Data Staging Area
  Storage: flat files; RDBMS; other
  Processing: clean; prune; combine; remove duplicates; household; standardize; conform dimensions; store awaiting replication; archive; export to data marts
  No user query services
  (populate, replicate, recover → Presentation Servers)

"The Data Warehouse" Presentation Servers
  Data Mart #1: OLAP query services; dimensional!; subject oriented; locally implemented; user group driven; may store atomic data; may be frequently refreshed; conforms to DW Bus
  Data Mart #2
  Data Mart #3
  DW BUS between the data marts: conformed dimensions, conformed facts
  (feed → End User Data Access)

End User Data Access
  Ad hoc query tools
  Report writers
  End user applications
  Models: forecasting; scoring; allocating; data mining; other downstream systems; other parameters; special UI

Feedback paths: upload cleaned dimensions; upload model results

(R. Kimball et al.: The Data Warehouse Lifecycle Toolkit, 1998)



Architecture

[Figure: Clients accessing a data warehouse (central), a logical data warehouse (federated), and a data warehouse (tiered).]

Central architecture
• only one data model
• performance bottleneck
• complex to build
• easy to maintain

Federated architecture
• logically consolidated
• separate physical databases that store detailed data
• faster response time

Tiered architecture
• physical central data warehouse
• separate physical databases that store summarized data
• faster response time

(M. Jarke et al., Fundamentals of Data Warehouses, 2002)



Data Marts
[Figure: Two data mart architectures. Dependent data marts: data is loaded into a central data warehouse, the data marts are derived from it, and end user data access runs against the data marts. Independent data marts: data is loaded into each data mart separately, a transformation step integrates the data marts into a data warehouse, and end user data access runs against the data marts and the data warehouse.]



Data Marts
dependent data marts (tiered architecture)
• Central data warehouse (DW) is built first
• Extracts of the data warehouse are provided as data marts (materialized views)
• Establish ETL process for DW only
• Consistent analysis on DW and DM

independent data marts (federated architecture)
• Several data marts (DM) are built first
• Data marts are integrated by means of a second transformation step
• Establish ETL process for each DM and the central DW
• Inconsistent analysis is possible
• Virtual data warehouse possible (federated architecture)


Federated Information Systems

• Federated DBMS
  ◦ Transparent access to a collection of heterogeneous and semi-autonomous data sources.
  ◦ Complete, extensible database engine
    - Function compensation
    - Powerful (global) query optimizer (pushdown analysis, cost-based optimization, query rewrite)

[Figure: Layers of a federated information system, from top to bottom.]
  Presentation layer
  Federation layer (e.g. uniform access language, uniform access schema, uniform metadata set)
  Foundation layer (data sources), which is also accessed directly by local applications

Architecture for a Federated Database Server

[Figure: Components shown: client, SQL API, federated database server with catalog and data, wrapper, and two back-end data sources, each with its own data.]


Federated DBMS: Processing Scenario

Query submitted to the federated DB (result: Ename & Dname):

    SELECT Ename, Dname
    FROM EMP E, DEPT D
    WHERE E.Dno = D.Dno
    AND E.floor = 2
    AND D.Mgr = 'Cooke'

Subquery sent to Oracle:

    SELECT Ename, Dno
    FROM EMP
    WHERE Floor = 2
    ORDER BY Dno

Subquery sent to DB2:

    SELECT Dname, Dno
    FROM Dept
    WHERE Mgr = 'Cooke'
    ORDER BY Dno

• Knowing what the data source can do is a good idea!
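
Because both subqueries return their rows ordered by Dno, the federated server can combine them with a merge join. The sketch below shows one way this last step could work. It is an illustration, not a description of how any particular federated DBMS implements it. The sample rows are invented, and Dno is assumed to be unique in DEPT.

```python
def merge_join(emps: list[tuple], depts: list[tuple]) -> list[tuple]:
    """Join (Ename, Dno) rows with (Dname, Dno) rows; both inputs are sorted by Dno."""
    result = []
    i = j = 0
    while i < len(emps) and j < len(depts):
        e_dno, d_dno = emps[i][1], depts[j][1]
        if e_dno < d_dno:
            i += 1  # employee's department did not qualify on the DEPT side
        elif e_dno > d_dno:
            j += 1  # no further employees for this department
        else:
            # Dno is assumed unique in DEPT, so only the employee side advances.
            result.append((emps[i][0], depts[j][0]))
            i += 1
    return result


# Hypothetical results of the two pushed-down subqueries, already sorted by Dno
oracle_rows = [("Smith", 10), ("Jones", 20), ("Miller", 20), ("Lee", 30)]
db2_rows = [("Sales", 20), ("Research", 40)]

joined = merge_join(oracle_rows, db2_rows)
print(joined)
```

Pushing the selections and the ORDER BY down to the sources means that only qualifying, presorted rows cross the network. This only works if each source supports these operations, which is why the optimizer has to know what each data source can do.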


Operational Data Store (ODS)

• The term has taken on many definitions. For example:
  ◦ Point of integration for operational systems
    - refreshed within a few seconds after the operational data sources are updated
    - very few transformations are performed
    - Example: Banking environment where data sources keep individual accounts of a large multinational customer, and the ODS stores the total balance for this customer.
    → true operational system separated from the data warehouse
  ◦ Decision support access to operational data
    - integrated and transformed data are first accumulated and then periodically forwarded to the ODS
    - involves more integration and transformation processing
    - Example: Bank that stores in the ODS an integrated individual bank account on a weekly basis
    → part of the data warehouse or separate system?

Classes of Operational Data Stores

[Figure: For each class, a small diagram shows the applications, the ODS, and the DWH.]

class 0 • Tables are copied from the operational environment
class 1 • Transactions are moved to the ODS in an immediate manner (range of one to two seconds)
class 2 • Activities in the operational environment are stored, integrated, and forwarded to the ODS
class 3 • ODS is fed aggregated analytical data from the data warehouse
class 4 • Combination of integrated data from the operational environment and aggregated data from the analytical environment

(W. H. Inmon: ODS Types, www.dmreview.com, 01/2000)

Distribution of DW Project Costs

DW Design Costs
  disk storage                       30%
  processor costs                    20%
  integration and transformation     15%
  DBMS                               10%
  network costs                      10%
  access/analysis tools               6%
  metadata design                     5%
  activity monitor                    2%
  data monitor                        2%

Recurring DW costs
  DW refreshment                                                          55%
  servicing data mart requests for data                                   21%
  monitoring of activity and data                                          7%
  end-user training                                                        6%
  metadata management                                                      3%
  periodic verification of the conformance to the enterprise data model   2%
  summary table usage analysis                                             2%
  security administration                                                  1%
  occasional reorganization of data                                        1%
  data archiving                                                           1%
  capacity planning                                                        1%

(M. Jarke et al., Fundamentals of Data Warehouses, 2002)



Metadata Repository

• A repository is a shared database of information about engineering artifacts, such as software, documents, maps, information systems, and manufactured components and systems.
• Functions of a repository:
  ◦ Object management
  ◦ Dynamic extensibility
  ◦ Relationship management
  ◦ Notification
  ◦ Version management
  ◦ Configuration management

(P. Bernstein: Repositories and Object Oriented Databases, 1998)



Metadata in Data Warehousing

• What data is available in the warehouse and where is the data located?
• Data dictionary: Definitions of the databases and relationship between data elements
• Data flow: Direction and frequency of data feed
• Data transformation: Transformations required when data is moved
• Version control: Changes to metadata are stored
• Data usage statistics: A profile of data in the warehouse
• Alias information: Alias names for a field
• Security: Who is allowed to access the data

→ Stored in a metadata repository
→ Need for a standard interchange format
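
The kinds of metadata listed above can be pictured as one record per warehouse table that tools exchange in a serialized form. The sketch below is purely illustrative. The classes `ColumnMetadata` and `TableMetadata`, all field names, and the sample values are hypothetical, and JSON is used only as a stand-in for an interchange format, not as an established warehouse metadata standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ColumnMetadata:
    name: str                      # data dictionary: column definition
    source: str                    # data flow: where the values come from
    transformation: str            # data transformation applied on the way in
    aliases: list[str] = field(default_factory=list)        # alias information
    allowed_roles: list[str] = field(default_factory=list)  # security


@dataclass
class TableMetadata:
    table: str
    refresh: str                   # data flow: frequency of the data feed
    version: int                   # version control of the metadata itself
    columns: list[ColumnMetadata]


entry = TableMetadata(
    table="sales_fact",
    refresh="daily",
    version=2,
    columns=[
        ColumnMetadata(
            name="revenue_eur",
            source="orders.amount",
            transformation="convert currency to EUR",
            aliases=["turnover"],
            allowed_roles=["controller"],
        )
    ],
)

# Export for another tool, then read it back as that tool would.
exported = json.dumps(asdict(entry), indent=2)
restored = json.loads(exported)
print(exported)
```

Two tools can only exchange such records if they agree on the field names and their meaning, which is the motivation for a standard interchange format.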



Metadata in Data Warehousing

• Criteria to identify important classes of metadata in data warehousing:
  ◦ Type of data
  ◦ Abstraction
  ◦ User
  ◦ Origins
  ◦ Time
• Usage of metadata in data warehousing:
  ◦ passive
  ◦ active
  ◦ semi-active
• Main goals:
  ◦ Support development and operation of a data warehouse
    - system integration
    - processes for DW administration
    - flexible application development
    - access rights
  ◦ Provide information for data warehouse users
    - quality of data
    - consistent terminology
    - support for data analysis


Metadata Management
centralized - decentralized - federated

[Figure: Federated metadata management. Components shown: user access, administration, development, and analysis tools; a local tool; the metadata manager; and several metadata repositories. The legend distinguishes data flow from control flow.]

(A. Bauer, H. Günzel: Data Warehouse Systeme, 2001)



Summary

• Basic Components:
  ◦ Data Staging Area: Extraction, Transformation, Load
  ◦ Data Warehouse Database
  ◦ Data Warehouse Manager
  ◦ Metadata Repositories and Metadata Manager
• Data Marts: Distributed Data Warehouse
• Data Warehouse vs. Federated Information Systems
• Metadata is important to:
  ◦ Support development and operation of a data warehouse
  ◦ Provide information for data warehouse users
• Metadata standards are important to interchange metadata between warehouse tools, warehouse platforms and warehouse metadata repositories.
