
Data Warehouse Architecture
Objectives
► At the end of this lesson, you will know:
 Data Warehouse Architectures
 Components of Data Warehousing Architecture
 An overview of each of the components
 Considerations for Data Warehouse Design
 Common mistakes in Warehouse designs
 An overview of Warehouse on the web
Warehouse Architecture - 1
[Figure: Enterprise Data Warehouse]
► Operational Systems/Data feed Data Preparation
  Select, Extract, Transform, Integrate, Maintain
► Data Warehouse, with its Metadata
► Middleware/API connects the warehouse to the end-user tools
  EIS/DSS
  Query Tools
  OLAP/ROLAP
  Web Browsers
  Data Mining


Warehouse Architecture - 2
[Figure: Single Department Data Mart]
► Operational Systems/Data feed Data Preparation
  Select, Extract, Transform, Integrate, Maintain
► Data Marts, each with its own Metadata
► Middleware/API connects the data marts to the end-user tools
  EIS/DSS
  Query Tools
  OLAP/ROLAP
  Web Browsers
  Data Mining


Warehouse Architecture - 3
[Figure: Multi-tiered Data Warehouse]
► Operational Systems/Data feed Data Preparation
  Select, Extract, Transform, Integrate, Maintain
► Operational Data Store
► Data Warehouse, with its Metadata
► Data Marts
► Middleware/API connects the warehouse tiers to the end-user tools
  EIS/DSS
  Query Tools
  OLAP/ROLAP
  Web Browsers
  Data Mining


Data Warehouse Architectures
► There are three schools of thought about DW architectures
  One supports Dimensional Modeling throughout (Ralph Kimball)
  Second supports ER modeling for the Data Warehouse and Star Schemas for Data Marts (Bill Inmon)
  Third supports an ER model for the DW (NCR)
Kimball’s View
► Data flows from Operational Systems through the Staging Area to the Presentation Server
► Each Star is a Data Mart and has both summary and detail data
► DW is the sum total of all Data Marts
► DW Bus uses Conformed Dimensions
► Data Warehouse Server Processes (on the LAN)
  Extract
  Scrubbing
  Transformation
  Load Jobs
  Aggregation Jobs
  Replication
  Monitoring
  Management
  Meta Data Repository
  Meta Data Population
  Meta Data Maintenance
Multiple Data Marts With Conformed Dimensions
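
A conformed dimension is one dimension table shared by several data marts, so their measures can be compared on the same keys. The following is a minimal sketch using Python's built-in sqlite3 module; the table names (dim_product, fact_sales, fact_inventory) and the sample rows are hypothetical and only illustrate the idea.

```python
import sqlite3

def build_marts():
    """Create two star-schema data marts that share one conformed dimension."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_sales (product_key INTEGER, units_sold INTEGER);
        CREATE TABLE fact_inventory (product_key INTEGER, units_on_hand INTEGER);
        INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
        INSERT INTO fact_sales VALUES (1, 10), (1, 5), (2, 7);
        INSERT INTO fact_inventory VALUES (1, 40), (2, 3);
    """)
    return con

def drill_across(con):
    """Combine measures from both marts through the shared dimension."""
    sql = """
        SELECT p.name, s.sold, i.on_hand
        FROM dim_product p
        JOIN (SELECT product_key, SUM(units_sold) AS sold
              FROM fact_sales GROUP BY product_key) s USING (product_key)
        JOIN (SELECT product_key, SUM(units_on_hand) AS on_hand
              FROM fact_inventory GROUP BY product_key) i USING (product_key)
        ORDER BY p.name
    """
    return con.execute(sql).fetchall()
```

Each fact table is aggregated separately before the join, so rows from one mart do not multiply rows from the other.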


Inmon’s View
► Data flows from Operational Systems through the Staging Area to the Data Warehouse, which feeds the Data Marts
► Data Warehouse holds Detail Data in ER format
► Data Marts hold Summarized Data in Star formats
► Data Warehouse Server Processes (on the LAN)
  Extract
  Scrubbing
  Transformation
  Load Jobs
  Aggregation Jobs
  Replication
  Monitoring
  Management
  Meta Data Repository
  Meta Data Population
  Meta Data Maintenance
Data Warehouse (ER) Feeding Multiple Data Marts (Star Schema)


Components of a Data Warehouse Architecture
► Source Databases
► Data extraction/transformation/load (ETL) tool
► Data warehouse maintenance and administration
tools
► Data modeling tool or interface to external data
models
► Warehouse databases
► End-user data access and analysis tools
Components of a Data Warehouse Architecture
[Diagram]
► Source Databases
► Data Cleansing Tools
► ETL Tool
► Data Modeling Tool
► Central Metadata
► Warehouse Admin Tool
► Warehouse Databases
  Central Warehouse (RDBMS)
  Architected Data Marts (RDBMS, MDDB)
  Local metadata
► ROLAP Engine
► Data Access and Analysis Tools
  Managed Query
  Desktop OLAP
  ROLAP
  MOLAP
  Data Mining
Data Warehouse Is Not Just About Data... But Tools Too


Source Databases - Characteristics
► Legacy, relational, text or external sources
► Designed for high-speed transaction processing
► Real-time, current, volatile data
► Fast response for large numbers of concurrent users
► Many short transactions
► Update-intensive; modifications by row
► Inquiry-oriented; access by keys
► High integrity, security, recoverability
► Source data is often inconsistent and poorly modeled
Data Cleaning Tools
► To clean data at the source
► Clean up source data in-place on the host
► Business rule discovery tools which analyze the
source data and write cleaning rules based on
lexical analysis and AI techniques
► ETL tools have limited yet adequate data cleansing
functionality
Data Extraction, Transformation and Load (ETL) Tools
► Support data extraction, cleansing, aggregation,
reorganization, transformation, and load operations
► Generate and maintain centralized metadata
► Closely integrated with RDBMS
► Filter data, convert codes, calculate derived values,
map many source data fields to one target data field
► Automatic generation of data extract programs
► High speed loading of target data warehouses
► Employ middleware for near-real-time ETL
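
The row-level operations listed above (filter data, convert codes, calculate derived values, map many source fields to one target field) can be sketched in a few lines. This is an illustrative sketch, not how any particular ETL tool works; the field names (first_name, last_name, sex, qty, unit_price) and the code map are assumptions made for the example.

```python
def transform(row, code_map):
    """Apply the transformations named above to one source record."""
    return {
        # map many source fields to one target field
        "customer_name": f"{row['first_name'].strip()} {row['last_name'].strip()}",
        # convert a source code to the warehouse's standard code
        "gender": code_map.get(row["sex"], "UNKNOWN"),
        # calculate a derived value
        "revenue": row["qty"] * row["unit_price"],
    }

def etl(source_rows, code_map):
    """Filter out empty orders, then transform; returns rows ready to load."""
    return [transform(r, code_map) for r in source_rows if r["qty"] > 0]
```

Unmapped codes fall back to "UNKNOWN" rather than being dropped, so data quality problems stay visible in the warehouse.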
Data Modeling Tools
► Support Data Warehouse design as a modeling
technique
► Support both ER Modeling and Dimensional
Modeling
► Reverse Engineering and Forward Engineering
► Mapping of source data to target data
► Data Dictionary
► Reporting
Central Metadata
► The metadata repository is the foundation of the data
warehouse. It stores:
 Technical metadata
 Business metadata
► Metadata is stored in the central metadata repository
and may be distributed to local metadata repositories
► Metadata is generated and maintained by an ETL
tool as part of the specification of
extraction/transformation/load process
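
One way to picture a repository entry is as a record that carries both kinds of metadata for a target column. The sketch below is hypothetical: the class names, the fields, and the split of fields into technical and business metadata are assumptions chosen for illustration, not a standard layout.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ColumnMetadata:
    # technical metadata: where the column comes from and how it is built
    target_column: str
    source_fields: tuple
    transformation: str
    # business metadata: what the column means to its users
    business_name: str
    definition: str

class MetadataRepository:
    """Central repository keyed by target column name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.target_column] = entry

    def lineage(self, target_column):
        """Return the source fields a target column is derived from."""
        return self._entries[target_column].source_fields

    def export_local(self, columns):
        """Copy a subset of entries into a local repository."""
        local = MetadataRepository()
        for name in columns:
            local.register(self._entries[name])
        return local
```

The export_local method mirrors the distribution of central metadata to local repositories, for example one per data mart.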
Warehouse Administration Tools
► To set up users, authorize access, monitor access and
usage patterns, monitor ad hoc queries, analyze cost
structure of queries
► To restructure physical database structures to
improve performance
► To block long queries and reschedule them to run as
off-hours batch jobs
► Usually packaged with the RDBMS chosen for the
data warehouse
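
Blocking long queries and rescheduling them as off-hours batch jobs amounts to a cost check before execution. A minimal sketch follows; the function name, the cost estimate, and the limit are hypothetical, and real administration tools usually take the estimate from the RDBMS optimizer.

```python
def route_query(query, estimated_cost, cost_limit, batch_queue):
    """Run cheap queries now; defer expensive ones to the off-hours batch queue."""
    if estimated_cost > cost_limit:
        batch_queue.append(query)  # picked up later by the batch scheduler
        return "deferred"
    return "run_now"
```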
Warehouse Databases - Characteristics
► Central Data Warehouse - almost always a relational
database
► Data Marts - relational, multidimensional or hybrid
► Characterized by:
 small numbers of complex and large queries
 read-intensive, analytic, content driven activity
 subject-oriented, integrated, detail and summary data
 structured for analysis, browsing, surfing
Data Access and Analysis Tools
► To query the target database, specify reports, and perform
OLAP functions
► Query Tools
 Ad hoc query tools
 Managed query tools
► OLAP
 Desktop OLAP
 Relational OLAP
 Multidimensional OLAP
 Hybrid OLAP
► Data Mining Tools
Design Considerations
 Platform - SMP, MPP, NT, Unix
 Target Database - RDBMS, MDDB
 Partitioning
 Data Preparation - Data Quality Audit, Cleansing,
Extraction, Transformation
 Modeling - Facts & Dimensions
 Information Directory - Metadata Management
 Warehouse Administration
 End User Tools
 Granularity - Detail and Summarization
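
Granularity is the choice between keeping detail rows and keeping pre-computed summaries. The sketch below rolls daily detail rows up to monthly totals per product; the row layout (ISO date string, product, amount) is an assumption made for the example.

```python
from collections import defaultdict

def summarize_by_month(detail_rows):
    """Roll (date, product, amount) detail rows up to (month, product) totals."""
    totals = defaultdict(float)
    for date, product, amount in detail_rows:
        month = date[:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[(month, product)] += amount
    return dict(totals)
```

The summary answers monthly questions quickly, but day-level questions can only be answered if the detail rows are kept as well.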
Data Warehouse Hardware

Hardware Considerations
•Parallelism
•SMP or MPP
•Disk Storage
Hardware Considerations
► Parallelism
 Most deployments of VLDB Data Warehouses
are on SMP or MPP
► Fault Tolerance
 Redundant Array of Independent Disks (RAID)
 Arrangement of disks to achieve higher fault
tolerance
Hardware Considerations
► Three options for Hardware
 Symmetric Multiprocessing (SMP)
►Shared Memory Architecture
 Massively Parallel Processing (MPP)
►Shared Nothing Architecture
►Each node has its own memory and I/O
  Non Uniform Memory Access (NUMA)
►Cluster of SMP machines
►Classified as large SMP machines
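
In a shared-nothing (MPP) design, rows are partitioned across nodes, each node aggregates only its own partition, and a coordinator merges the partial results. The sketch below simulates this in a single process; the hash-partitioning scheme and the function names are assumptions, and the nodes run one after another here rather than in parallel.

```python
import zlib
from collections import Counter

def partition(rows, n_nodes):
    """Assign each (key, value) row to one node by hashing its key."""
    nodes = [[] for _ in range(n_nodes)]
    for key, value in rows:
        nodes[zlib.crc32(key.encode()) % n_nodes].append((key, value))
    return nodes

def local_sum(node_rows):
    """Aggregation done by one node, using only its own rows."""
    partial = Counter()
    for key, value in node_rows:
        partial[key] += value
    return partial

def parallel_sum(rows, n_nodes):
    """Coordinator: merge the partial sums produced by every node."""
    total = Counter()
    for node_rows in partition(rows, n_nodes):
        total.update(local_sum(node_rows))
    return dict(total)
```

Because all rows for a given key land on the same node, no node needs another node's memory or disk to compute its partial result.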
SMP vs. MPP machines
SMP                                         | MPP
For mission-critical OLTP or medium DSS     | Complex analytical, large-scale DSS
Scale to 10-12 CPUs (now 30)                | Scale to more than 100 CPUs
Growth is slow and steady                   | Growth is rapid and unpredictable
Database size < 200 GB                      | Database size > 500 GB
Aim is automation or basic decision support | Primary aim is strategic advantage
Server Scalability
► MPP
  Massively Parallel Processing machines
  Can scale up to 100s of processors
  Suitable for Data Warehouses > 500 GB
  Extremely complex data mining algorithms
► NUMA
  Non Uniform Memory Architecture
  High-end SMP clusters
  Suitable for Data Warehouses < 500 GB
► SMP
  Symmetric Multiprocessing machines
  Scale up to 10-12 CPUs
  Good for Data Warehouses < 200 GB
► Entry level
  Single-CPU systems
  Desktops
  Suitable only for small data marts (< 20 GB)
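
The size thresholds on this slide can be read as a simple lookup. The function below is only a sketch of that reading: the function name is hypothetical, and treating exactly 500 GB as MPP territory is an assumption, since the slide only states "< 500 GB" and "> 500 GB".

```python
def suggest_platform(size_gb):
    """Map a warehouse size in GB to the server tier named on the slide."""
    if size_gb < 20:
        return "Entry level"
    if size_gb < 200:
        return "SMP"
    if size_gb < 500:
        return "NUMA"
    return "MPP"  # assumption: the 500 GB boundary itself counts as MPP
```

Size is only one input; workload type and expected growth, as compared in the SMP vs. MPP slide, matter as well.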
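Fault Tolerance - RAID
► Redundant Array of Independent (originally Inexpensive) Disks
  RAID Level 1
►Disk mirroring
  RAID Level 3
►Byte-level data striping with a dedicated parity disk
  RAID Level 4
►Block-level data striping with a dedicated parity disk
  RAID Level 5
►Block-level striping with parity distributed across all disks; multiple concurrent readers and writers provide higher I/O throughput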
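
The parity used by RAID levels 3, 4 and 5 is a bytewise XOR of the data blocks, which is why any single lost block can be rebuilt from the survivors. A minimal sketch, assuming equal-length blocks held in memory rather than real disks:

```python
from functools import reduce

def parity(blocks):
    """XOR equal-length blocks byte by byte to get the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(surviving_blocks, parity_block):
    """Recover the one missing block from the survivors plus the parity block."""
    return parity(list(surviving_blocks) + [parity_block])
```

XOR-ing the parity block with the surviving blocks cancels them out and leaves the missing block, so the same function serves for both writing parity and recovery.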
Common Mistakes
► “Virtual Data Warehouse” - does not incorporate a
Central Data Warehouse database
► “Stovepipe” data marts - do not integrate with a
central metadata repository
► Populating “dirty” source data - missing, inconsistent,
or erroneous data are disastrous for DSS
► Implementing enterprise data warehouse as a single,
large, top-down development effort
Data Warehousing and the Web

Benefits
► Extends the reach to more users
► Data can be shared with external users
► Security is a major issue
► Client software and management costs go down

Working in Web Time


Data Warehousing and the Web
First Generation Decision Support
[Figure]
► Browser sends an HTML request to the Web Server
► Web Server returns HTML reports
► Reports are produced from the Data Warehouse

Publish existing static reports on the web


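Data Warehousing and the Web
Second Generation Decision Support
[Figure]
► Browser sends an HTML request to the Web Server
► Web Server passes the query to the Query Engine through CGI, NSAPI or ISAPI
► Query Engine sends SQL to the Data Warehouse
► Results come back from the Data Warehouse and are returned to the Browser as an HTML page

Interactive queries on the web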
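
The second-generation flow (request in, SQL to the warehouse, results back as an HTML page) can be sketched with Python's built-in sqlite3 and html modules standing in for the query engine and the warehouse. The function name and the plain table layout are assumptions; a real deployment would sit behind a web server gateway such as CGI.

```python
import sqlite3
from html import escape

def render_query_page(con, sql, params=()):
    """Run a parameterized query and return the results as an HTML table."""
    cur = con.execute(sql, params)
    # column names come from the cursor description
    header = "".join(f"<th>{escape(col[0])}</th>" for col in cur.description)
    rows = "".join(
        "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
        for row in cur
    )
    return f"<table><tr>{header}</tr>{rows}</table>"
```

User input is passed as query parameters and every cell is escaped, which addresses the security concern raised on the benefits slide.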


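Data Warehousing and the Web
Third Generation Decision Support
[Figure]
► Browser sends an HTTP request to the Web Server and downloads an Applet or ActiveX control
► Applet/ActiveX control sends queries to the Query Engine
► Query Engine sends SQL to the Data Warehouse
► Results are returned to the Browser

Complex OLAP on the web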


Questions
