Module II
Content
• Basics of Data Integration (Extraction, Transformation, Loading):
– Concepts of Data Integration
– Needs and Advantages of using Data Integration
– Introduction to Common Data Integration Approaches
– Metadata - Types and Sources
– Introduction to Data Quality
– Data Profiling Concepts and Applications
– Introduction to ETL using Kettle
Functional Areas of BI
Sample BI Architecture
Detailed BI Architecture
Basic Elements of the Data Warehouse
Ralph Kimball, Margy Ross, The Data Warehouse Toolkit, 2nd Edition, 2002
Data Staging Area - ETL
• EXTRACTION
– reading and understanding the source data, and copying the data needed for the data warehouse into the staging area for further manipulation.
• TRANSFORMATION
– cleansing, combining data from multiple sources, deduplicating data, and assigning warehouse keys.
• LOADING
– loading the data into the data warehouse presentation area.
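To make the three steps concrete, here is a minimal, illustrative ETL sketch in Python. All names (the source CSV file, the dim_customer table, warehouse.db) are hypothetical, not from the course material; a real pipeline would use a tool such as Kettle.

import csv
import sqlite3

# EXTRACT: copy the needed source data into a staging area (here, a list).
with open("customers_source.csv", newline="") as f:   # hypothetical source file
    staging = list(csv.DictReader(f))

# TRANSFORM: cleanse, deduplicate, and assign warehouse (surrogate) keys.
seen = set()
rows = []
for rec in staging:
    natural_key = rec["customer_id"].strip()
    if natural_key in seen:                  # deduplication
        continue
    seen.add(natural_key)
    rows.append((len(rows) + 1,              # surrogate warehouse key
                 natural_key,
                 rec["name"].strip().title()))  # cleansing

# LOAD: insert into the data warehouse presentation area.
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS dim_customer "
            "(customer_key INTEGER PRIMARY KEY, customer_id TEXT, name TEXT)")
con.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
con.commit()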
Concepts of Data Integration
Definition
Challenges in Data Integration
• Development challenges
• Technological challenges
• Organizational challenges
Challenges in Data Integration: Development challenges
Unavailability of data
Common Approaches of Data Integration
ODS
• Operational Data Store
• An ODS processes the operational data to provide a homogeneous, unified view which can be utilized by analysts and report writers.
• It is different from a warehouse: it holds current or very recent data, while a warehouse holds past history.
Ralph Kimball approach
• A data warehouse is made up of all the data marts in an enterprise.
• This is a bottom-up approach.
• It is faster, cheaper, and less complex.
Inmon Approach
• A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management decisions.
• It is a top-down approach.
• It is an expensive, time-consuming, and slow process.
• Single version of truth
Single Version of Truth (SVoT)
• SVoT refers to one view [of data] that everyone in a
company agrees is the real, trusted number for some
operating data.
• It is the practice of delivering clear and accurate data to
decision-makers in the form of answers to highly
strategic questions.
• Effective decision-making assumes accurate and verified data, serving a clear and controlled purpose, that everyone trusts and is aligned on.
• SVoT enables greater data accuracy, uniqueness,
timeliness, alignment, etc.
Difference between the Kimball and Inmon Approaches
Challenges of Data Integration
• Integrating disparate data has always been a difficult task, and given the data explosion occurring in most organizations, this task is not getting any easier.
• Over 69% of respondents to a survey rated data integration issues as either a very high or high inhibitor to implementing new applications.
• The three main data integration issues were
– data quality and security,
– lack of a business case and inadequate funding, and
– a poor data integration infrastructure.
Top Data Integration Issues
Need for Data Integration
What it means?
• Data integration provides data in a specific, unified view as requested by users, applications, etc.
• The bigger the organization gets, the more data there is and the more that data needs integration.
(Figure: sources such as DB2 and Oracle consolidated into a unified SQL view of the data.)
Advantages of Data Integration
What it means?
• Of benefit to decision-makers, who have access to important information from past studies.
• Reduces cost, overlaps, and redundancies; reduces exposure to risks.
(Figure: sources such as DB2 and Oracle consolidated into a unified SQL view of the data.)
ADVANTAGES
• Integration at the lowest level, eliminating the need for integration queries.
• Runtime schematic cleaning is not needed – it is performed in the data staging environment.
• Independent of the original data source.
• Query optimization is possible.

LIMITATIONS
• The process would take a considerable amount of time and effort.
• Requires an understanding of the domain.
• More scalable only when accompanied by a metadata repository – increased load.
• Tightly coupled architecture.
Other Data Integration Approaches
Example of Instance Integration
Employee Attendance
EmpNo   EmpName        SSno
10014   Prakash Ajay   ADWP0011

Employee Payroll
EmpNo   EmpName        SSno
10014   A. Prakash     ADWP0011

Employee Leave
EmpNo   EmpName        SSno
10014   P. Ajay        ADWP0011
Example of Instance Integration (Sol.)
Project Allocate
EmpNo   EmpName        SSno
10014   Ajay Prakash   ADWP0011

Employee Attendance
EmpNo   EmpName        SSno
10014   Ajay Prakash   ADWP0011

Employee Payroll
EmpNo   EmpName        SSno
10014   Ajay Prakash   ADWP0011

Employee Leave
EmpNo   EmpName        SSno
10014   Ajay Prakash   ADWP0011
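The harmonization above amounts to record matching on a shared identifier. A minimal, hypothetical sketch in Python, assuming Project Allocate is chosen as the system of record:

# Match employee records from three systems on the shared SSno
# and propagate one canonical name to all of them.
attendance = [{"EmpNo": "10014", "EmpName": "Prakash Ajay", "SSno": "ADWP0011"}]
payroll    = [{"EmpNo": "10014", "EmpName": "A. Prakash",   "SSno": "ADWP0011"}]
leave      = [{"EmpNo": "10014", "EmpName": "P. Ajay",      "SSno": "ADWP0011"}]

# Canonical names keyed by SSno (assumption: taken from Project Allocate).
project_allocate = {"ADWP0011": "Ajay Prakash"}

def conform(records, master):
    """Overwrite EmpName with the canonical name keyed by SSno."""
    for rec in records:
        if rec["SSno"] in master:
            rec["EmpName"] = master[rec["SSno"]]
    return records

for table in (attendance, payroll, leave):
    conform(table, project_allocate)

print(attendance[0]["EmpName"])  # -> "Ajay Prakash" in every system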
Metadata Types and Sources
What is Metadata?
• In terms of a data warehouse, we can define metadata as follows:
– Metadata is the road-map to a data warehouse.
– Metadata in a data warehouse defines the
warehouse objects.
– Metadata acts as a directory. This directory helps
the decision support system to locate the contents
of a data warehouse.
Metadata Quotes
• “Metadata is a vital element of the data
warehouse.” — William Inmon
• “Metadata is the DNA of the data warehouse.”
— Ralph Kimball
• “Metadata is analogous to the data warehouse
encyclopedia.” — Ralph Kimball
Metadata in BI Architecture
Metadata is the “…..” of data
Data vs Metadata
Ten Reasons Why Metadata is Important
1. It's everywhere!
2. It meets the disparate needs of the data warehouse's technical, administrative, and business user groups.
3. It contains information at least as valuable as regular data.
4. It is used to describe the semantics of concepts.
5. It facilitates the extraction, transformation, and load process.
6. It improves data security.
7. It hides implementation details.
8. It lets us customize how the user sees the data.
9. It helps interoperability among systems.
10. It allows us to design portable solutions.
Data warehouse architecture based on metadata
Framework of Metadata Management System
• In a data warehouse system there are three typical metadata management structures:
– centralized structure,
– distributed structure, and
– federal structure
Centralized structure
• The basic concept is to establish a unified
metadata model, by which users can define and
manage various metadata, and store all the
metadata in the central metadata repository.
• Useful for medium-sized organizations.
• Accessed by all groups.
Distributed structure
• On the premise of interoperable operation among metadata repositories, the distributed structure is usually adopted.
• Each site uses its local metadata repository to manage its own metadata.
Federal structure
• Federal structure combines the advantages of
the first two structures.
• For the entire data warehouse system, it provides an overall conceptual view of the metadata management.
• Each metadata repository is maintained independently and controls its own access.
ROLAP and Metadata 1|3
ROLAP and Metadata 2|3
ROLAP and Metadata 3|3
Types of Metadata
Categories of Metadata 1|4
1. Business Metadata
2. Technical Metadata
3. Operational Metadata /Process Metadata
Categories of Metadata: Business Metadata 2|4
• It has the data ownership information,
business definition, and changing policies.
• Example: Facts, dimensions, logical
relationships, etc.
Categories of Metadata: Technical Metadata 3|4
• Technical, or physical, metadata is what is stored in your data source; it is your physical schema.
• Technical metadata also includes structural information such as primary and foreign key attributes and indices.
• Example: tables, fields, indexes, sources, targets, transformations, etc.
Categories of Metadata: Operational Metadata 4|4
• Describes operations executed on the
warehouse and their results.
• It includes currency of data and data lineage
– Currency of data means whether the data is
active, archived, or purged.
– Lineage of data means the history of data migration and the transformations applied to it.
• Example: Results of the ETL process, query
logging, etc.
Levels of Data Modeling
Oracle Administration tool
Business vs. Technical Metadata
Metadata Sources
• Data for the data warehouse comes from several
operational systems of the enterprise.
• These source systems contain different data structures.
• The data elements selected for the data warehouse have
various field lengths and data types.
• In selecting data from the source systems for the data
warehouse, you split records, combine parts of records
from different source files, and deal with multiple
coding schemes and field lengths.
• When you deliver information to the end-users, you
must be able to tie that back to the original source data
sets.
Who Uses Metadata?
Advantages
• Abstraction: the data analysts do not need to have
knowledge of the complex data sources involved in the
system. Data analysts only worry about the business
question, not about how to answer it.
• Portability: the changes on the physical model don’t
affect the logical model.
• Security: defining a strong security policy allows administrators to restrict users' access to information that they must not see.
• Customization: the information is adapted to the user.
Data Quality (DQ)
Why is there a discussion about DQ?
• While a business intelligence system makes it much simpler to analyze and report on the data loaded into a data warehouse system, the existence of data alone does not ensure that executives make decisions smoothly; the quality of the data is equally important.
Definition
• Data quality can simply be described as fitness for use of the data.
• Data has to fit a few objectives concerning its:
1. Correctness
2. Consistency
3. Completeness
4. Validity
• The most important actions are cleaning the
data and conforming it.
Six data quality dimensions
Completeness
• Expected comprehensiveness.
• Data can be complete even if optional data is missing.
• As long as the data meets the expectations then the data is
considered complete.
• For example, a customer's first name and last name are mandatory but the middle name is optional;
– so a record can be considered complete even if a middle name is not available.
• Questions you can ask yourself:
– Is all the requisite information available?
– Do any data values have missing elements? Or
– Are they in an unusable state?
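A minimal sketch of such a completeness check in Python; the field lists and sample records are hypothetical:

MANDATORY = {"first_name", "last_name"}
OPTIONAL  = {"middle_name"}   # may be missing without violating completeness

def is_complete(record):
    """A record is complete if every mandatory field is present and non-empty."""
    return all(record.get(f) not in (None, "") for f in MANDATORY)

customers = [
    {"first_name": "Ada", "last_name": "Lovelace"},                 # complete
    {"first_name": "Alan", "middle_name": "M.", "last_name": ""},   # incomplete
]
print([is_complete(c) for c in customers])  # [True, False]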
Consistency
• Data across all systems reflects the same information and is in sync across the enterprise.
• Examples:
• A business unit status is closed but there are sales for
that business unit.
• Employee status is terminated but pay status is active.
• Questions you can ask yourself:
– Are data values the same across the data sets?
– Are there any distinct occurrences of the same data
instances that provide conflicting information?
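A similar sketch for a cross-system consistency check, mirroring the terminated-employee example above (all data hypothetical):

# Flag employees whose HR status conflicts with their payroll status.
hr_status  = {"E100": "terminated", "E101": "active"}
pay_status = {"E100": "active",     "E101": "active"}

conflicts = [emp for emp, status in hr_status.items()
             if status == "terminated" and pay_status.get(emp) == "active"]
print(conflicts)  # ['E100'] – inconsistent across the two systems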
Conformity
Data Profiling Concepts and Applications
Definition
• Data profiling is a systematic analysis of the content of a data source; it is also known as data archaeology.
• The data profiling process cannot identify
inaccurate data; it can only identify business
rules violations and anomalies.
Data Profiling
• Data profiling is also known as data
assessment, data discovery or data quality
analysis.
• It is a process of examining the data available
in an existing data source and collecting
statistics and information about it.
• It is also defined as a systematic up-front analysis of the content of the data source.
• It is the first step in improving data quality.
Data Profiling Architecture

Layer             | Application
Web-UI Layer      | User interface for the ER model, business rules, KPI dashboard, batch or real-time job maintenance, and input/output data source configuration
Function Layer    | Scores of data profiling missions
Algorithm Layer   | Utilizes data mining, machine learning, statistics, or other algorithms
Parallelism Layer | Apache Spark for static and full-volume data; Apache Storm for real-time data
Data Layer        | Business rules, configuration data, and metadata stored in Hadoop or traditional databases
Hardware Layer    | Integrated CPU and GPU clusters for improving performance, especially for real-time or machine-learning tasks
Types of data profiling
Data profiling spans the technical and business domains and comprises five types, ordered roughly from the technical to the business domain: Set Profiling, Metadata Profiling, Presentation Profiling, Content Profiling, and Logical-Rule Profiling.
Data Profiling Tasks
• Metadata Profiling:
– discovering metadata information, such as data structures, creators, times of creation, primary keys, and foreign keys.
• Presentation Profiling:
– finding data patterns, including text patterns, time patterns, and number patterns, such as address, date, and telephone patterns.
• Content Profiling:
– reviewing basic data information, including accuracy, precision, timeliness, and null or non-null values.
• Set Profiling:
– analyzing data from collections or groups; for example statistics, distribution, cardinality, frequency, uniqueness, row count, and maximum or minimum values.
• Logical Rule Profiling:
– reviewing data based on business logical rules, such as data logical meanings, business rules, and functional dependencies.
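As an illustration of presentation profiling, here is a small Python sketch that generalizes values into character-class patterns; the sample telephone values are hypothetical:

import re
from collections import Counter

def pattern_of(value):
    """Generalize a value: digits -> 9, letters -> A, keep punctuation."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

phones = ["022-555-1234", "022-555-9876", "5551234"]
print(Counter(pattern_of(p) for p in phones))
# Counter({'999-999-9999': 2, '9999999': 1}) – two telephone formats in use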
Data Profiling Analysis
• The following figure shows how data profiling
analyses work together:
Column analysis
• Column analysis is a prerequisite to all other analyses
except for cross-domain analysis.
• During a column analysis job, the column or field data is
evaluated in a table or file and a frequency distribution is
created.
• A frequency distribution summarizes the results for each column, such as statistics and inferences about the characteristics of your data.
• A frequency distribution is also used as the input for
subsequent analyses such as primary key analysis and
baseline analysis.
Column analysis
• The column analysis process incorporates four
analyses:
– Domain analysis
– Data classification analysis
– Format analysis
– Data properties analysis
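A toy illustration of such a frequency distribution in Python (the sample column values are hypothetical):

from collections import Counter

column = ["M", "F", "M", "M", "f", None]
freq = Counter(column)
print(freq)                 # Counter({'M': 3, 'F': 1, 'f': 1, None: 1})
# Inferences feed the later analyses: e.g., 'f' hints at a domain violation
# (domain analysis) and the None count feeds data properties analysis.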
Column analysis | Domain analysis
(Figure: domain analysis actions – validate, archive, clean up.)
Various Stages in ETL
(Figure: stages in ETL – extract from sources, data mapping against reference data, validate in the data staging area, transform, archive, produce audit reports, and publish.)
Extract, Transform and Load
Semantic mapping
• A metadata registry can be consulted to look up data
element synonyms
• If the destination column does not match the source
column, the mappings will be made if these data elements
are listed as synonyms in the metadata registry
• This approach is only able to discover exact matches between columns of data; it will not discover any transformation logic or exceptions between columns.
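A minimal sketch of synonym-based column mapping in Python; the registry contents and column names are hypothetical:

# Metadata registry: groups of column names that are synonyms.
synonyms = [{"cust_id", "customer_id", "client_id"},
            {"dob", "birth_date", "date_of_birth"}]

def map_column(source_col, destination_cols):
    """Return the destination column that is a registered synonym, else None."""
    for group in synonyms:
        if source_col in group:
            for dest in destination_cols:
                if dest in group:
                    return dest
    return source_col if source_col in destination_cols else None

print(map_column("cust_id", ["customer_id", "birth_date"]))  # customer_id
print(map_column("revenue", ["customer_id"]))                # None – no synonym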
Data Staging
A data staging area is an intermediate storage area between the sources of information and the Data Warehouse (DW) or Data Mart (DM).
At a specific point in time, only the data that has changed since a
well-defined event back in history will be extracted. This event
may be the last time of extraction or a more complex business
event like the last booking day of a fiscal period.
To identify this delta, it must be possible to identify all the information that has changed since this specific time event. This information can be provided either by the source data itself, such as an application column reflecting the last-changed timestamp, or by a change table where an appropriate additional mechanism keeps track of the changes alongside the originating transactions.
In most cases, using the latter method means adding extraction
logic to the source system.
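A sketch of such timestamp-based delta extraction, using Python and SQLite for illustration; the orders table and last_changed column are hypothetical stand-ins for the application-maintained timestamp described above:

import sqlite3

def extract_delta(con, last_extraction_time):
    """Pull only the rows changed since the previous run, using a
    last-changed timestamp column maintained by the source application."""
    return con.execute(
        "SELECT * FROM orders WHERE last_changed > ?",
        (last_extraction_time,),
    ).fetchall()

con = sqlite3.connect("source_system.db")
delta = extract_delta(con, "2024-01-31 23:59:59")
# Persist the new high-water mark for the next incremental run.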
Online extraction
The data is extracted directly from the source
system itself.
The extraction process can connect directly to the
source system to access the source tables
themselves or to an intermediate system that stores
the data in a preconfigured manner (for example,
snapshot logs or change tables).
The intermediate system is not necessarily
physically different from the source system.
Offline extraction
The data is not extracted directly from the source
system but is staged explicitly outside the original
source system.
The data already has an existing structure (for
example, redo logs, archive logs or transportable
tablespaces) or was created by an extraction
routine.
Data Transformation

MULTISTAGE TRANSFORMATION
(Figure: data moves through staged steps – validate customer keys, convert source keys to warehouse keys, then insert into the target warehouse table.)

PIPELINED TRANSFORMATION
(Figure: the same steps chained in a single flow, reading from an external table and inserting directly into the warehouse table without intermediate staging tables.)
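A small Python sketch of the key-conversion step from the figures above; the lookup table and row contents are hypothetical:

# Convert source keys to warehouse keys before inserting into the target table.
key_lookup = {"SRC-001": 1, "SRC-002": 2}   # source key -> warehouse surrogate key

sales_rows = [{"product_key": "SRC-001", "amount": 120.0},
              {"product_key": "SRC-003", "amount": 75.0}]

valid, rejected = [], []
for row in sales_rows:
    wk = key_lookup.get(row["product_key"])
    if wk is None:
        rejected.append(row)                 # fails key validation
    else:
        valid.append({**row, "product_key": wk})

print(valid)     # rows ready to insert into the warehouse table
print(rejected)  # rows routed to an error/suspense area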
Data Loading
• The load phase loads the data into the end target, usually the
data warehouse (DW). Depending on the requirements of the
organization, this process varies widely.
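For illustration, here is a minimal idempotent load using an upsert in SQLite (assumes SQLite 3.24+; the table name and values are hypothetical):

import sqlite3

con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS fact_sales "
            "(sale_id INTEGER PRIMARY KEY, amount REAL)")
# The upsert keeps a re-run of the load from duplicating rows.
con.executemany(
    "INSERT INTO fact_sales VALUES (?, ?) "
    "ON CONFLICT(sale_id) DO UPDATE SET amount = excluded.amount",
    [(1, 120.0), (2, 75.0)],
)
con.commit()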