
Business Intelligence

Module II
Content
• Basics of Data Integration (Extraction
Transformation Loading):
– Concepts of Data Integration
– Needs and Advantages of using Data Integration
– Introduction to Common Data Integration Approaches
– Meta Data - Types and Sources
– Introduction to Data Quality
– Data Profiling Concepts and Applications
– Introduction to ETL using Kettle
Functional Areas of BI
Sample BI Architecture
Detailed BI Architecture
Basic Elements of the Data Warehouse

Ralph Kimball, Margy Ross, The Data Warehouse Toolkit, 2nd Edition, 2002
Data Staging Area - ETL
• EXTRACTION
– reading and understanding the source data and
copying the data needed for the data warehouse into
the staging area for further manipulation.
• TRANSFORMATION
– cleansing, combining data from multiple sources,
deduplicating data, and assigning warehouse keys
• LOADING
– loading the data into the data warehouse
presentation area
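A minimal sketch of these three steps, assuming an in-memory list of dicts stands in for both the source system and the warehouse table (all names are illustrative):

```python
# Minimal ETL sketch (assumptions: the "source" is an in-memory list of dicts
# and the "warehouse" is just another Python list; real systems use databases).
source_orders = [
    {"order_id": 1, "customer": " maria ", "amount": "120.50"},
    {"order_id": 2, "customer": "JOAO",    "amount": "75.00"},
]

def extract(source):
    # EXTRACTION: read the source and copy the needed data into the staging area.
    return [dict(row) for row in source]

def transform(staged):
    # TRANSFORMATION: cleanse values and assign a surrogate warehouse key.
    for key, row in enumerate(staged, start=1):
        row["customer"] = row["customer"].strip().title()
        row["amount"] = float(row["amount"])
        row["warehouse_key"] = key
    return staged

def load(rows, warehouse):
    # LOADING: write the transformed rows into the presentation area.
    warehouse.extend(rows)

warehouse_table = []
load(transform(extract(source_orders)), warehouse_table)
print(warehouse_table)
```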

Concepts of Data Integration
Definition

Process of coherent merging of data from various data sources and presenting a cohesive/consolidated view to the user.
• Involves combining data residing at different sources and providing users with a unified view of the data.
• Significant in a variety of situations, both
– commercial (e.g., two similar companies trying to merge their databases), and
– scientific (e.g., combining research results from different bioinformatics research repositories).
Data Integration Example
• Consider a university website that displays a toppers list.
• There would be many colleges affiliated to this university, each maintaining the marks of its students locally.
• The university website would need to integrate the data stored locally at each college and present it to the end user in a consistent format.
• Data integration plays a vital role in such a scenario.
– One college may maintain data in Excel files,
– another in an SQL database, and still another in an Oracle database.
– The website application would need to integrate all this data and present it in a unified manner to the end user.
According to your understanding
What are the problems faced in Data Integration?
Challenges in Data Integration

• Development challenges
• Technological challenges
• Organizational challenges
Challenges in Data Integration: Development challenges

 Translation of relational databases to object-oriented applications

 Consistent and inconsistent metadata

 Handling redundant and missing data

 Normalization of data from different sources


Challenges in Data Integration: Technological challenges

 Various formats of data

 Structured and unstructured data

 Huge volumes of data


Challenges in Data Integration: Organizational challenges

 Unavailability of data

 Manual integration risk, failure


Technologies in Data Integration

• Integration is divided into two main approaches:


 Schema integration – reconciles schema elements
 Instance integration – matches tuples and attribute values
• The technologies that are used for data integration include:
 Data interchange
 Object Brokering
 Modeling techniques
• Entity-Relational Modeling
• Dimensional Modeling
Schema Integration
• Multiple data sources may provide data on the
same entity type.
• The main goal is to allow applications to transparently view and query this data as one uniform data source; this is done using various mapping rules to handle structural differences.
Instance Integration
• Data integration from multiple heterogeneous
data sources has become a high-priority task in
many large enterprises.
• Hence, to obtain accurate semantic information about the data content, the information is retrieved directly from the data.
• It identifies and integrates all the instances of data items that represent the same real-world entity, distinct from schema integration.
Electronic Data Interchange (EDI)
• It refers to the structured transmission of data
between organizations by electronic means.
• It is used to transfer electronic documents from one computer system to another, i.e., from one trading partner to another.
• It is more than mere E-mail; for instance,
organizations might replace bills of lading and
even checks with appropriate EDI messages.
Object Brokering/Object Request Broker (ORB)

• An ORB is a piece of middleware software that allows programmers to make program calls from one computer to another via a network.
• It handles the transformation of in-process data structures to and from byte sequences.
Entity Relational Modeling
• Optimized for transactional data,
• Entity-relationship modeling (ERM) is a
database modeling method, used to produce a
type of conceptual schema or semantic data
model of a system, often a relational database,
and its requirements in a top-down fashion.
Dimensional Modeling
• Optimized for query analysis and performance,
Dimensional modeling (DM) is the name of a
logical design technique often used for data
warehouses.
• It is a design technique for databases intended
to support end-user queries in a data
warehouse and
• is oriented around understandability, in contrast to designs optimized for database administration.
Needs and Advantages

of Data Integration
ODS
• Operational data store
• ODS processes the operational data to provide
a homogeneous unified view which can be
utilized by analysts and report writers
• It is different from a warehouse.
• It holds current or very recent data, while a warehouse holds past history.
Ralph Kimball approach
• A data warehouse is made up of all the data
marts in an enterprise
• This is a bottom-up approach.
• It is faster, cheaper and less complex.
Ralph Kimball approach
Inmon Approach
• A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management decisions.
• It is a top-down approach.
• It is an expensive, time-consuming and slow process.
• Single version of truth.
Single Version of Truth (SVoT)
• SVoT refers to one view [of data] that everyone in a
company agrees is the real, trusted number for some
operating data.
• It is the practice of delivering clear and accurate data to
decision-makers in the form of answers to highly
strategic questions.
• Effective decision-making assumes accurate and verified data, serving a clear and controlled purpose, with everyone trusting it and recognising it for that purpose.
• SVoT enables greater data accuracy, uniqueness,
timeliness, alignment, etc.
Difference between the Kimball and Inmon Approaches | 1
Difference between the Kimball and Inmon Approaches | 2
Challenges of Data Integration
• Integrating disparate data has always been a difficult task, and given the data explosion occurring in most organizations, this task is not getting any easier.
• Over 69% of respondents to a survey rated data integration issues as either a very high or high inhibitor to implementing new applications.
• The three main data integration issues were
– data quality and security,
– lack of a business case and inadequate funding, and
– a poor data integration infrastructure.
Top Data Integration Issues
Need for Data Integration

What it means?
 It is done for providing data in a specific view as requested by users, applications, etc.
 The bigger the organization gets, the more data there is and the more data needs integration.
 Increases with the need for data sharing.
[Figure: DB2, SQL and Oracle sources integrated into a unified view of data]
Advantages of Using Data Integration

What it means?
 Of benefit to decision-makers, who have access to important information from past studies.
 Reduces cost, overlaps and redundancies; reduces exposure to risks.
 Helps to monitor key variables like trends and consumer behaviour, etc.
[Figure: DB2, SQL and Oracle sources integrated into a unified view of data]
Common Approaches to Data Integration
Data Integration Framework
Data Integration Approaches
• IBM defines as:
“Data integration is the combination of technical
and business processes used to combine data
from disparate sources into meaningful and
valuable information.”
• There are five common data integration approaches.
Characteristics of Data Integration
• Data integration involves a framework of applications,
techniques, technologies, and products for providing a
unified and consistent view of enterprise business data.
– Applications are custom-built and vendor-developed
solutions that utilize one or more data integration products.
– Products are off-the-shelf commercial solutions that
support one or more data integration technologies.
– Technologies implement one or more data integration
techniques.
– Techniques are technology-independent approaches for
doing data integration.
Data Integration Approaches
• Data Consolidation
• Data Propagation
• Data Virtualization
• Data Federation
• Data Warehouse
1 Data Consolidation
• Data consolidation physically brings data
together from several separate systems,
creating a version of the consolidated data in
one data store.
• Often the goal of data consolidation is to
reduce the number of data storage locations.
Extract, transform, and load (ETL) technology
supports data consolidation.
Data Consolidation
2 Data Propagation | 1
• Data propagation is the use of applications to
copy data from one location to another.
• It is event-driven and can be done synchronously
or asynchronously.
• Most synchronous data propagation supports a
two-way data exchange between the source and
the target.
• Enterprise application integration (EAI) and
enterprise data replication (EDR) technologies
support data propagation.
2 Data Propagation | 2
• EAI integrates application systems for the
exchange of messages and transactions.
• It is often used for real-time business
transaction processing.
• Integration platform as a service (iPaaS) is a
modern approach to EAI integration.
Enterprise application Integration
Architecture
Enterprise Data Replication
Architecture
2 Data Propagation | 3
• EDR typically transfers large amounts of data
between databases, instead of applications.
• Database triggers and logs are used to capture and disseminate data changes between the source and remote databases.
Data Propagation
Data Propagation
3 Data Virtualization
• Virtualization uses an interface to provide a
near real-time, unified view of data from
disparate sources with different data models.
• Data can be viewed in one location, but is not
stored in that single location.
• Data virtualization retrieves and interprets
data, but does not require uniform formatting
or a single point of access.
4 Data Federation
• Federation is technically a form of data virtualization.
• It uses a virtual database and creates a common data
model for heterogeneous data from different systems.
• Enterprise information integration (EII) is a technology
that supports data federation.
• It uses data abstraction to provide a unified view of data
from different sources.
• Virtualization and federation are good workarounds for
situations where data consolidation is cost prohibitive
or would cause too many security and compliance
issues.
Federated Data Warehouse
Architecture
EII Architecture
Enterprise Information Integration
Architecture
EII – Process Based
EII – Model based
Working of EII | 1
Working of EII | 2
Working of EII | 3
Working of EII | 4
5 Data Warehousing
• Data warehouses are storage repositories for
data.
• However, when the term “data warehousing” is used, it implies the cleansing, reformatting, and storage of data, which is basically data integration.
Data Warehouse
Data Warehouse – Advantage and Limitations

ADVANTAGES
• Integration at the lowest level, eliminating the need for integration queries.
• Runtime schematic cleaning is not needed – it is performed in the data staging environment.
• Independent of the original data source.
• Query optimization is possible.

LIMITATIONS
• The process takes a considerable amount of time and effort.
• Requires an understanding of the domain.
• More scalable when accompanied with a metadata repository – increased load.
• Tightly coupled architecture.
Other Data Integration Approaches

• Memory-mapped data structure:
 Useful when you need to do in-memory data manipulation and the data structure is large. It is mainly used on the .NET platform and is typically performed with C# or VB.NET.
 It is a much faster way of accessing the data than using a MemoryStream.
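The slide describes the .NET usage; purely as a language-neutral illustration, the same idea can be sketched with Python's standard mmap module (the file name is hypothetical, and the file is assumed to already exist and be non-empty):

```python
import mmap

# Minimal sketch: memory-map a large file so it can be read and updated in place
# without first copying it into a MemoryStream-style buffer.
# Assumption: "big_data.bin" exists and is at least a few bytes long.
with open("big_data.bin", "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mm:      # 0 = map the whole file
        header = mm[:16]                      # random access by slicing
        mm[0:4] = b"DATA"                     # in-place update, flushed lazily
        print(len(mm), header)
```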
Data Integration Approaches

• Integration is divided into two main


approaches:
– Schema integration – reconciles schema elements
– Instance integration – matches tuples and
attribute values
Schema Integration
• Multiple data sources may provide data on the
same entity type.
• The main goal is to allow applications to transparently view and query this data as one uniform data source; this is done using various mapping rules to handle structural differences.
Example of Schema Integration
Instance Integration
• Data integration from multiple heterogeneous
data sources has become a high-priority task in
many large enterprises.
• Hence, to obtain accurate semantic information about the data content, the information is retrieved directly from the data.
• It identifies and integrates all the instances of data items that represent the same real-world entity, distinct from schema integration.
Example of Instance Integration
Project Allocate
EmpNo EmpName SSno
10014 Ajay Prakash ADWP0011

Employee Attendance
EmpNo EmpName SSno
10014 Prakash Ajay ADWP0011
Employee Payroll
EmpNo EmpName SSno
10014 A. Prakash ADWP0011

Employee Leave
EmpNo EmpName SSno
10014 P. Ajay ADWP0011
Example of Instance Integration (Sol.)
Project Allocate
EmpNo EmpName SSno
10014 Ajay Prakash ADWP0011

Employee Attendance
EmpNo EmpName SSno
10014 Ajay Prakash ADWP0011
Employee Payroll
EmpNo EmpName SSno
10014 Ajay Prakash ADWP0011

Employee Leave
EmpNo EmpName SSno
10014 Ajay Prakash ADWP0011
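A minimal sketch of how the four source records above could be matched and consolidated on the stable SSno identifier (the tables are represented here as plain Python dicts, and the "longest name wins" survivorship rule is only illustrative):

```python
from collections import defaultdict

# Minimal sketch of instance integration (assumption: the four source "tables"
# are represented as plain Python dicts; column names follow the slide).
records = [
    {"source": "Project Allocate",    "EmpNo": 10014, "EmpName": "Ajay Prakash", "SSno": "ADWP0011"},
    {"source": "Employee Attendance", "EmpNo": 10014, "EmpName": "Prakash Ajay", "SSno": "ADWP0011"},
    {"source": "Employee Payroll",    "EmpNo": 10014, "EmpName": "A. Prakash",   "SSno": "ADWP0011"},
    {"source": "Employee Leave",      "EmpNo": 10014, "EmpName": "P. Ajay",      "SSno": "ADWP0011"},
]

# Group the instances on the stable identifier (SSno)...
by_ssno = defaultdict(list)
for rec in records:
    by_ssno[rec["SSno"]].append(rec)

# ...then pick one canonical name (here: the longest variant) and propagate it
# to every matched record.
for group in by_ssno.values():
    canonical = max((r["EmpName"] for r in group), key=len)   # "Ajay Prakash"
    for r in group:
        r["EmpName"] = canonical

print(records)   # all four rows now carry the same EmpName
```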
Metadata Types and Sources
What is Metadata?
• In terms of data warehouse, we can define
metadata as follows:
– Metadata is the road-map to a data warehouse.
– Metadata in a data warehouse defines the
warehouse objects.
– Metadata acts as a directory. This directory helps
the decision support system to locate the contents
of a data warehouse.
Metadata Quotes
• “Metadata is a vital element of the data
warehouse.” — William Inmon
• “Metadata is the DNA of the data warehouse.”
— Ralph Kimball
• “Metadata is analogous to the data warehouse
encyclopedia.” — Ralph Kimball
Metadata in BI Architecture
Metadata is the “…..” of data
Data vs Metadata
Ten Reasons Why Metadata is
Important
1. It’s everywhere!
2. It meets the disparate needs of the data warehouse's technical, administrative, and business user groups.
3. It contains information at least as valuable as regular data.
4. It is used to describe the semantics of concepts.
5. It facilitates the extraction, transformation and load process.
6. It improves data security.
7. It hides implementation details.
8. We can customize how the user sees the data.
9. It helps interoperability among systems.
10. It allows us to design portable solutions.
Data warehouse architecture based on
metadata
Framework of Metadata Management
System
• In a data warehouse system there are three typical metadata management structures:
– centralized structure,
– distributed structure, and
– federal structure.
Centralized structure
• The basic concept is to establish a unified
metadata model, by which users can define and
manage various metadata, and store all the
metadata in the central metadata repository.
• Useful for medium-sized organizations,
• Accessed by all groups.
Distributed structure
• On the premise of interoperable operation among metadata repositories, the distributed structure is usually adopted.
• Each local metadata repository manages its own data.
Federal structure
• Federal structure combines the advantages of
the first two structures.
• For the entire data warehouse system, it provides an overall conceptual view of the metadata management.
• Each metadata repository is maintained independently and controls its own access.
ROLAP and Metadata 1|3
ROLAP and Metadata 2|3
ROLAP and Metadata 3|3
Types of Metadata
Categories of Metadata 1|4
1. Business Metadata
2. Technical Metadata
3. Operational Metadata /Process Metadata
Categories of Metadata: Business Metadata 2|4
• It has the data ownership information,
business definition, and changing policies.
• Example: Facts, dimensions, logical
relationships, etc.
Categories of Metadata: Technical Metadata 3|4
• Technical, or physical, metadata is what is stored in your data source: it is your physical schema.
• Technical metadata also includes structural information such as primary and foreign key attributes and indices.
• Example: tables, fields, indexes, sources, targets, transformations, etc.
Categories of Metadata: Operational Metadata 4|4
• Describes operations executed on the
warehouse and their results.
• It includes currency of data and data lineage
– Currency of data means whether the data is
active, archived, or purged.
– Lineage of data means the history of data
migrated and transformation applied on it.
• Example: Results of the ETL process, query
logging, etc.
Levels of Data Modeling
Oracle Administration tool
Business vs. Technical Metadata
Metadata Sources
• Data for the data warehouse comes from several
operational systems of the enterprise.
• These source systems contain different data structures.
• The data elements selected for the data warehouse have
various field lengths and data types.
• In selecting data from the source systems for the data
warehouse, you split records, combine parts of records
from different source files, and deal with multiple
coding schemes and field lengths.
• When you deliver information to the end-users, you
must be able to tie that back to the original source data
sets.
Who Uses Metadata?
Advantages
• Abstraction: the data analysts do not need to have
knowledge of the complex data sources involved in the
system. Data analysts only worry about the business
question, not about how to answer it.
• Portability: the changes on the physical model don’t
affect the logical model.
• Security: defining a strong security policy allows the administrators to restrict users' access to information that they must not see.
• Customization: the information is adapted to the user.
Data Quality (DQ)
Why there is a discussion about DQ
• While a business intelligence system makes it much simpler to analyze and report on the data loaded into a data warehouse, the existence of data alone does not ensure that executives make decisions smoothly.
• The quality of the data is just as important as its mere existence.
Definition
• Data quality can simply be described as a
fitness for use of data.
• Data has to fit a few objectives concerning its:
1. Correctness
2. Consistency
3. Completeness
4. Validity
• The most important actions are cleaning the
data and conforming it.
Six data quality dimensions
Completeness
• Expected comprehensiveness.
• Data can be complete even if optional data is missing.
• As long as the data meets the expectations then the data is
considered complete.
• Example, a customer’s first name and last name are
mandatory but middle name is optional;
– so a record can be considered complete even if a
middle name is not available.
• Questions you can ask yourself:
– Is all the requisite information available?
– Do any data values have missing elements? Or
– Are they in an unusable state?
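A minimal sketch of a completeness check along these lines, assuming customer records are held as Python dicts with first_name and last_name mandatory and middle_name optional:

```python
# Minimal completeness check (assumption: records are dicts; field names are illustrative).
customers = [
    {"first_name": "Maria", "middle_name": None, "last_name": "Suarez"},
    {"first_name": "Joao",  "middle_name": "P.", "last_name": None},
]

MANDATORY = ("first_name", "last_name")

def is_complete(record):
    # A record is complete if every mandatory field is present and non-empty;
    # a missing optional middle_name does not count against completeness.
    return all(record.get(field) not in (None, "") for field in MANDATORY)

complete = sum(is_complete(c) for c in customers)
print(f"{complete}/{len(customers)} records complete")   # -> 1/2 records complete
```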
Consistency
• Data across all systems reflects the same information and is in sync across the enterprise.
• Examples:
• A business unit status is closed but there are sales for
that business unit.
• Employee status is terminated but pay status is active.
• Questions you can ask yourself:
– Are data values the same across the data sets?
– Are there any distinct occurrences of the same data
instances that provide conflicting information?
Conformity

• Data is following the set of standard data


definitions like data type, size and format.
• For example, date of birth of customer is in the
format “mm/dd/yyyy”
• Questions you can ask yourself:
– Do data values comply with the specified formats?
– If so, do all the data values comply with those
formats?
– Maintaining conformance to specific formats is
important.
Integrity

• Integrity means validity of data across the relationships and


ensures that all data in a database can be traced and
connected to other data.
• For example,
– In a customer database, there should be a valid customer,
addresses and relationship between them.
– If there is an address relationship data without a customer then
that data is not valid and is considered an orphaned record.
• Ask yourself:
– Is there any data missing important relationship linkages?
• The inability to link related records together may actually
introduce duplication across your systems.
Accuracy

• Degree to which data correctly reflects the real world object


OR an event being described.
• Examples:
– Sales of the business unit are the real value.
– Address of an employee in the employee database is the real
address.
• Questions you can ask yourself:
– Do data objects accurately represent the “real world” values
they are expected to model?
– Are there incorrect spellings of product or person names,
addresses, and even untimely or not current data?
• These issues can impact operational and analytical
applications.
Timeliness

• Timeliness references whether information is available when it


is expected and needed.
• Timeliness of data is very important.
• This is reflected in:
– Companies that are required to publish their quarterly
results within a given frame of time
– Customer service providing up-to-date information to the customers
• The timeliness depends on user expectation.
• Online availability of data could be required for room
allocation system in hospitality, but nightly data could be
perfectly acceptable for a billing system.
Source of Data Quality Compromise
• For any DW system, data flows from the operational systems to the data staging area, where data collected from multiple sources is integrated, cleansed and transformed; from the staging area, data is loaded into the data warehouse.
• Quality of data can be compromised depending upon how data is extracted, integrated, cleansed, transformed and loaded into the DW.
• These stages are source of error or stages of possible quality
compromise.
• Data quality compromised in one stage leads to errors in the next
successive stages.
Ten Reasons for compromise in DQ
1. Selection of sources that do not comply with business rules
2. Lack of data validation techniques practiced in source
systems
3. Representation of data in different formats
4. Inability to update data in a timely manner
5. Data inconsistency problems in source systems
6. Lack of data quality testing performed on source systems
7. Missing values in certain columns of source system’s table
8. Different default values used for missing columns in the
source system
9. Absence of a centralized metadata component.
10. Ignoring the storage of data cleaning rules in the metadata
repository.
Data quality issues for stakeholders
Data Quality Tests
• There are two types of tests:
– syntax tests and
– reference tests.
• The syntax tests will report dirty data based on
character patterns, invalid characters, incorrect
lower or upper case order, etc.
• The reference tests will check the integrity of the
data according to the data model. So, for example
a customer ID which does not exist in a data
warehouse customers dictionary table will be
reported.
ETL Process for Data Quality
Data Quality in ETL
Checking data quality during ETL testing
involves performing quality checks on data
that is loaded in the target system. It includes
the following tests
1. Number check
– The number format should be the same across the target system. For example, if the source system formats a column number as x.30 but the target expects only 30, the value has to be loaded without the x. prefix in the target column number.
Data Quality in ETL
2. Date Check
– The Date format should be consistent in both the
source and the target systems.
– For example, it should be same across all the records.
The Standard format is: yyyy-mm-dd.
3. Precision Check
– Precision value should display as expected in the
target table.
– For example, in the source table the value is 15.2323422, but in the target table it should display as 15.23 or be rounded to 15.
Data Quality in ETL
4. Data Check
– It involves checking the data as per the business
requirement.
– The records that don’t meet certain criteria should be
filtered out.
– Example − Only those records whose date_id >=2015
and Account_Id != ‘001’ should load in the target table.
5. Null Check
– Some columns should have Null as per the requirement
and possible values for that field.
– Example − Termination Date column should display Null
unless and until its Active status Column is “T” or
“Deceased”.
Data Quality challenges
• The challenges addressed by data quality
platforms can be broken down into business
and technical requirements.
Technical DQ challenges
• The technical data quality problems are usually caused
by data entry errors, system field limitations, mergers
and acquisitions, system migrations.
• Example:
– Inconsistent standards and discrepancies in data format,
structure and values
– Missing data, fields filled with default values or nulls
– Spelling errors
– Data in wrong fields
– Buried information
– Data anomalies
Business DQ challenges
• To be able to measure Data Quality, data should
be divided into quantifiable units (data fields and
rules) that can be tested for completeness and
validity.
• Example:
– Reports are accurate and credible
– Data driven business process work flawlessly
– Shipments go out on time
– Invoices are accurate
– Data quality should be a driver for successful ERP,
CRM or DSS implementations
Data Quality Parameters and Metrics
Data Profiling Concepts

And Applications
Definition
• Data profiling is a systematic analysis of the content of a data source; it is also known as data archaeology.
• The data profiling process cannot identify inaccurate data; it can only identify business rule violations and anomalies.
Data Profiling
• Data profiling is also known as data
assessment, data discovery or data quality
analysis.
• It is a process of examining the data available
in an existing data source and collecting
statistics and information about it.
• It is also defined as a systematic up-front analysis of the content of the data source.
• It is the first step in improving data quality.
Data Profiling Architecture
Data Profiling Architecture
• Web-UI Layer: user interface for ER model, business rules, KPI dashboard, batch or real-time job maintenance, input/output data source configuration
• Function Layer: scores of data profiling missions
• Algorithm Layer: utilizes data mining, machine learning, statistics, or other algorithms
• Parallelism Layer: Apache Spark for static and full-volume data; Apache Storm for real-time data
• Data Layer: business rules, configuration data, metadata stored in Hadoop or traditional databases
• Hardware Layer: integrates CPU and GPU clusters for improving performance, especially for real-time or machine learning tasks
Types of data profiling

• There are three main types of data profiling:


– Structure discovery
– Content discovery
– Relationship discovery
Structure discovery
• Validating that data is consistent and formatted
correctly, and performing mathematical checks
on the data (e.g. sum, minimum or maximum).
• Structure discovery helps understand how well
data is structured.
• For example, what percentage of phone
numbers do not have the correct number of
digits.
Content discovery

• Looking into individual data records to


discover errors.
• Content discovery identifies
– which specific rows in a table contain problems,
and
– which systemic issues occur in the data
• For example, phone numbers with no area
code
Relationship discovery

• Discovering how parts of the data are interrelated.


• For example, key relationships between database
tables, references between cells or tables in a
spreadsheet.
• Understanding relationships is crucial to reusing
data;
– related data sources should be united into one or
imported in a way that preserves important
relationships
Data Profiling Tasks
• Data profiling tasks are categorized into two domains: the technical domain and the business domain.
• The task types are: Metadata Profiling, Presentation Profiling, Content Profiling, Set Profiling, and Logical-Rule Profiling (described on the next slide).
Data Profiling Tasks
• Metadata Profiling:
– discovering metadata information, such as data structures, creators, times
of creation, primary keys, and foreign keys.
• Presentation Profiling:
– finding data patterns, including text patterns, time patterns, and number
patterns. such as address pattern, date patterns, and telephone patterns.
• Content Profiling:
– reviewing basic data information, including accuracy, precision, timeliness, and null or non-null values.
• Set Profiling:
– analyzing data from collections or groups; for example statistics,
distribution, cardinality, frequency, uniqueness, row count, maximum or
minimum values
• Logical-Rule Profiling:
– reviewing data based on business logical rules, such as data logical meanings, business rules, and functional dependencies.
Data Profiling Analysis
• The following figure shows how data profiling
analyses work together:
Column analysis
• Column analysis is a prerequisite to all other analyses
except for cross-domain analysis.
• During a column analysis job, the column or field data is
evaluated in a table or file and a frequency distribution is
created.
• A frequency distribution summarizes the results for each
column such as statistics and inferences about the
characteristics of your data.
• A frequency distribution is also used as the input for
subsequent analyses such as primary key analysis and
baseline analysis.
• The column analysis process incorporates four analyses:
Column analysis
• The column analysis process incorporates four
analyses:
– Domain analysis
– Data classification analysis
– Format analysis
– Data properties analysis
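A minimal sketch of the frequency-distribution step of column analysis, assuming the column values have already been pulled into a Python list (the sample values are illustrative):

```python
from collections import Counter

# Minimal sketch: build a frequency distribution for one column and derive
# simple statistics and inferences from it.
city_column = ["Pune", "Mumbai", "Pune", "", "Nagpur", "Pune", None, "Mumbai"]

freq = Counter(city_column)
total = len(city_column)

print("Distinct values :", len(freq))
print("Null/blank count:", freq.get(None, 0) + freq.get("", 0))
for value, count in freq.most_common():
    print(f"{value!r:10} {count:3}  {count/total:6.1%}")
```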
Column analysis | Domain analysis

• Identifies invalid and incomplete data values.


Invalid and incomplete values impact the
quality of your data by making the data
difficult to access and use.
• You can use the results from domain analysis
when you want to use a data cleansing tool to
remove the anomalies.
Column analysis | Data classification
analysis
• Infers a data class for each column in your
data.
• Data classes categorize data. If your data is
classified incorrectly, it is difficult to compare
the data with other domains of data.
• You compare domains of data when you want
to find data that contains similar values.
Column analysis | Data properties
analysis
• Data properties analysis compares the
accuracy of defined properties about your data
before analysis to the system-inferred
properties that are made during analysis.
• Data properties define the characteristics of
data such as field length or data type.
• Consistent and well-defined properties help to
ensure that data is used efficiently.
Column analysis | Format analysis

• Creates a format expression for the values in


your data. A format expression is a pattern that
contains a character symbol for each distinct
character in a column. For example, each
alphabetic character might have a character
symbol of A and numeric characters might
have a character symbol of 9. Accurate formats
ensure that your data is consistent with defined
standards.
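A minimal sketch of how such a format expression could be derived, using 'A' for alphabetic characters and '9' for digits as described above (the sample values are illustrative):

```python
# Minimal sketch of format analysis: map every alphabetic character to 'A' and
# every digit to '9', keeping separators, to produce a format expression.
def format_expression(value):
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)        # keep separators such as '-' or '/'
    return "".join(out)

phone_numbers = ["020-2567-8901", "98765 43210", "n/a"]
for value in phone_numbers:
    print(value, "->", format_expression(value))
# 020-2567-8901 -> 999-9999-9999
# 98765 43210   -> 99999 99999
# n/a           -> A/A
```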
Key and relationship analysis
• During a key and cross-domain analysis job, your data is
assessed for relationships between tables.
• The values in your data are evaluated for foreign key
candidates, and defined foreign keys.
• A column might be inferred as a candidate for a foreign key
when the values in the column match the values of an
associated primary or natural key. If a foreign key is
incorrect, the relationship that it has with a primary or
natural key in another table is lost.
• It helps to determine whether multiple columns share a
common domain.
• A common domain exists when multiple columns contain
overlapping data.
Baseline analysis
• When you want to know if your data has
changed, you can use baseline analysis to
compare the column analysis results for two
versions of the same data source.
• The content and structure of data changes over
time when it is accessed by multiple users.
• When the structure of data changes, the system
processes that use the data are affected.
Cross-domain analysis
• Run a key and cross-domain analysis job to
identify columns that have common domain
values.
• The system determines whether a column has a
minimum percentage of distinct values that also
appear in another column.
• If the column meets the criteria for having a
common domain, the column is flagged.
• Columns that share a common domain contain
overlapping and potentially redundant data.
Foreign key analysis
• If data already contains defined foreign keys, the columns will be marked
as defined foreign keys during analysis.
• Identifying foreign keys
– To define and validate the relationships between tables, key and cross-domain analysis is
run to find foreign key candidates, select foreign keys, and then validate their referential
integrity.
– foreign keys are identified in the Key and Cross-Domain Analysis workspace.
• Running a key and cross-domain analysis job to identify foreign keys
– Key and cross-domain analysis compares columns in one table to columns in other tables
to determine if a key relationship exists between the tables.
• Referential integrity analysis
– A referential integrity analysis job runs to evaluate whether all of the references between the foreign keys and the primary keys in your data are valid.
– The results help you choose foreign keys or remove foreign key violations from the data.
Primary key analysis
• Primary key analysis allows you to validate the primary keys that are already defined in the data and to identify columns that are candidates for primary keys.
• Single and multiple column primary keys
– To understand the structure and integrity of data, key and cross-domain analysis is run to identify and validate primary key candidates.
• Analyzing a single column primary key
– To identify a single column primary key for a table or file, a column that is a unique identifier for the data is located.
• Analyzing a multiple column primary key
– A key analysis job is run to compare concatenated columns in the data and infer candidates for primary keys.
– Multiple column analysis can be used to identify primary keys if no single column meets the criteria for a primary key.
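A minimal sketch of single-column primary key candidate detection, assuming the table is a small in-memory list of dicts (uniqueness and non-null checks only; real profiling tools also consider data types and stability):

```python
# Minimal sketch: a column is a primary key candidate if its values are unique
# and never null across all rows (sample data is illustrative).
rows = [
    {"EmpNo": 10014, "EmpName": "Ajay Prakash",   "Dept": "BI"},
    {"EmpNo": 10015, "EmpName": "Maria Suarez",   "Dept": "BI"},
    {"EmpNo": 10016, "EmpName": "Joao Guimaraes", "Dept": "ETL"},
]

def primary_key_candidates(rows):
    candidates = []
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        if None not in values and len(set(values)) == len(values):
            candidates.append(col)   # unique and non-null -> candidate key
    return candidates

print(primary_key_candidates(rows))   # ['EmpNo', 'EmpName']; EmpNo is the better choice
```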
Advanced data profiling techniques:
• Key integrity
– ensures keys are always present in the data, using zero/blank/null
analysis.
– Also, helps identify orphan keys, which are problematic for ETL
and future analysis.
• Cardinality
– checks relationships like one-to-one, one-to-many, many-to-
many, between related data sets.
– This helps BI tools perform inner or outer joins correctly.
• Pattern and frequency distributions
– checks if data fields are formatted correctly, for example if
emails are in a valid format.
– Extremely important for data fields used for outbound
communications (emails, phone numbers, addresses).
Data Profiling Tool
Prerequisite
Before starting to look for data anomalies, we will first learn about data quality.
• We say that the data has a high quality if it is:
– Correct - consistent with reality.
– Unambiguous - has only one meaning.
– Consistent - use only one convention to convey its
meaning.
– Complete - it has no missing values or null values.
Know your data | 1
What Types Of Analysis Are Performed?
• Completeness Analysis
– How often is a given attribute populated, versus
blank or null?
• Uniqueness Analysis
– How many unique (distinct) values are found for a
given attribute across all records? Are there
duplicates? Should there be?
Know your data | 2
• Values Distribution Analysis
– What is the distribution of records across different
values for a given attribute?
• Range Analysis
– What are the minimum, maximum, average and
median values found for a given attribute?
• Pattern Analysis
– What formats were found for a given attribute, and
what is the distribution of records across these
formats?
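A minimal sketch of range analysis for one numeric attribute, assuming the values are already extracted into a list (the attribute and values are illustrative):

```python
import statistics

# Minimal sketch of range analysis: minimum, maximum, average and median of one attribute.
order_amounts = [120.0, 75.5, 300.0, 75.5, 18.0, 240.0]

print("min   :", min(order_amounts))
print("max   :", max(order_amounts))
print("mean  :", statistics.mean(order_amounts))
print("median:", statistics.median(order_amounts))
```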
Benefits
The benefits of data profiling are to:
• Improve data quality
• Shorten the implementation cycle of major
projects
• Improve users' understanding of data.
• Discovering business knowledge embedded in
data itself
• Improving data accuracy in corporate databases.
Applications | 1
An example of data profiling is its relationship with
Health Tracking.
• Data is collected from apps, and other media outlets to
collect a general understanding of the health and well-
being of civilization.
• Data is collected from apps upon various concepts,
such as fitness, menstruation cycles, mental health, and
health conditions such as diabetes, cardiovascular
failure, and obesity.
• The statistics gained from these platforms are then
utilized to gain extensive multiple perspectives and
experiences from users.
Applications | 2
• This information can be passed on to health care professionals to determine the most common ground on which users stand within their health.
• It can also give a glimpse into whether utilizing the app is improving the health of patients, and what more can be done to assist.
• It allows those in health care to tailor the app to the needs of patients, and also to see if the app truly helps the patient.
• A concern that runs through this, however, is the tampering of information.
• Assuming the majority of users input correct information, the outcome will most typically balance out.
Introduction to ETL using Kettle
Kettle |1
• Pentaho Data Integration - Kettle ETL tool
• K.E.T.T.L.E. - Kettle ETTL Environment
– has been recently acquired by the Pentaho group and renamed to Pentaho Data Integration.
• Kettle is a leading open source ETL application on the market.
Kettle | 2
• It is classified as an ETL tool, however the
concept of classic ETL process (extract,
transform, load) has been slightly modified in
Kettle as it is composed of four
elements, ETTL, which stands for:
– Data extraction from source databases
– Transport of the data
– Data transformation
– Loading of data into a data warehouse
Kettle Component |1
• The main components of Pentaho Data
Integration are:
– Spoon - a graphical tool which makes it easy to design and create ETTL process transformations.
– It performs the typical data flow functions like
reading, validating, refining, transforming, writing
data to a variety of different data sources and
destinations.
Kettle Component | 2
• Transformations designed in Spoon can be run with Kettle Pan and Kitchen.
• Pan - is an application dedicated to run data
transformations designed in Spoon.
• Chef - a tool to create jobs which automate the database
update process in a complex way
• Kitchen - it's an application which helps execute the
jobs in a batch mode, usually using a schedule which
makes it easy to start and control the ETL processing
• Carte - a web server which allows remote monitoring
of the running Pentaho Data Integration ETL processes
through a web browser.
Supported databases in Kettle ETL
- Any database using ODBC on Windows
- Oracle
- MySQL
- MS Access
- MS SQL Server
- IBM DB2
- PostgreSQL
- Sybase
- dBase
- Firebird SQL
- MaxDB (SAP DB)
Transformation
• A Transformation is made of Steps, linked by
Hops. These Steps and Hops form paths through
which data flows.
• The proposed task will be accomplished in three
subtasks:
– Creating a new Transformation
– Designing the basic flow of the transformation, by
adding steps and hops
– Configuring the steps for the dataset and the desired
actions
Step in Kettle
• A Step is the minimal unit inside a
Transformation.
• A wide variety of Steps are available, grouped
into categories like Input and Output, among
others.
• Each Step is designed to accomplish a specific
function, such as generating a random number
or inserting rows into a database table.
Hop in Kettle
• A Hop is a graphical representation of data
flowing between two Steps, with an origin and
a destination.
• The data that flows through that Hop
constitutes the Output Data of the origin Step,
and the Input Data of the destination Step.
• More than one Hop could leave a Step
Hello World Example
• It will introduce you to:
– Working with the Spoon tool
– Transformations
– Steps and Hops
– Predefined variables
– Previewing and Executing from Spoon
– Executing Transformations from a terminal
window with the Pan tool.
Objective of Hello World Proj
We want to create an XML file containing a greeting for each person listed in a CSV file.
Input
Let's suppose that we have a CSV file containing a
list of people:
last_name,name
Suarez,Maria
Guimaraes,Joao
Rush,Jennifer
Ortiz,Camila
Rodriguez,Carmen
da Silva,Zoe
Output
Desired Output:
<Rows>
<row>
<msg>Hello, Maria!</msg>
</row>
<row>
<msg>Hello, Joao!</msg>
</row>

</Rows>
Our Transformation
Our transformation has to do the following:
1. Read the CSV file
2. Build the greetings message
3. Save the greetings in the XML file
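The transformation itself is built graphically in Spoon. Purely for comparison, the same three steps could be sketched in plain Python as below (file names follow the example; this is not how Kettle implements the transformation internally):

```python
import csv
import xml.etree.ElementTree as ET

# Minimal sketch of the Hello World logic (assumption: the input file is
# "people.csv" with the header "last_name,name" as in the sample above).
rows_element = ET.Element("Rows")

with open("people.csv", newline="") as f:
    for person in csv.DictReader(f):              # 1. read the CSV file
        msg = f"Hello, {person['name']}!"          # 2. build the greetings message
        row = ET.SubElement(rows_element, "row")
        ET.SubElement(row, "msg").text = msg

ET.ElementTree(rows_element).write("hello.xml")    # 3. save the greetings in the XML file
```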
More about ETL
Various Stages in ETL
Cycle initiation
Build reference data
Extract (actual data)
Validate
Transform (clean, apply business rules)
Stage (load into staging tables)
Audit reports (success/failure log)
Publish (load into target tables)
Archive
Clean up
Various Stages in ETL
[Figure: ETL flow showing data mapping, reference data, extract, validate, transform, data staging, stage, audit reports, publish and archive]
Extract, Transform and Load

• Extract, transform, and load (ETL) in database usage (and especially in


data warehousing) involves:
 Extracting data from different sources
 Transforming it to fit operational needs (which can include quality
levels)
 Loading it into the end target (database or data warehouse)
• Allows the creation of efficient and consistent databases
• While ETL is usually referred to in the context of a data warehouse, the term in fact refers to a process that can load any database.
• Usually ETL implementations store an audit trail on positive and negative
process runs.
Data Mapping

• The process of creating data element mapping


between two distinct data models
• It is used as the first step towards a wide variety of
data integration tasks
• The various methods of data mapping are
Hand-coded, graphical manual
Data-driven mapping
Semantic mapping
Data Mapping

 Hand-coded, graphical manual


• Graphical tools that allow a user to “draw” lines from
fields in one set of data to fields in another
 Data-driven mapping
• Evaluating actual data values in two data sources using
heuristics and statistics to automatically discover complex
mappings
Data Mapping

 Semantic mapping
• A metadata registry can be consulted to look up data
element synonyms
• If the destination column does not match the source
column, the mappings will be made if these data elements
are listed as synonyms in the metadata registry
• Only able to discover exact matches between columns of
data and will not discover any transformation logic or
exceptions between columns
Data Staging
A data staging area is an intermediate storage area between the
sources of information and the Data Warehouse (DW) or Data Mart
(DM)

• A staging area can be used for any of the following purposes:


 Gather data from different sources at different times
 Load information from the operational database
 Find changes against current DW/DM values.
 Data cleansing
 Pre-calculate aggregates. 

Reasons for “Dirty” Data
 Dummy Values
 Absence of Data
 Multipurpose Fields
 Cryptic Data
 Contradicting Data
 Inappropriate Use of Address Lines
 Violation of Business Rules
 Reused Primary Keys,
 Non-Unique Identifiers
 Data Integration Problems
Examples
• Here’s a couple of examples:
– Dummy data -- a clerk enters 999-99-9999 as a
SSN rather than asking the customer for theirs.
– Reused primary keys -- a branch bank is closed.
Several years later, a new branch is opened, and
the old identifier is used again.
Data Cleansing
• Source systems contain “dirty data” that must be
cleansed
• ETL software contains rudimentary data cleansing
capabilities
• Specialized data cleansing software is often used.
Important for performing name and address
correction and householding functions
• Leading data cleansing vendors include Vality
(Integrity), Harte-Hanks (Trillium), and Firstlogic
(i.d.Centric)
Steps in Data Cleansing
 Parsing
 Correcting
 Standardizing
 Matching
 Consolidating
Parsing
• Parsing locates and identifies individual data
elements in the source files and then isolates
these data elements in the target files.
• Examples include parsing the first, middle, and
last name; street number and street name; and
city and state.
Correcting
• Corrects parsed individual data components
using sophisticated data algorithms and
secondary data sources.
• Examples include replacing a vanity address and adding a zip code.
Standardizing
• Standardizing applies conversion routines to
transform data into its preferred (and
consistent) format using both standard and
custom business rules.
• Examples include adding a pre name,
replacing a nickname, and using a preferred
street name.
Matching
• Searching and matching records within and
across the parsed, corrected and standardized
data based on predefined business rules to
eliminate duplications.
• Examples include identifying similar names
and addresses.
Consolidating
 Analyzing and identifying relationships
between matched records and
consolidating/merging them into ONE
representation.
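A minimal sketch of the parsing, standardizing and matching steps, with illustrative rules only (real cleansing tools use far more sophisticated algorithms and secondary data sources):

```python
# Minimal sketch of parse -> standardize -> match on a free-text name field.
NICKNAMES = {"bill": "william", "bob": "robert"}   # standardization rule (illustrative)

def parse(full_name):
    # Parsing: isolate the individual data elements (first and last name).
    first, _, last = full_name.strip().partition(" ")
    return {"first": first, "last": last}

def standardize(record):
    # Standardizing: lower-case and replace nicknames with preferred names.
    first = record["first"].lower()
    record["first"] = NICKNAMES.get(first, first)
    record["last"] = record["last"].lower()
    return record

def match(a, b):
    # Matching: flag probable duplicates using the standardized fields.
    return a == b

r1 = standardize(parse("Bill Smith"))
r2 = standardize(parse("William Smith"))
print(match(r1, r2))   # True -> consolidate/merge into one representation
```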
Data Extraction

• Extraction is the operation of extracting data from the source


system for further use in a data warehouse environment. This is the first step in the ETL process.

• Designing this process means making decisions about the


following main aspects:
 Which extraction method would I choose?
 How do I provide the extracted data for further processing?
Data Extraction (cont…)

The data has to be extracted both logically and physically.


• The logical extraction method
 Full extraction
 Incremental extraction

• The physical extraction method


 Online extraction
 Offline extraction
Full extraction
The data is extracted completely from the source
system. Since this extraction reflects all the data
currently available on the source system, there's no
need to keep track of changes to the data source
since the last successful extraction.
The source data will be provided as-is and no
additional logical information (for example,
timestamps) is necessary on the source site.
Incremental extraction

 At a specific point in time, only the data that has changed since a
well-defined event back in history will be extracted. This event
may be the last time of extraction or a more complex business
event like the last booking day of a fiscal period.
 To identify this delta change there must be a possibility to
identify all the changed information since this specific time
event. This information can be either provided by the source data
itself like an application column, reflecting the last-changed
timestamp or a change table where an appropriate additional
mechanism keeps track of the changes besides the originating
transactions.
 In most cases, using the latter method means adding extraction
logic to the source system.
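A minimal sketch of timestamp-based incremental extraction, assuming the source rows carry a last-changed column and the ETL job remembers its previous run time (all names and values are illustrative):

```python
from datetime import datetime

# Minimal sketch: pull only the rows changed since the last successful extraction.
source_rows = [
    {"id": 1, "amount": 100, "last_changed": datetime(2024, 1, 1, 8, 0)},
    {"id": 2, "amount": 250, "last_changed": datetime(2024, 1, 3, 9, 30)},
]

last_extraction = datetime(2024, 1, 2, 0, 0)   # the well-defined event "back in history"

def incremental_extract(rows, since):
    # Only rows changed after the previous extraction are pulled this cycle.
    return [row for row in rows if row["last_changed"] > since]

delta = incremental_extract(source_rows, last_extraction)
print(delta)                       # only the row with id=2 is extracted
last_extraction = datetime.now()   # remember the new extraction point for the next run
```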
Online extraction
The data is extracted directly from the source
system itself.
The extraction process can connect directly to the
source system to access the source tables
themselves or to an intermediate system that stores
the data in a preconfigured manner (for example,
snapshot logs or change tables).
The intermediate system is not necessarily
physically different from the source system.
Offline extraction
The data is not extracted directly from the source
system but is staged explicitly outside the original
source system.
The data already has an existing structure (for
example, redo logs, archive logs or transportable
tablespaces) or was created by an extraction
routine.
Data Transformation

• It is the most complex and, in terms of production, the most costly part of the ETL process.
• They can range from simple data conversion to extreme data
scrubbing techniques.
• From an architectural perspective, transformations can be
performed in two ways.
 Multistage data transformation
 Pipelined data transformation
Data Transformation

MULTISTAGE TRANSFORMATION
[Figure: load into staging table -> STAGE_01 table -> validate customer keys -> STAGE_02 table -> convert source keys to warehouse keys -> STAGE_03 table -> insert into warehouse table -> TARGET table]
Data Transformation

PIPELINED TRANSFORMATION
[Figure: external table -> validate customer keys -> convert source keys to warehouse keys -> insert into warehouse table -> TARGET table]
Data Loading

• The load phase loads the data into the end target, usually the
data warehouse (DW). Depending on the requirements of the
organization, this process varies widely.

• The timing and scope to replace or append into the DW are


strategic design choices dependent on the time available and
the business needs.

• More complex systems can maintain a history and audit trail of


all changes to the data loaded in the DW.
References
• https://www.etltools.org/
