
What is a factless fact table? Where have you used it in your project?

At the schema level, when a fact table records only the count of occurrences/events, with nothing being aggregated as a measure, we call it a factless fact table: a fact table that has no measures, only events/occurrences.

For example: the number of policies that have been closed this month.

Generally, we use a factless fact table when we want to capture events only at the information level, not at the calculation level; it is just information about events that happened over a period.
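As a sketch, the policy example above can be modelled in plain SQL (run here through Python's sqlite3; the table and column names are invented for illustration):

```python
import sqlite3

# A factless fact table records that an event happened (a policy was closed),
# with foreign keys to dimensions but no numeric measure column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE policy_dim (policy_id INTEGER PRIMARY KEY, policy_type TEXT);
    CREATE TABLE time_dim   (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    -- Factless fact table: only foreign keys, no measures.
    CREATE TABLE policy_closed_fact (
        policy_id INTEGER REFERENCES policy_dim(policy_id),
        time_id   INTEGER REFERENCES time_dim(time_id),
        PRIMARY KEY (policy_id, time_id)
    );
""")
conn.execute("INSERT INTO policy_dim VALUES (1, 'Auto'), (2, 'Home'), (3, 'Auto')")
conn.execute("INSERT INTO time_dim VALUES (200501, 2005, 1)")
conn.executemany("INSERT INTO policy_closed_fact VALUES (?, ?)",
                 [(1, 200501), (2, 200501), (3, 200501)])

# "Number of policies closed this month" is just a COUNT of event rows.
closed_this_month = conn.execute(
    "SELECT COUNT(*) FROM policy_closed_fact WHERE time_id = 200501"
).fetchone()[0]
print(closed_this_month)  # 3
```

The only "measure" the table yields is the count of its own rows, which is exactly what makes it factless.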

Fact Table
The centralized table in a star schema is called the FACT table. A fact table typically has
two types of columns: those that contain facts and those that are foreign keys to
dimension tables. The primary key of a fact table is usually a composite key that is made
up of all of its foreign keys.

In the example figure 1.6, "Sales Dollar" is a fact (measure) and it can be added across several
dimensions. Fact tables store different types of measures: additive, non-additive and
semi-additive.

Measure Types
 Additive - Measures that can be added across all dimensions.
 Non Additive - Measures that cannot be added across all dimensions.
 Semi Additive - Measures that can be added across a few dimensions but not across the others.

A fact table might contain either detail level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often instead called summary tables).
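A plain-Python illustration of the measure types: sales dollars (additive) can be summed over every dimension, while an account balance, the classic semi-additive example (an assumption, not named in the text), can be summed across accounts for one period but not across time periods. All figures are made up.

```python
sales = [  # (product, month, sales_dollar)
    ("P1", "Jan", 100), ("P1", "Feb", 150),
    ("P2", "Jan", 200), ("P2", "Feb", 250),
]
total_sales = sum(amount for _, _, amount in sales)   # valid across all dimensions

balances = [  # (account, month, balance)
    ("A1", "Jan", 1000), ("A1", "Feb", 1200),
    ("A2", "Jan", 500),  ("A2", "Feb", 700),
]
# Valid: sum across accounts for a single month.
feb_total_balance = sum(b for _, month, b in balances if month == "Feb")
# Not valid to sum across months; take each account's latest value instead.
latest = {}
for account, month, balance in balances:              # rows are in month order
    latest[account] = balance
a1_current_balance = latest["A1"]
print(total_sales, feb_total_balance, a1_current_balance)  # 700 1900 1200
```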

In the real world, it is possible to have a fact table that contains no measures or facts.
These tables are called Factless Fact tables.

Steps in designing Fact Table

 Identify a business process for analysis(like sales).
 Identify measures or facts (sales dollar).
 Identify dimensions for facts(product dimension, location dimension, time
dimension, organization dimension).
 List the columns that describe each dimension (region name, branch name, region code, and so on).

iGATE Internal
 Determine the lowest level of summary in a fact table(sales dollar).

Example of a Fact Table with an Additive Measure in Star Schema: Figure 1.6

In the example figure 1.6, sales fact table is connected to dimensions location, product,
time and organization. Measure "Sales Dollar" in sales fact table can be added across all
dimensions independently or in a combined manner which is explained below.

 Sales Dollar value for a particular product

 Sales Dollar value for a product in a location
 Sales Dollar value for a product in a year within a location
 Sales Dollar value for a product in a year within a location sold or serviced by an employee

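The bullet points above can be sketched as SQL aggregations over a minimal star schema (illustrated with Python's sqlite3; the table layouts and data are invented, loosely following figure 1.6):

```python
import sqlite3

# Minimal star schema: a sales fact table with foreign keys to dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim  (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE location_dim (location_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE time_dim     (time_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE sales_fact (
        product_id   INTEGER REFERENCES product_dim(product_id),
        location_id  INTEGER REFERENCES location_dim(location_id),
        time_id      INTEGER REFERENCES time_dim(time_id),
        sales_dollar REAL,
        PRIMARY KEY (product_id, location_id, time_id)
    );
""")
conn.execute("INSERT INTO product_dim VALUES (1, 'Product1'), (2, 'Product2')")
conn.execute("INSERT INTO location_dim VALUES (10, 'Chennai'), (20, 'Mumbai')")
conn.execute("INSERT INTO time_dim VALUES (2004, 2004), (2005, 2005)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
    (1, 10, 2004, 100.0), (1, 20, 2004, 200.0),
    (1, 10, 2005, 150.0), (2, 10, 2004, 300.0),
])

# "Sales Dollar value for a particular product" (summed over the other dims).
p1_sales = conn.execute("""
    SELECT SUM(f.sales_dollar)
    FROM sales_fact f JOIN product_dim p ON f.product_id = p.product_id
    WHERE p.product_name = 'Product1'
""").fetchone()[0]

# "Sales Dollar value for a product in a year within a location".
p1_2004_chennai = conn.execute("""
    SELECT SUM(f.sales_dollar)
    FROM sales_fact f
    JOIN product_dim  p ON f.product_id  = p.product_id
    JOIN location_dim l ON f.location_id = l.location_id
    JOIN time_dim     t ON f.time_id     = t.time_id
    WHERE p.product_name = 'Product1' AND l.city = 'Chennai' AND t.year = 2004
""").fetchone()[0]
print(p1_sales, p1_2004_chennai)  # 450.0 100.0
```

Because "Sales Dollar" is additive, the same SUM works no matter which combination of dimensions is used to slice it.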

In Snowflake schema, the example diagram shown below has 4 dimension tables, 4
lookup tables and 1 fact table. The reason is that hierarchies(category, branch, state, and
month) are being broken out of the dimension tables(PRODUCT, ORGANIZATION,
LOCATION, and TIME) respectively and shown separately. In OLAP, this Snowflake
schema approach increases the number of joins, resulting in poorer performance when
retrieving data. A few organizations normalize the dimension tables to save space;
however, since dimension tables hold comparatively little data, the Snowflake schema
approach is usually avoided.
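The extra join a snowflake schema introduces can be sketched as follows, assuming the product hierarchy (category) has been broken out into its own lookup table (names and data are illustrative; run with Python's sqlite3):

```python
import sqlite3

# Snowflake layout: category is normalized out of the product dimension,
# so category-level queries need one more join than in a star schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category_lookup (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE product_dim (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT,
        category_id  INTEGER REFERENCES category_lookup(category_id)
    );
    CREATE TABLE sales_fact (product_id INTEGER, sales_dollar REAL);
""")
conn.execute("INSERT INTO category_lookup VALUES (1, 'Electronics')")
conn.execute("INSERT INTO product_dim VALUES (1, 'Product1', 1), (2, 'Product2', 1)")
conn.execute("INSERT INTO sales_fact VALUES (1, 100.0), (2, 200.0)")

# Category-level sales: fact -> product_dim -> category_lookup (two joins).
electronics_sales = conn.execute("""
    SELECT SUM(f.sales_dollar)
    FROM sales_fact f
    JOIN product_dim     p ON f.product_id  = p.product_id
    JOIN category_lookup c ON p.category_id = c.category_id
    WHERE c.category_name = 'Electronics'
""").fetchone()[0]
print(electronics_sales)  # 300.0
```

In a star schema the category name would sit directly on product_dim, and the second join would disappear.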

Example of Snowflake Schema: Figure 1.7

Star schema

General Information
In general, an organization is started to earn money by selling a product or by providing
service for a product. An organization may be at one place or may have several branches
spread across different locations.

When we consider the example of an organization selling products throughout the world,
the four major dimensions are product, location, time and organization. Dimension
tables have been explained in detail under the section Dimensions. With this example, we
will try to provide a detailed explanation of the STAR SCHEMA.

What is Star Schema?

Star Schema is a relational database schema for representing multidimensional data. It is
the simplest form of data warehouse schema that contains one or more dimensions and
fact tables. It is called a star schema because the entity-relationship diagram between
dimensions and fact tables resembles a star where one fact table is connected to multiple
dimensions. The center of the star schema consists of a large fact table and it points

towards the dimension tables. The advantages of a star schema are easy slicing of data,
better query performance and easier understanding of the data.

Steps in designing Star Schema

 Identify a business process for analysis(like sales).
 Identify measures or facts (sales dollar).
 Identify dimensions for facts(product dimension, location dimension, time
dimension, organization dimension).
 List the columns that describe each dimension (region name, branch name, region code, and so on).
 Determine the lowest level of summary in a fact table(sales dollar).

Important aspects of Star Schema & Snow Flake Schema

 In a star schema, every dimension will have a primary key.
 In a star schema, a dimension table will not have any parent table.
 Whereas in a snowflake schema, a dimension table will have one or more parent tables.
 Hierarchies for the dimensions are stored in the dimension table itself in a star schema.
 Whereas hierarchies are broken into separate tables in a snowflake schema. These
hierarchies help to drill down the data from the topmost level to the lowermost level.


Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can
be used to define data aggregation; for example, in a time dimension, a hierarchy might
be used to aggregate data from the Month level to the Quarter level, from the Quarter
level to the Year level. A hierarchy can also be used to define a navigational drill path,
regardless of whether the levels in the hierarchy represent aggregated totals or not.

Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that
represents data at the Month, Quarter, and Year levels.
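The month-to-quarter-to-year aggregation described above can be sketched in a few lines; the monthly figures and the quarter mapping are illustrative only.

```python
# Roll monthly figures up a time hierarchy: Month -> Quarter -> Year.
monthly_sales = {1: 100, 2: 110, 3: 120, 4: 130, 5: 140, 6: 150,
                 7: 160, 8: 170, 9: 180, 10: 190, 11: 200, 12: 210}

def quarter_of(month):
    # Months 1-3 -> Q1, 4-6 -> Q2, and so on.
    return (month - 1) // 3 + 1

quarterly_sales = {}
for month, amount in monthly_sales.items():        # Month level -> Quarter level
    q = quarter_of(month)
    quarterly_sales[q] = quarterly_sales.get(q, 0) + amount

yearly_sales = sum(quarterly_sales.values())       # Quarter level -> Year level
print(quarterly_sales[1], yearly_sales)  # 330 1860
```

The same drill path works downward for navigation: a report at the Year level can expand into its quarters, and each quarter into its months.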

Fact Table
A table in a star schema that contains facts and is connected to dimensions. A fact table
typically has two types of columns: those that contain facts and those that are foreign
keys to dimension tables. The primary key of a fact table is usually a composite key that
is made up of all of its foreign keys.

A fact table might contain either detail level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often instead called summary tables). A fact table
usually contains facts with the same level of aggregation.

Example of Star Schema: Figure 1.6

In the example figure 1.6, sales fact table is connected to dimensions location, product,
time and organization. It shows that data can be sliced across all dimensions and again it
is possible for the data to be aggregated across multiple dimensions. "Sales Dollar" in
sales fact table can be calculated across all dimensions independently or in a combined
manner which is explained below.

 Sales Dollar value for a particular product

 Sales Dollar value for a product in a location
 Sales Dollar value for a product in a year within a location
 Sales Dollar value for a product in a year within a location sold or serviced by an employee

Data Warehouse & Data Mart
A data warehouse is a relational/multidimensional database that is designed for query and
analysis rather than transaction processing. A data warehouse usually contains historical
data that is derived from transaction data. It separates analysis workload from transaction
workload and enables a business to consolidate data from several sources.

In addition to a relational/multidimensional database, a data warehouse environment often
consists of an ETL solution, an OLAP engine, client analysis tools, and other applications
that manage the process of gathering data and delivering it to business users.

There are three types of data warehouses:

1. Enterprise Data Warehouse - An enterprise data warehouse provides a central
database for decision support throughout the enterprise.
2. ODS (Operational Data Store) - This has a broad enterprise-wide scope, but unlike
the real enterprise data warehouse, data is refreshed in near real time and used for routine
business activity.

3. Data Mart - Datamart is a subset of data warehouse and it supports a particular region,
business unit or business function.

Data warehouses and data marts are built on dimensional data modeling where fact tables
are connected with dimension tables. This is most useful for users to access data since a
database can be visualized as a cube of several dimensions. A data warehouse provides an
opportunity for slicing and dicing that cube along each of its dimensions.

Data Mart: A data mart is a subset of data warehouse that is designed for a particular
line of business, such as sales, marketing, or finance. In a dependent data mart, data can
be derived from an enterprise-wide data warehouse. In an independent data mart, data can
be collected directly from sources.

Figure 1.12 : Data Warehouse and Datamarts


Slowly Changing Dimensions

Dimensions that change over time are called Slowly Changing Dimensions. For instance,
a product price changes over time; People change their names for some reason; Country
and State names may change over time. These are a few examples of Slowly Changing
Dimensions since some changes are happening to them over a period of time.

Slowly Changing Dimensions are often categorized into three types, namely Type 1,
Type 2 and Type 3. The following section deals with how to capture and handle these
changes over time.

The "Product" table mentioned below contains a product named Product1, with Product
ID as the primary key. In the year 2004, the price of Product1 was $150 and, over time,
Product1's price changed from $150 to $350. With this information, let us explain
the three types of Slowly Changing Dimensions.

Product Price in 2004:

Product ID(PK) Year Product Name Product Price

1 2004 Product1 $150

Type 1: Overwriting the old values.

In the year 2005, if the price of the product changes to $250, then the old values of the
columns "Year" and "Product Price" have to be updated and replaced with the new
values. In this Type 1, there is no way to find out the old value of the product "Product1"
in year 2004 since the table now contains only the new price and year information.


Product ID(PK) Year Product Name Product Price

1 2005 Product1 $250
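A Type 1 change can be sketched as a plain UPDATE (shown via Python's sqlite3; the table follows the Product example above):

```python
import sqlite3

# SCD Type 1: the price change simply overwrites the old row,
# so the 2004 value is lost.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE product (
    product_id INTEGER PRIMARY KEY, year INTEGER,
    product_name TEXT, product_price REAL)""")
conn.execute("INSERT INTO product VALUES (1, 2004, 'Product1', 150.0)")

# Type 1: overwrite in place.
conn.execute("UPDATE product SET year = 2005, product_price = 250.0 WHERE product_id = 1")

rows = conn.execute("SELECT year, product_price FROM product").fetchall()
print(rows)  # [(2005, 250.0)] -- only the new values survive
```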

Type 2: Creating an additional record.

In Type 2, the old values are not replaced; instead, a new row containing the new
values is added to the product table. So at any point in time, the difference between
the old and new values can be retrieved and easily compared. This is very
useful for reporting purposes.


Product ID(PK) Year Product Name Product Price

1 2004 Product1 $150
1 2005 Product1 $250

The problem with the above-mentioned data structure is that "Product ID" cannot store
duplicate values for Product1, since "Product ID" is the primary key. Also, the current
data structure doesn't clearly specify the effective and expiry dates of Product1, i.e.
when the change to its price happened. So, it would be better to change the current data
structure to overcome the primary key violation.


Product ID(PK)  Effective DateTime(PK)  Product Name  Product Price  Expiry DateTime
1               01-01-2004 12.00AM      Product1      $150           12-31-2004 11.59PM
1               01-01-2005 12.00AM      Product1      $250

In the changed Product table structure, "Product ID" and "Effective DateTime" form a
composite primary key, so there is no violation of the primary key constraint. The
addition of the new columns "Effective DateTime" and "Expiry DateTime" provides
information about the product's effective and expiry dates, which adds more clarity and
enhances the scope of this table. The Type 2 approach may need additional space in the
database, since an additional row has to be stored for every changed record. Since
dimensions are not that big in the real world, the additional space is negligible.
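The Type 2 pattern can be sketched as "close the current row, insert the new version" (via Python's sqlite3; the sentinel expiry value and datetime format are assumptions for illustration):

```python
import sqlite3

# SCD Type 2: the composite key is (product_id, effective_datetime).
# A far-future expiry datetime marks the currently open row.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE product (
    product_id INTEGER, effective_datetime TEXT,
    product_name TEXT, product_price REAL, expiry_datetime TEXT,
    PRIMARY KEY (product_id, effective_datetime))""")
conn.execute("INSERT INTO product VALUES "
             "(1, '2004-01-01 00:00', 'Product1', 150.0, '9999-12-31 23:59')")

def change_price(conn, product_id, name, new_price, effective_dt, prev_expiry_dt):
    # Close the open row, then add the new version.
    conn.execute("""UPDATE product SET expiry_datetime = ?
                    WHERE product_id = ? AND expiry_datetime = '9999-12-31 23:59'""",
                 (prev_expiry_dt, product_id))
    conn.execute("INSERT INTO product VALUES (?, ?, ?, ?, '9999-12-31 23:59')",
                 (product_id, effective_dt, name, new_price))

change_price(conn, 1, 'Product1', 250.0, '2005-01-01 00:00', '2004-12-31 23:59')

history = conn.execute(
    "SELECT effective_datetime, product_price FROM product ORDER BY effective_datetime"
).fetchall()
print(history)  # both the 2004 and 2005 rows are kept
```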

Type 3: Creating new fields.

In Type 3, only the latest change to the values can be seen. The example below
illustrates how to add new columns and keep track of the changes. From it, we
can see the current price and the previous price of the product, Product1.


Product ID(PK)  Current Year  Product Name  Current Product Price  Old Product Price  Old Year
1               2005          Product1      $250                   $150               2004

The problem with the Type 3 approach is that, over the years, if the product price changes
continuously, the complete history is not stored; only the latest change is kept. For
example, if in year 2006 Product1's price changes to $350, we would no longer be able to
see the 2004 price, since the old-value columns would have been overwritten with the
2005 product information.


Product ID(PK)  Current Year  Product Name  Current Product Price  Old Product Price  Old Year
1               2006          Product1      $350                   $250               2005
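The Type 3 column-shuffle can be sketched as follows (via Python's sqlite3; column names are illustrative). Note how the second change pushes the 2004 price out of the table entirely, matching the limitation described above:

```python
import sqlite3

# SCD Type 3: extra "old" columns keep only the previous value.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE product (
    product_id INTEGER PRIMARY KEY, current_year INTEGER, product_name TEXT,
    current_price REAL, old_price REAL, old_year INTEGER)""")
conn.execute("INSERT INTO product VALUES (1, 2004, 'Product1', 150.0, NULL, NULL)")

def change_price(conn, product_id, new_price, new_year):
    # Shift current values into the "old" columns, then overwrite current.
    conn.execute("""UPDATE product
                    SET old_price = current_price, old_year = current_year,
                        current_price = ?, current_year = ?
                    WHERE product_id = ?""", (new_price, new_year, product_id))

change_price(conn, 1, 250.0, 2005)   # old columns now hold 2004 / $150
change_price(conn, 1, 350.0, 2006)   # the 2004 history is lost

row = conn.execute("SELECT current_year, current_price, old_price, old_year "
                   "FROM product").fetchone()
print(row)  # (2006, 350.0, 250.0, 2005)
```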


Time Dimension
In a relational data model, for normalization purposes, year lookup, quarter lookup,

month lookup, and week lookups are not merged as a single table. In a dimensional data
modeling(star schema), these tables would be merged as a single table called TIME
DIMENSION for performance and slicing data.

This dimension helps to find the sales done on a daily, weekly, monthly and yearly basis.
We can do trend analysis by comparing this year's sales with the previous year's, or this
week's sales with the previous week's.
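Generating such a merged time dimension can be sketched in a few lines of Python (column names follow the example table below; Sunday is treated as day 1 of the week, which matches the sample rows):

```python
from datetime import date, timedelta

# Generate merged time-dimension rows, one per calendar day, with the
# hierarchy columns (year, quarter, month, week) carried in each row.
def time_dimension(start, end):
    rows, current, dim_id = [], start, 1
    while current <= end:
        rows.append({
            "time_dim_id": dim_id,
            "year": current.year,
            "day_of_year": current.timetuple().tm_yday,
            "quarter": "Q%d" % ((current.month - 1) // 3 + 1),
            "month_no": current.month,
            "month_name": current.strftime("%B"),
            "day_of_week": current.isoweekday() % 7 + 1,  # Sunday = 1 ... Saturday = 7
            "cal_date": current.isoformat(),
        })
        current += timedelta(days=1)
        dim_id += 1
    return rows

rows = time_dimension(date(2004, 1, 1), date(2004, 2, 1))
# 1/1/2004 was a Thursday, i.e. day-of-week 5 with Sunday = 1.
print(rows[0]["quarter"], rows[0]["day_of_week"], rows[-1]["cal_date"])
```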

Example of Time Dimension: Figure 1.11

Year Lookup

Year Id Year Number DateTimeStamp

1 2004 1/1/2005 11:23:31 AM
2 2005 1/1/2005 11:23:31 AM

Quarter Lookup

Quarter Number Quarter Name DateTimeStamp

1 Q1 1/1/2005 11:23:31 AM
2 Q2 1/1/2005 11:23:31 AM
3 Q3 1/1/2005 11:23:31 AM
4 Q4 1/1/2005 11:23:31 AM

Month Lookup

Month Number Month Name DateTimeStamp

1 January 1/1/2005 11:23:31 AM
2 February 1/1/2005 11:23:31 AM
3 March 1/1/2005 11:23:31 AM
4 April 1/1/2005 11:23:31 AM

5 May 1/1/2005 11:23:31 AM
6 June 1/1/2005 11:23:31 AM
7 July 1/1/2005 11:23:31 AM
8 August 1/1/2005 11:23:31 AM
9 September 1/1/2005 11:23:31 AM
10 October 1/1/2005 11:23:31 AM
11 November 1/1/2005 11:23:31 AM
12 December 1/1/2005 11:23:31 AM

Week Lookup

Week Number Day of Week DateTimeStamp

1 Sunday 1/1/2005 11:23:31 AM
1 Monday 1/1/2005 11:23:31 AM
1 Tuesday 1/1/2005 11:23:31 AM
1 Wednesday 1/1/2005 11:23:31 AM
1 Thursday 1/1/2005 11:23:31 AM
1 Friday 1/1/2005 11:23:31 AM
1 Saturday 1/1/2005 11:23:31 AM
2 Sunday 1/1/2005 11:23:31 AM
2 Monday 1/1/2005 11:23:31 AM
2 Tuesday 1/1/2005 11:23:31 AM
2 Wednesday 1/1/2005 11:23:31 AM
2 Thursday 1/1/2005 11:23:31 AM
2 Friday 1/1/2005 11:23:31 AM
2 Saturday 1/1/2005 11:23:31 AM

Time Dimension

Time Dim Id  Year  Day Of Year  Quarter No  Month No  Month Name  Month Day No  Week No  Day Of Week  Cal Date  DateTime Stamp
1            2004  1            Q1          1         January     1             1        5            1/1/2004  11:23:31
2            2004  32           Q1          2         February    1             5        1            2/1/2004  11:23:31
3            2005  1            Q1          1         January     1             1        7            1/1/2005  11:23:31
4            2005  32           Q1          2         February    1             5        3            2/1/2005  11:23:31

Organization Dimension
In a relational data model, for normalization purposes, corporate office lookup, region
lookup, branch lookup, and employee lookups are not merged as a single table. In a
dimensional data modeling(star schema), these tables would be merged as a single table
called ORGANIZATION DIMENSION for performance and slicing data.

This dimension helps us to find the products sold or serviced within the organization by
the employees. In any industry, we can calculate the sales on region basis, branch basis
and employee basis. Based on the performance, an organization can provide incentives to
employees and subsidies to the branches to increase further sales.

Example of Organization Dimension: Figure 1.10

Product Dimension
In a relational data model, for normalization purposes, product category lookup, product
sub-category lookup, product lookup, and product feature lookups are not merged
as a single table. In dimensional data modeling (star schema), these tables would be
merged as a single table called PRODUCT DIMENSION for performance and data-slicing
requirements.

Example of Product Dimension: Figure 1.9

Dimension Table
A dimension table is one that describes the business entities of an enterprise, represented
as hierarchical, categorical information such as time, departments, locations, and products.
Dimension tables are sometimes called lookup or reference tables.

Location Dimension
In relational data modeling, for normalization purposes, country lookup, state lookup,
county lookup, and city lookups are not merged as a single table. In a dimensional data
modeling(star schema), these tables would be merged as a single table called LOCATION
DIMENSION for performance and slicing data requirements. This location dimension
helps to compare the sales in one region with another region. We may see good sales
profit in one region and loss in another region. If it is a loss, the reasons for that may be a
new competitor in that area, or failure of our marketing strategy etc.

Example of Location Dimension: Figure 1.8

Relational Data Modeling is used in OLTP systems, which are transaction oriented, and
Dimensional Data Modeling is used in OLAP systems, which are analytical. In a
data warehouse environment, the staging area is designed on OLTP concepts, since data has
to be normalized, cleansed and profiled before being loaded into a data warehouse or data mart.
In an OLTP environment, lookups are stored as independent tables in detail, whereas these
independent tables are merged as a single dimension in an OLAP environment like a data warehouse.

Relational vs Dimensional

Relational Data Modeling                        Dimensional Data Modeling
Data is stored in RDBMS                         Data is stored in RDBMS or multidimensional databases
Tables are units of storage                     Cubes are units of storage
Data is normalized and used for OLTP;           Data is denormalized and used in data warehouses
optimized for OLTP processing                   and data marts; optimized for OLAP
Several tables and chains of relationships      Few tables; fact tables are connected to
among them                                      dimension tables
Volatile (several updates) and time variant     Non-volatile and time invariant
SQL is used to manipulate data                  MDX is used to manipulate data
Detailed level of transactional data            Summary of bulky transactional data (aggregates
                                                and measures) used in business decisions
Normal reports                                  User friendly, interactive, drag-and-drop
                                                multidimensional OLAP reports

Dimensional Data Modeling

Dimensional Data Modeling comprises one or more dimension tables and fact tables.
Good examples of dimensions are location, product, time, promotion, organization etc.
Dimension tables store records related to that particular dimension and no
facts(measures) are stored in these tables.

For example, a Product dimension table will store information about products (Product
Category, Product Sub-Category, Product and Product Features), and a location dimension
table will store information about locations (country, state, county, city, zip). A
fact (measure) table contains measures (sales gross value, total units sold) and dimension
columns. These dimension columns are actually foreign keys from the respective
dimension tables.

Example of Dimensional Data Model: Figure 1.6

In the example figure 1.6, sales fact table is connected to dimensions location, product,
time and organization. It shows that data can be sliced across all dimensions and again it
is possible for the data to be aggregated across multiple dimensions. "Sales Dollar" in
sales fact table can be calculated across all dimensions independently or in a combined
manner which is explained below.

 Sales Dollar value for a particular product

 Sales Dollar value for a product in a location
 Sales Dollar value for a product in a year within a location
 Sales Dollar value for a product in a year within a location sold or serviced by an employee

In dimensional data modeling, hierarchies for the dimensions are stored in the
dimension table itself. For example, the location dimension will have all of its
hierarchies from country, state and county down to city. There is no need for individual
hierarchical lookups like country lookup, state lookup, county lookup and city lookup to be
shown in the model.

Uses of Dimensional Data Modeling

Dimensional Data Modeling is used for calculating summarized data. For example, sales
data could be collected on a daily basis and then be aggregated to the week level, the
week data could be aggregated to the month level, and so on. The data can then be
referred to as aggregate data. Aggregation is synonymous with summarization, and
aggregate data is synonymous with summary data. The performance of dimensional data
modeling can be significantly improved when materialized views are used. A materialized
view is a pre-computed table comprising aggregated or joined data from fact and possibly
dimension tables; it is also known as a summary or aggregate table.
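The aggregate-table idea can be sketched with a summary table built once and then queried directly (SQLite, used here via Python, has no materialized views, so the pre-computation is done with CREATE TABLE ... AS SELECT; names and data are illustrative):

```python
import sqlite3

# Daily sales pre-aggregated to the month level -- the effect a
# materialized view / summary table gives you.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_date TEXT, sales_dollar REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", [
    ("2005-01-03", 100.0), ("2005-01-17", 150.0),
    ("2005-02-04", 200.0), ("2005-02-20", 250.0),
])

# Pre-compute the month-level aggregate once; queries then read the small table.
conn.execute("""
    CREATE TABLE sales_month_summary AS
    SELECT substr(sale_date, 1, 7) AS sale_month, SUM(sales_dollar) AS sales_dollar
    FROM sales_fact
    GROUP BY substr(sale_date, 1, 7)
""")
jan = conn.execute("SELECT sales_dollar FROM sales_month_summary "
                   "WHERE sale_month = '2005-01'").fetchone()[0]
print(jan)  # 250.0
```

In a database with real materialized views, the engine would keep the summary refreshed automatically; here it would have to be rebuilt after new loads.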

Logical vs Physical Data Modeling

Logical Data Model                    Physical Data Model
Represents business information and   Represents the physical implementation of
defines business rules                the model in a database
Entity                                Table
Attribute                             Column
Primary Key                           Primary Key Constraint
Alternate Key                         Unique Constraint or Unique Index
Inversion Key Entry                   Non-Unique Index
Rule                                  Check Constraint, Default Value
Relationship                          Foreign Key
Definition                            Comment

Data Modeler Role

Business Requirement Analysis:

» Interact with Business Analysts to get the functional requirements.
» Interact with end users and find out the reporting needs.
» Conduct interviews and brainstorming discussions with the project team to gather additional requirements.
» Gather accurate data by data analysis and functional analysis.

Development of data model:

» Create a standard abbreviation document for logical, physical and dimensional data models.
» Create logical, physical and dimensional data models(data warehouse data modelling).
» Document logical, physical and dimensional data models (data warehouse data modelling).
» Generate reports from data model.

» Review the data model with functional and technical team.

Creation of database:
» Create SQL code from the data model and coordinate with DBAs to create the database.
» Check that the data models and databases are in sync.

Support & Maintenance:
» Assist developers, ETL, BI team and end users to understand the data model.
» Maintain change log for each data model.

Steps to create a Data Model

These are general guidelines for creating a standard data model; in real time, a data
model may not be created in the same sequential manner as shown below. Based on the
enterprise’s requirements, some of the steps may be excluded, or others included in
addition to these.
Sometimes, data modeler may be asked to develop a data model based on the existing
database. In that situation, the data modeler has to reverse engineer the database and
create a data model.

1» Get Business requirements.

2» Create High Level Conceptual Data Model.
3» Create Logical Data Model.
4» Select target DBMS where data modeling tool creates the physical schema.
5» Create standard abbreviation document according to business standard.
6» Create domain.
7» Create Entity and add definitions.
8» Create attribute and add definitions.
9» Based on the analysis, try to create surrogate keys, super types and sub types.
10» Assign datatype to attribute. If a domain is already present then the attribute should
be attached to the domain.
11» Create primary or unique keys to attribute.
12» Create check constraint or default to attribute.
13» Create unique index or bitmap index to attribute.
14» Create foreign key relationship between entities.
15» Create Physical Data Model.
16» Add database properties to the physical data model.
17» Create SQL scripts from the physical data model and forward them to the DBA.
18» Maintain the logical and physical data models.
19» For each release (version of the data model), try to compare the present version with
the previous version of the data model. Similarly, try to compare the data model with the
database to find out the differences.
20» Create a change log document for differences between the current version and
previous version of the data model.



Informatica is a widely used ETL tool for extracting the source data and loading it into
the target after applying the required transformation. In the following section, we will try
to explain the usage of Informatica in the Data Warehouse environment with an example.
Here we are not going into the details of data warehouse design and this tutorial simply
provides the overview about how INFORMATICA can be used as an ETL tool.

Example - Stock Trading:

Note: The exchanges/companies mentioned here are for illustrative purposes only.

Bombay Stock Exchange (BSE) and National Stock Exchange (NSE) are two major stock
exchanges in India, on which the shares of ABC Corporation and XYZ Private Limited are
traded Monday through Friday, except on holidays. Assume that a software
company "KLXY Limited" has taken up the project of integrating the data between the two
exchanges, BSE and NSE.

ETL Process - Roles & Responsibilities:

In order to complete this task of integrating the Raw data received from NSE & BSE,
KLXY Limited allots responsibilities to Data Modelers, DBAs and ETL Developers.
During this entire ETL process, many IT professionals may be involved, but we are
highlighting the roles of only these three for easier understanding.

 Data Modelers analyze the data from these two sources(Record Layout 1 &
Record Layout 2), design Data Models, and then generate scripts to create
necessary tables and the corresponding records.
 DBAs create the databases and tables based on the scripts generated by the data modelers.
 ETL developers map the extracted data from source systems and load it to target
systems after applying the required transformations.

Overall Process:

The complete process of data transformation from external sources to our target data
warehouse is explained in the following sections. Each section is explained in detail below:

 Data from the external sources (source 1 - .CSV (comma separated) file, source 2 -
Oracle table)

 Source(s) table layout details
 Look up table details
 Target table layout details
 Defining Source table and target table in Informatica
 Implementing extraction mapping in Informatica (Mapping Designer)
 Implementing transformation and loading mapping in Informatica
 Workflow creation in Informatica (Workflow Manager)
 Verifying records through Informatica (Workflow Monitor)

 http://learnbi.com/informatica2.htm
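As a language-neutral sketch of the extract-transform-load flow the sections above describe (written in plain Python rather than Informatica; the file layout and field names are invented):

```python
import csv
import io
import sqlite3

# A tiny ETL pipeline: read trade records from a CSV source, standardize the
# fields, and load them into a target table.
csv_source = io.StringIO(
    "symbol,exchange,price\n"
    "abc,BSE,102.50\n"
    "xyz,NSE,98.75\n"
)

# Extract
records = list(csv.DictReader(csv_source))

# Transform: upper-case the symbol and convert the price to a number.
transformed = [(r["symbol"].upper(), r["exchange"], float(r["price"]))
               for r in records]

# Load
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, exchange TEXT, price REAL)")
conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", transformed)

loaded = conn.execute("SELECT symbol, price FROM trades ORDER BY symbol").fetchall()
print(loaded)  # [('ABC', 102.5), ('XYZ', 98.75)]
```

In Informatica, the same three stages correspond to the source definition, the transformation logic in the Mapping Designer, and the target load run from the Workflow Manager.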

ETL Testing:
Testing is an important phase in the project lifecycle. A structured, well-defined testing
methodology involving comprehensive unit testing and system testing not only ensures a
smooth transition to the production environment but also delivers a system without defects.

The testing phase can be broadly classified into the following categories:

 Integration Testing
 System Testing
 Regression Testing
 Performance Testing
 Operational Qualification

Test Strategy:

A test strategy is an outline that describes the test plan. It is created to inform the project
team of the objective and high-level scope of the testing process. This includes the testing
objective, methods of testing, resources, estimated timelines, environment, etc.

The test strategy is created based on the high-level design document. A test strategy needs
to be created for each testing component; based on this strategy, the testing process is
detailed out in the test plan.

Test Planning:

Test planning is key to successfully implementing the testing of a system. The
deliverable is the actual "Test Plan". A software project test plan is a document that
describes the Purpose, System Overview, Approach to Testing, Test Planning, Defect
Tracking, Test Environment, Test Prerequisites and References.

A key prerequisite for preparing a successful test plan is having approved (functional and
non-functional) requirements. Without frozen requirements and specifications, the test
plan will lack validation for the project's testing efforts.

The process of preparing a test plan is a useful way to work out how the testing of a
particular system can be carried out within the given timeline, provided the test plan
is thorough enough.

The test plan outlines and defines the strategy and approach taken to perform end-to-end
testing on a given project. The test plan describes the tasks, schedules, resources, and
tools for integrating and testing the software application. It is intended for use by project
personnel in understanding and carrying out prescribed test activities and in managing
these activities through successful completion.

The test plan objectives are as follows:

 To define a testing approach, scope, out of scope and methodology that

encompasses integration testing, system testing, performance testing and
regression testing in one plan for the business and project team.

 To verify that the functional and non-functional requirements are met.
 To coordinate resources and environments into an integrated schedule.
 To provide a plan that outlines the contents of detailed test cases scenarios for
each of the four phases of testing.
 To determine a process for communicating issues resulting from the test phase.

The contents of a typical test plan consist of the following:

 An introduction that includes a purpose, Definition & Acronym, Assumptions &

Dependencies, In Scope, Out of Scope, Roles & Responsibilities and Contacts. This
information is obtained from the requirements specification.
 System Overview will explain about the background and the system description.
 A test approach for all testing levels includes test Objectives for each level, Test
responsibilities, Levels of testing, various testing, Test coverage, Testing tools,
Test data and Test stop criteria.
 Test planning specifies Test schedule, Documentation deliverables, Test
communication and Critical and High risk functions.

The test plan, thus, summarizes and consolidates information that is necessary for the
efficient and effective conduct of testing. Design Specification, Requirement Document
and Project plan supporting the finalization of testing are located in separate documents
and are referenced in the test plan.

Test Estimation:

Effective software project estimation is one of the most challenging and important
activities in testing. However, it is an essential part of proper project
planning and control. Under-estimating a project leads to under-staffing it, running the
risk of low-quality deliverables and resulting in a loss of credibility as deadlines are
missed. So it is imperative to do a proper estimation during the planning stage.

The basic steps in estimation include:

 Estimating the size of the system to be tested.

 Estimating the effort in person-hours (one person-day = the number of working hours
in a day, i.e. 8 hours).

After receiving the requirements, the tester analyses the mappings that were
created/modified and studies the changes made. Based on the impact analysis, the
tester comes to know how much time is needed for the whole testing process,
which consists of mapping analysis, test case preparation, test execution, defect reporting,
regression testing and final documentation. This calculated time is entered in the
estimation time sheet.

Integration Testing:

Integration testing verifies the required functionality of a mapping (single ETL /
single session) in the environment specific to the testing team (test environment). This
testing should ensure that the correct number of rows (validated records) is transferred to
the target from the source.
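The row-count reconciliation described above can be sketched as follows (the records and the validation rule are invented for illustration):

```python
# After a load, verify that the number of validated source records
# equals the number of rows that reached the target.
source_records = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": None},    # fails validation, should be rejected
    {"id": 3, "amount": 300.0},
]

validated = [r for r in source_records if r["amount"] is not None]
target_rows = list(validated)     # stand-in for the loaded target table
rejected = len(source_records) - len(validated)

# The reconciliation the integration test performs:
assert len(target_rows) == len(validated), "target row count mismatch"
print(len(validated), rejected)  # 2 1
```

In practice the two counts would come from a query against the source extract and a SELECT COUNT(*) against the target table, with rejected records accounted for separately.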

Integration testing is also used to verify initialization and incremental mappings
(sessions) functionality along with the pre-session and post-session scripts for
dependencies and the usage/consumption of relative indicator files to control
dependencies across multiple work streams (modules). During integration testing, error-
handling processes, proper functionality of mapping variables and the appropriate
business requirements can be validated.
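The source-to-target row-count check described above can be automated. A minimal sketch, with sqlite3 standing in for the real source and target databases (the table and column names, and the NULL-rejection rule, are invented for illustration):

```python
import sqlite3

# Stand-in source and target; in practice these would be two connections.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_orders (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE tgt_orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO src_orders VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, None)])
# Suppose the mapping rejects the row with a NULL amount (only
# validated records reach the target).
conn.executemany("INSERT INTO tgt_orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

def reconcile(conn, src, tgt, validity_filter="1=1"):
    """Compare validated source rows against loaded target rows."""
    src_count = conn.execute(
        f"SELECT COUNT(*) FROM {src} WHERE {validity_filter}").fetchone()[0]
    tgt_count = conn.execute(f"SELECT COUNT(*) FROM {tgt}").fetchone()[0]
    return src_count, tgt_count, src_count == tgt_count

src_count, tgt_count, ok = reconcile(conn, "src_orders", "tgt_orders",
                                     "amount IS NOT NULL")
print(src_count, tgt_count, ok)  # 2 2 True
```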


 Access to the required folders on the network.

 Implementation Checklist for move from development to test.
 All unit testing completed and summarized.
 Data available in the test environment.
 Migration to the test environment from the development environment.

System Testing:

This environment integrates the various components and runs as a single unit. This
should include sequences of events that enable those different components to run as a
single unit and validate the data flow.

 Verify all the required functionality in the validation environment.

 Run end-to-end system test.
 Record initialization and incremental load statistics.
 Measure performance of the entire system and mitigate any issues found.
 Verify error-handling processes are working as designed.


 Finalized Implementation Checklist.

 All integration testing complete.
 Migration from the Test environment to the validation environment, as applicable.
 Production configuration and data available.

Regression Testing:

Regression Testing is performed after the developer fixes a defect reported. This testing is
to verify whether the identified defects are fixed and fixing of these defects does not
introduce any other new defects in the system / application. This testing will also be
performed when a Change Request is implemented on an existing production system.
After the Change Request (CR) is approved, the testing team takes the impact analysis as
input for designing the test cases for the CR.


 Finalized Implementation Checklist.

 All integration testing complete.

Performance Testing:

To determine the system performance under a particular workload / Service Level
Agreement (SLA). Ensures the system meets the performance criteria and can detect
bottlenecks.

Types of Performance Testing are Load, Stress, Volume etc.

PUSH DOWN Optimization

I'm going to work on an ETL project - where the source is Oracle, the target is Teradata
and the ETL tool to be used is Informatica.

There are two levels - one is loading into staging (staging is also Teradata) and the second
is loading into the target tables.
I query the Oracle source tables and load into the staging area.
Which of the approaches is good -
1. Create a one-to-one mapping to do this, or
2. Use any of the load tools offered by Teradata - like MLoad, TPump, etc. - in Informatica
and do the load.

Please tell me the pros and cons of these two approaches.

I've been told to use the first method(one to one mappings).

Please advise on the second level as well (from staging to target) - whether to use
one-to-one mappings or Teradata tools.

I'm really afraid because there is an automatic primary index getting created in Teradata
tables and this leads to rejection of records in some cases.

Ans: For phase 1 you can either use an Informatica mapping with Teradata loader
connections (like FastLoad, TPump, MultiLoad) or directly use Teradata loader scripts to
load data into the staging table. It is preferable to use Informatica with loader connections,
which will be faster for development, and it will only be a one-to-one mapping.

For phase 2, stage to target, you must go for Informatica and use joins in case there
are any. You'll have to write SQL override queries which would replace the SCD

transformations in Informatica. The query must distinguish the new records and update
records using a flag and do the insert or update operation according to the flag value.
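The flag idea in this answer can be sketched with a single query. Here sqlite3 stands in for Teradata, and the staging/target table names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_cust (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE tgt_cust (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO stg_cust VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob"), (3, "Cy")])
conn.executemany("INSERT INTO tgt_cust VALUES (?, ?)", [(1, "Ann"), (2, "Bo")])

# One LEFT JOIN tags every staging row: INSERT if the key is new,
# UPDATE if the key exists but an attribute changed.
rows = conn.execute("""
    SELECT s.id,
           CASE WHEN t.id IS NULL THEN 'INSERT'
                WHEN s.name <> t.name THEN 'UPDATE'
                ELSE 'NOCHANGE' END AS flag
    FROM stg_cust s LEFT JOIN tgt_cust t ON s.id = t.id
    ORDER BY s.id
""").fetchall()
print(rows)  # [(1, 'NOCHANGE'), (2, 'UPDATE'), (3, 'INSERT')]
```

The loader then applies inserts and updates according to the flag, which is what replaces the Update Strategy transformation in this approach.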

There is an option in Informatica (pushdown optimization), which must be used when
you go for complex override queries.

Let me know if you have more doubts.

As for the primary index, it is the lifeline of a Teradata table. You need to provide the
name of the primary index while creating a table, else it automatically takes the first
column as the primary index. Using the PI the data gets loaded into TD, and
retrieval is also done using the PI.

Phase 1:

To make things clearer - you have suggested I use a loader connection instead of a
relational connection.

As indicated by you, will the loader connection be faster than the relational
connection?
I hope this would help me in the one-time load/history load as the historical data is
provided in the form of oracle dumps.

But in the incremental loads - the source data is being pulled from the Oracle database's
views (obviously there will be performance issues - but the client's requirement is this
way - can't help it) - do you suggest that here as well the loader connection will be faster
than the relational connection?

Phase 2:
I'm not very clear on the explanation given by you here.

As highlighted by you,I'll be joining staging tables to load into targets.

I'll be using a lookup transformation to check whether the record is present or not.

Accordingly, I'll be inserting into or updating the target tables (by tagging them
accordingly) using the Update Strategy transformation.

Of course I'll be maintaining history by giving an "end-date" to the already existing
record when an update is done in the database, and inserting a new one with sysdate as
the "start-date".

Basically I'm using the Lookup and Update Strategy transformations to achieve my SCD
concept here.
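The lookup-and-update-strategy flow described above amounts to SCD Type 2. A small in-memory sketch of that logic (the column layout and the far-future '9999-12-31' open end-date are assumptions for illustration):

```python
from datetime import date

# Target dimension rows: (key, value, start_date, end_date); current
# ("open") rows carry a far-future end date.
OPEN_END = date(9999, 12, 31)
target = [(1, "Gold", date(2020, 1, 1), OPEN_END)]

def apply_scd2(target, key, value, today):
    """End-date the current row if the value changed, then insert a new row."""
    out = []
    found = False
    for k, v, start, end in target:
        if k == key and end == OPEN_END:
            found = True
            if v == value:                       # unchanged: keep as-is
                out.append((k, v, start, end))
                continue
            out.append((k, v, start, today))     # close the old version
            out.append((k, value, today, OPEN_END))  # open the new one
        else:
            out.append((k, v, start, end))
    if not found:                                # brand-new key: plain insert
        out.append((key, value, today, OPEN_END))
    return out

target = apply_scd2(target, 1, "Platinum", date(2024, 6, 1))
print(target)
```

In the mapping, the Lookup does the "is it there / did it change" check and the Update Strategy does the end-dating and insert.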

Is this the concept you have explained here?

New Question:

Please explain me what is this "pushdown optimization" in informatica?

The concept is the same, but the implementation process is different. You can do it in the
way you have described using Update Strategy and Lookups, or the way I have mentioned in
the previous post. The SCD can be implemented with override queries for faster development
instead of going for transformations like Lookup and Update Strategy. When using queries for
SCD, joins will replace the Lookups, and the Update Strategy will be replaced by a flag column
which says whether the record is going to be an insert/update, provided the granularity of
record change is 1 day.

About extracting from Oracle views to Teradata - you'll have to go for relational
connections only. Loaders can only load data from flat files to Teradata.

Informatica pushdown optimization is a new concept introduced in the Informatica 8
version series.

Pushdown optimization is a concept where you try to make most of the calculations
on the database side rather than doing them at the Informatica level. For example, if you
need some kind of aggregation, you can push those computations to be done at the database
end. It is basically to utilize the power of the database, and you can do it at the source side
as well as the target side.
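The aggregation example can be illustrated outside Informatica: instead of pulling every row and summing in the tool, push the GROUP BY down as SQL. Here sqlite3 stands in for the source database, and the table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EAST", 10.0), ("EAST", 5.0), ("WEST", 7.0)])

# Without pushdown: fetch all rows, aggregate inside the tool.
totals_tool = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals_tool[region] = totals_tool.get(region, 0.0) + amount

# With pushdown: the database does the aggregation; only the
# summary rows come back over the wire.
totals_db = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

print(totals_tool == totals_db)  # True - same result, far fewer rows moved
```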

Take a look at pdf from wisdomforce page: Best Practices for high-speed data transfer
from Oracle to Teradata

 http://dwhtechstudy.com/Docs/informatica/New_Features_Enhancements_PC8.pdf

PUSH DOWN Optimization

We can push transformation logic to the source or target database when the Lookup
transformation contains a lookup override.
To perform source-side, target-side, or full pushdown optimization for a session
containing lookup overrides, configure the session for pushdown optimization and select
a pushdown option that allows us to create views. We can also use full pushdown
optimization when we use the target loading option to treat source rows as delete.





New features of Informatica 8

The following summarizes changes to PowerCenter domains that might affect
upgrades to PowerCenter 8

*Object Permissions*

Effective in version 8.1.1, you can assign object permissions to users when
you add a user account, set user permissions, or edit an object.

*Gateway and Logging Configuration*

Effective in version 8.1, you configure the gateway node and location for
log event files on the Properties tab for the domain. Log events describe
operations and error messages for core and application services, workflows
and sessions.

Log Manager runs on the master gateway node in a domain.

We can configure the maximum size of logs for automatic purge in megabytes.

PowerCenter 8.1 also provides enhancements to the Log Viewer and log events.
*Unicode compliance*

Effective in version 8.1, all fields in the Administration Console accept
Unicode characters. One can choose the UTF-8 character set as the repository
code page to store multiple languages.

*Memory and CPU Resource Usage*

You may notice an increase in memory and CPU resource usage on machines
running PowerCenter Services.

*Domain Configuration Database*

PowerCenter stores the domain configuration in a database.

*License Usage*

Effective in version 8.0, the Service Manager registers license information.

*High Availability* (HA)

High availability is the PowerCenter option that eliminates a single point
of failure in the PowerCenter environment and provides minimal service
interruption in the event of failure. High availability provides the
following functionality:

*Resilience.* Resilience is the ability for services to tolerate transient
failures, such as loss of connectivity to the database or network failures.

*Failover.* Failover is the migration of a service process or task to
another node when the node running the service process becomes unavailable.

*Recovery.* Recovery is the automatic or manual completion of tasks after an
application service is interrupted.
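Resilience to transient failures is essentially retry-until-timeout behavior. A generic sketch of the idea (the resilience-timeout name mirrors PowerCenter's setting; the failing service here is simulated, not a real PowerCenter call):

```python
import time

def call_with_resilience(operation, resilience_timeout, retry_delay=0.01):
    """Retry a failing operation until it succeeds or the timeout elapses."""
    deadline = time.monotonic() + resilience_timeout
    while True:
        try:
            return operation()
        except ConnectionError:
            if time.monotonic() >= deadline:
                raise            # transient window exceeded: real failure
            time.sleep(retry_delay)

# Simulated service that fails twice, then recovers.
attempts = {"n": 0}
def flaky_service():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("database unreachable")
    return "connected"

result = call_with_resilience(flaky_service, resilience_timeout=1.0)
print(result)  # connected
```

Failover and recovery build on the same idea: when retries on one node are exhausted, the service process moves to a backup node and interrupted tasks are completed there.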


*Web services for Administration*

The Administration Console is a browser-based utility that enables you to
view domain properties and perform basic domain administration tasks, such
as adding domain users, deleting nodes, and creating services.

*Repository Security*

Effective in version 8.1, PowerCenter uses more robust encryption.

Also provides advanced purges for purging obsolete versions of repository objects.


Database partitioning is available. Apart from this, we can configure
dynamic partitioning based on the nodes in a grid, the number of partitions in
the source table, or the number-of-partitions option. The session creates
additional partitions and attributes at run time.


The recovery of workflow, session and task is more robust now. The state of
the workflow/session is now stored in the shared file system and not in the
repository.

We have options to have partitioned FTP targets and Indirect FTP file
source(with file list).


Pushdown optimization

Increases performance by pushing transformation logic to the database
by analyzing the transformations and issuing SQL statements to sources and
targets. The Integration Service processes only the transformation logic that
it cannot push to the database.

*Flat file performance*

We can create more efficient data conversion using the new version.

We can concurrently write to multiple files in a session with partitioned targets.

One can specify a command for a source or target file in a session. The
command can be used to create a source file, like 'cat a file'.

Append to flat file target option is available now.


New features added to pmcmd. We can use infacmd to administer PowerCenter
services.

The infasetup program is used to set up domain and node properties.

*Mappings *

We can now build custom transformation enhancements in the API using C++ and
Java code.

We can assign partition threads to specific custom transformations with the
partition enhancements.

A new Java transformation is added which provides a simple native programming
interface to define transformations in the Java language.

User-defined functions, similar to macros in Excel files. Some new functions are also
added.



Lesson Three: Creating a Pass-Through Mapping, Session and Workflow

In this lesson, we create a very simple mapping and execute it with a session and
workflow. The mapping is based on the last lesson's source and target definitions.
Between the source and target definitions is a source qualifier.

Source Qualifier
- When you add a relational or a flat file source definition to a mapping, you need to
connect it to a Source Qualifier transformation. The Source Qualifier transformation
represents the rows that the PowerCenter Server reads when it runs a session.
- Source Qualifier transformation can perform the following tasks:
 Join data originating from the same source database
 Filter rows when the PowerCenter Server reads source data
 Specify sorted ports
 Select only distinct values from the source
 Create a custom query to issue a special SELECT statement for the PowerCenter
Server to read source data.
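The tasks listed above all boil down to shaping the SELECT that the server issues. A sketch of what that generated query looks like (sqlite3 plays the source database; the employees table is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (dept TEXT, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("IT", "Ann"), ("IT", "Ann"), ("HR", "Bob"), ("IT", "Cy")])

def source_qualifier_sql(table, ports, filter_cond=None,
                         sorted_ports=None, distinct=False):
    """Build the SELECT a Source Qualifier would issue for its options."""
    sql = "SELECT {}{} FROM {}".format(
        "DISTINCT " if distinct else "", ", ".join(ports), table)
    if filter_cond:
        sql += " WHERE " + filter_cond                  # filter rows at the source
    if sorted_ports:
        sql += " ORDER BY " + ", ".join(sorted_ports)   # specify sorted ports
    return sql

sql = source_qualifier_sql("employees", ["dept", "name"],
                           filter_cond="dept = 'IT'",
                           sorted_ports=["name"], distinct=True)
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)  # [('IT', 'Ann'), ('IT', 'Cy')]
```

A custom query (the last bullet) simply replaces this generated SELECT entirely.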

The session and workflow for this lesson are very simple. It looks like most of the
logic is in the mapping. The session is just a link to the mapping, and the
workflow is just a runtime instance of the session.

Lesson Two, Source , Target Definitions and Target Tables

Creating Source Definitions:

In the Source Analyzer, to create a source definition you have to have a source - a database,
flat file or XML. So you can import from the source and edit it, but it seems you can't create
one on your own. For editing, you can add, delete and update columns, and add some metadata
to each table definition.

Creating Target Definitions:

In the Target Designer, to create a target definition, you can drag and drop from the source
definitions table list or create a table yourself. Nothing special - just like creating a normal
table. But remember to select the correct database type.

Creating Target Tables:
In the Target Designer, you can create tables in the database and generate the SQL
statements from your target table definitions.
1. Click Targets > Generate/Execute SQL.
2. In the File Name field, enter the SQL DDL file name.
3. Select the Create Table, Drop Table, Foreign Key and Primary Key options.
4. Click the Generate and Execute button.
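Generate/Execute SQL simply turns the target definition into DDL statements. A toy version of that generation step (the column-metadata format and table name are invented for illustration):

```python
def generate_ddl(table, columns, primary_key=(), drop_first=False):
    """Emit DROP/CREATE statements from a target-definition-like spec."""
    stmts = []
    if drop_first:                       # the "Drop Table" option
        stmts.append(f"DROP TABLE {table}")
    cols = [f"{name} {dtype}" for name, dtype in columns]
    if primary_key:                      # the "Primary Key" option
        cols.append("PRIMARY KEY ({})".format(", ".join(primary_key)))
    stmts.append("CREATE TABLE {} ({})".format(table, ", ".join(cols)))
    return stmts

ddl = generate_ddl("T_CUSTOMER",
                   [("customer_id", "INTEGER"), ("name", "VARCHAR(50)")],
                   primary_key=("customer_id",), drop_first=True)
for stmt in ddl:
    print(stmt)
```

The Generate and Execute button then runs these statements against the target database over the ODBC connection.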

So from what I did and saw, it's simple to create source and target
definitions - just like creating new tables in a database. One thing I don't
like: after importing source definitions from the database, the layout view
is not well arranged like other database logical views. You have to
rearrange them yourself. So many software packages can do this job
automatically; I don't know why PowerCenter Designer can't.

Lesson One, Repository Users, Groups and Folders

This lesson is pretty simple - it just shows you how to connect to a repository. My
environment was already set up by my company's administrator, so I don't need to
worry about installing and setting up the repository server and service.

If you don't have an environment, you have to install the server and do the configuration
yourself. I took a look at that part - you just need a very powerful machine. So I
just ignored it.

This lesson includes connecting to an Informatica repository (domain, server, port,
username and password), creating groups and creating a folder, and setting up permissions
on that created folder. All this is done in Informatica PowerCenter Repository
Manager. Nothing special - everybody can understand it easily.

This lesson also includes creating source tables and data for the tutorial.
You need to have a database and execute the appropriate SQL file in PowerCenter
Designer. I used smpl_ms.sql for SQL Server. This SQL file includes the tables'
schema and data. Informatica PowerCenter connects to a database target or source
by ODBC. I installed MS SQL Server Express on my local box. It's free and
seems OK - it doesn't eat too many resources compared to Oracle. When you set up
ODBC, the server is "localhost\sqlexpress".

After the SQL executes, you can see a few new tables with some data in the SQL
Server master database.

Informatica PowerCenter Architecture

Before the 6 lessons, I must get an understanding of the PowerCenter architecture. This is
not copy and paste from the help - I try to write it down in my own understanding. It's not
a big deal from a developer's view, but it may help a lot in a job interview.

Domain and Service Architecture

- Domain:
 Domain is collection of nodes and services.
 Primary unit to centralize administration.

- Node:
 Logical representation of a machine in a domain.
 One node in each domain serves as a gateway for the domain.
 All processes in PowerCenter run as services on a node.

- Services:
 Two type of services: Core services and Application services.
 Core services: support the domain and application services. E.g. Domain service,
Log service.
 Application services: represent PowerCenter server-based functionality. E.g.
Repository service, Integration service...

- Informatica Repository:
 Contains a set of metadata tables within the repository database that Informatica
applications and tools access.

- Informatica Client:
 Manages users, defines sources and targets, builds mappings and mapplets with
transformation logic, and creates workflows to run the mapping logic. The
Informatica Client has four client applications: Repository Manager, Designer,
Workflow Manager and Workflow Monitor.

Posted by Informatica beginner at 8:07 AM 0 comments

What's my plan
1. First follow the tutorial from the PowerCenter Help.
This includes 6 lessons in total.

2. If I can get Informatica PowerCenter 8 Developer training, I'll follow the training
agenda. I'm trying to get an on-site training combining levels one and two together.

PowerCenter 8 – Level I Developer


 Data Integration Concepts

o Data Integration

o Mapping and Transformations

o Tasks and Workflows

o Metadata

 PowerCenter Components and User Interface

o PowerCenter Architecture

o PowerCenter Client Tools

o Lab - Using the Designer and Workflow Manager

 Source Qualifier

o Source Qualifier Transformation

o Velocity Methodology

o Lab Project Overview

o Lab A - Load Payment Staging Table

o Source Qualifier Joins

o Lab B - Load Product Staging Table

o Source Pipelines

o Lab C - Load Dealership and Promotions Staging Table

 Expression, Filter, File Lists and Workflow Scheduler

o Expression Editor

o Filter Transformation

o File Lists

o Workflow Scheduler

o Lab - Load the Customer Staging Table

 Joins, Features and Techniques I

o Joiner Transformation

o Shortcuts

o Lab A - Load Sales Transaction Staging Table

o Lab B - Features and Techniques I

 Lookups and Reusable Transformations

o Lookup Transformation

o Reusable Transformations

o Lab A - Load Employee Staging Table

o Lab B - Load Date Staging Table

 Debugger

o Debugging Mappings

o Lab - Using the Debugger

 Sequence Generator

o Sequence Generator Transformation

o Lab - Load Date Dimension Table

 Lookup Caching, More Features and Techniques

o Lookup Caching

o Lab A - Load Promotions Dimension Table

o Lab B - Features and Techniques II

 Sorter, Aggregator and Self-Join

o Sorter Transformation

o Aggregator Transformation

o Active and Passive Transformations

o Data Concatenation

o Self-Join

o Lab - Reload the Employee Staging Table

 Router, Update Strategy and Overrides

o Router Transformation

o Update Strategy Transformation

o Expression Default Values

o Source Qualifier Override

o Target Override

o Session Task Mapping Overrides

o Lab - Load Employee Dimension Table

 Dynamic Lookup and Error Logging

o Dynamic Lookup

o Error Logging

o Lab - Load Customer Dimension Table

 Unconnected Lookup, Parameters and Variables

o Unconnected Lookup Transformation

o System Variables

o Mapping Parameters and Variables

o Lab - Load Sales Fact Table

 Mapplets

o Mapplets

o Lab - Load Product Daily Aggregate Table

 Mapping Design

o Designing Mappings

o Workshop

 Workflow Variables and Tasks

o Link Conditions

o Workflow Variables

o Assignment Task

o Decision Task

o Email Task

o Lab - Load Product Weekly Aggregate Table

 More Tasks and Reusability

o Event Raise Task

o Event Wait Task

o Command Task

o Reusable Tasks

o Reusable Session Task

o Reusable Session Configuration

o PMCMD Utility

 Worklets and More Tasks

o Worklets

o Timer Task

o Control Task

o Lab - Load Inventory Fact Table

 Workflow Design

o Designing Workflows

o Workshop (Optional)

PowerCenter 8 – Level II Developer


 Architecture Overview and High Availability

o Architectural overview

o Domains, nodes, and services

o Administration Console

o Configuring services

o High Availability

 Mapping and Session Techniques

o Mapping parameters and variables and parameter files

o File lists

o Dynamic lookup cache

o Data driven aggregation

o Incremental aggregation

o Denormalization

 Workflow Techniques

o Using Tasks

o Workflow Control and Restart

o Workflow Alerts

o Dynamic Scheduling

o Pseudo-looping techniques

 Workflow Recovery

o Workflow recovery principles

o Task recovery strategy

o Workflow recovery options

o State of operation

o Resume recovery strategy

o Recovery using the command line

 Transaction Control

o Database Transactions

o Transaction Control transformation

o PowerCenter transaction control options

o Transformation scope

 Error Handling

o Error categories

o Error logging

o Error handling strategies

 Object Queries, Object Migration, and Comparing Objects

o Creating object queries

o Migration

o Comparing objects

o Repository Reporting

o Metadata Reports

o Repository reports

 Memory Allocation

o Optimizing session memory

o Optimizing transformation caches

o Auto-cache sizing

 Performance Tuning Methodology

o Session dynamics

o Measuring performance

o Testing for bottlenecks

o Optimization techniques

 Pipeline Partitioning

o Pipeline types

o Multi-partition sessions

o Partition points and types

o Using dynamic partitioning

Posted by Informatica beginner at 8:00 AM 0 comments

Informatica Resources
I really can't find many useful websites except Informatica's own website.
But you have to be a partner or customer to access these sites, or have to buy documents.
The lucky thing is I found some Informatica Powercenter 7 documents for free.

Anyway, all I get is just the help in Powercenter 8.

Here is the list:

Informatica Customer Portal

As an Informatica customer, you can access the Informatica Customer Portal site at
http://my.informatica.com. The site contains product information, user group information,
newsletters, access to the Informatica customer support case management system
(ATLAS), the Informatica Knowledge Base, and access to the Informatica user
community.
Informatica Web Site

You can access the Informatica corporate web site at http://www.informatica.com. The
site contains information about Informatica, its background, upcoming events, and sales
offices. You will also find product and partner information. The services area of the site
includes important information about technical support, training and education, and
implementation services.

Informatica Developer Network

You can access the Informatica Developer Network at http://devnet.informatica.com. The

Informatica Developer Network is a web-based forum for third-party software
developers. The site contains information about how to create, market, and support
customer-oriented add-on solutions based on interoperability interfaces for Informatica
products.
Informatica Knowledge Base

As an Informatica customer, you can access the Informatica Knowledge Base at

http://my.informatica.com. Use the Knowledge Base to search for documented solutions
to known technical issues about Informatica products. You can also find answers to
frequently asked questions, technical white papers, and technical tips.

Posted by Informatica beginner at 7:46 AM 0 comments

Why I create this blog
I'm a J2EE developer now, and I'm interested in becoming an Informatica Powercenter ETL
developer. When I google online I can't find many useful websites or articles. So I
think maybe I'll create a blog to record my study trail, and later I may create an
Informatica Powercenter website.

First I'll follow the tutorial in the Informatica Powercenter help as a starting point; after
that I don't know yet. I hope I can find something to continue with, or base it on a real
project in my job.



Informatica PowerCenter 8x Key Concepts – 5

5. Repository Service

As we already discussed the metadata repository, we now discuss a
separate, multi-threaded process that retrieves, inserts and updates metadata in
the repository database tables: the Repository Service.
The Repository Service manages connections to the PowerCenter repository from
PowerCenter client applications like Designer, Workflow Manager, Monitor,
Repository Manager, the console and the Integration Service. The Repository Service is
responsible for ensuring the consistency of metadata in the repository.

Creation & Properties:

Use the PowerCenter Administration Console Navigator window to create a
Repository Service. The properties needed to create one are:

Service Name – name of the service like rep_SalesPerformanceDev

Location – Domain and folder where the service is created
License – license service name
Node, Primary Node & Backup Nodes – Node on which the service process runs
CodePage – The Repository Service uses the character set encoded in the repository code
page when writing data to the repository
Database type & details – Type of database, username, pwd, connect string, etc.
The above properties are sufficient to create a repository service; however, we can take a
look at the following features which are important for better performance and maintenance.
General Properties

> OperatingMode: Values are Normal and Exclusive. Use Exclusive mode to perform
administrative tasks like enabling version control or promoting local to global repository
> EnableVersionControl: Creates a versioned repository

Node Assignments: The “High availability option” is a licensed feature which allows us to
choose Primary & Backup nodes for continuous running of the repository service. Under
normal licenses we would see only one Node to select from.
Database Properties

> DatabaseArrayOperationSize: Number of rows to fetch each time an array database

operation is issued, such as insert or fetch. Default is 100

> DatabasePoolSize: Maximum number of connections to the repository database that

the Repository Service can establish. If the Repository Service tries to establish more
connections than specified for DatabasePoolSize, it times out the connection attempt
after the number of seconds specified for DatabaseConnectionTimeout
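The pool-size/timeout interaction described above is standard bounded-pool behavior. A generic sketch of the idea (not Informatica's actual implementation; the class and parameter names are illustrative analogues of the two settings):

```python
import queue

class ConnectionPool:
    """Bounded pool: acquire blocks, then times out once the pool is exhausted."""
    def __init__(self, pool_size, connect):
        self._free = queue.Queue()
        for _ in range(pool_size):           # DatabasePoolSize analogue
            self._free.put(connect())

    def acquire(self, timeout):              # DatabaseConnectionTimeout analogue
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("no connection available within timeout")

    def release(self, conn):
        self._free.put(conn)

pool = ConnectionPool(pool_size=2, connect=lambda: object())
a = pool.acquire(timeout=0.1)
b = pool.acquire(timeout=0.1)
try:
    pool.acquire(timeout=0.1)                # third attempt exceeds the pool size
    timed_out = False
except TimeoutError:
    timed_out = True
pool.release(a)                              # returning a connection frees a slot
print(timed_out)  # True
```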

Advanced Properties
> CommentsRequiredForCheckin: Requires users to add comments when checking in
repository objects.

> Error Severity Level: Level of error messages written to the Repository Service log.
Specify one of the following message levels: Fatal, Error, Warning, Info, Trace & Debug

> EnableRepAgentCaching: Enables repository agent caching. Repository agent caching

provides optimal performance of the repository when you run workflows. When you
enable repository agent caching, the Repository Service process caches metadata
requested by the Integration Service. Default is Yes.
> RACacheCapacity: Number of objects that the cache can contain when repository agent
caching is enabled. You can increase the number of objects if there is available memory
on the machine running the Repository Service process. The value must be between 100
and 10,000,000,000. Default is 10,000
> AllowWritesWithRACaching: Allows you to modify metadata in the repository when
repository agent caching is enabled. When you allow writes, the Repository Service
process flushes the cache each time you save metadata through the PowerCenter Client
tools. You might want to disable writes to improve performance in a production
environment where the Integration Service makes all changes to repository metadata.
Default is Yes.
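RACacheCapacity behaves like the capacity bound of a metadata cache, and the flush-on-write behavior of AllowWritesWithRACaching follows the same pattern. A generic LRU sketch of that idea (not the actual repository agent code; names are illustrative):

```python
from collections import OrderedDict

class MetadataCache:
    """Capacity-bounded LRU cache, in the spirit of RACacheCapacity."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:  # evict least recently used
            self._data.popitem(last=False)

    def flush(self):                         # what a metadata write triggers
        self._data.clear()

cache = MetadataCache(capacity=2)
cache.put("mapping_a", {"ports": 5})
cache.put("mapping_b", {"ports": 7})
cache.get("mapping_a")                       # touch a, so b becomes LRU
cache.put("mapping_c", {"ports": 2})         # evicts mapping_b
print(cache.get("mapping_b"), cache.get("mapping_a"))  # None {'ports': 5}
```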

Environment Variables

The database client code page on a node is usually controlled by an environment

variable. For example, Oracle uses NLS_LANG, and IBM DB2 uses DB2CODEPAGE. All
Integration Services and Repository Services that run on this node use the same
environment variable. You can configure a Repository Service process to use a different

iGATE Internal
value for the database client code page environment variable than the value set for the

You might want to configure the code page environment variable for a Repository
Service process when the Repository Service process requires a different database client
code page than the Integration Service process running on the same node.

For example, the Integration Service reads from and writes to databases using the UTF-8
code page. The Integration Service requires that the code page environment variable be
set to UTF-8. However, you have a Shift-JIS repository that requires that the code page
environment variable be set to Shift-JIS. Set the environment variable on the node to
UTF-8. Then add the environment variable to the Repository Service process properties
and set the value to Shift-JIS.
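Setting a different code page variable for one service process, as in the Shift-JIS example, is a per-process environment override. A sketch with a child process standing in for the Repository Service process (the NLS_LANG values are illustrative):

```python
import os
import subprocess
import sys

# Node-wide default, as if set for the Integration Service.
os.environ["NLS_LANG"] = "AMERICAN_AMERICA.UTF8"

# Launch a child process with the same environment except for an
# overridden code page variable - the parent's value is untouched.
child_env = dict(os.environ, NLS_LANG="JAPANESE_JAPAN.JA16SJIS")
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['NLS_LANG'])"],
    env=child_env, capture_output=True, text=True)

child_value = result.stdout.strip()
print(os.environ["NLS_LANG"], "->", child_value)
# AMERICAN_AMERICA.UTF8 -> JAPANESE_JAPAN.JA16SJIS
```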

 Informatica PowerCenter 8x Key Concepts – 1

We shall look at the fundamental components of the Informatica PowerCenter 8.x Suite,
the key components are

1. PowerCenter Domain
2. PowerCenter Repository
3. Administration Console
4. PowerCenter Client
5. Repository Service
6. Integration Service

PowerCenter Domain

A domain is the primary unit for management and administration of services in

PowerCenter. Node, Service Manager and Application Services are components of a
domain.
Node is the logical representation of a machine in a domain. The machine in which the
PowerCenter is installed acts as a Domain and also as a primary node. We can add other
machines as nodes in the domain and configure the nodes to run application services such
as the Integration Service or Repository Service. All service requests from other nodes in
the domain go through the primary node, also called the ‘master gateway’.

The Service Manager

The Service Manager runs on each node within a domain and is responsible for starting
and running the application services. The Service Manager performs the following
functions:
 Alerts. Provides notifications of events like shutdowns, restart

 Authentication. Authenticates user requests from the Administration Console,
PowerCenter Client, Metadata Manager, and Data Analyzer
 Domain configuration. Manages configuration details of the domain like machine
name, port
 Node configuration. Manages configuration details of a node metadata like
machine name, port
 Licensing. When an application service connects to the domain for the first time
the licensing registration is performed and for subsequent connections the
licensing information is verified
 Logging. Manages the event logs from each service, the messages could be
‘Fatal’, ‘Error’, ‘Warning’, ‘Info’
 User management. Manages users, groups, roles, and privileges

Application services

The services that essentially perform data movement, connect to different data sources
and manage data are called Application services, they are namely Repository Service,
Integration Service, Web Services Hub, SAPBW Service, Reporting Service and
Metadata Manager Service. The application services run on each node based on the way
we configure the node and the application service

Domain Configuration

Some of the configuration for a domain involves assigning host names and port numbers to
the nodes, setting up Resilience Timeout values, providing connection information for the
metadata database, SMTP details, etc. All the configuration information for a domain is
stored in a set of relational database tables within the repository. Some of the global
properties that are applicable for Application Services, like ‘Maximum Restart Attempts’,
‘Dispatch Mode’ as ‘Round Robin’/’Metric Based’/’Adaptive’ etc., are configured under
Domain Configuration.

Informatica 7.x vs 8.x


Informatica 8.1 has an addition of transformations and supports different unstructured
data sources:

1. SQL transformation
2. Java transformation
3. Support for unstructured data like emails, Word docs, and PDFs
4. In the Custom transformation we can build the transformation using Java or VC++
5. The concept of flat file update is also introduced in 8.x

Object Permissions

Effective in version 8.1.1, you can assign object permissions to users when
you add a user account, set user permissions, or edit an object.

Gateway and Logging Configuration

Effective in version 8.1, you configure the gateway node and location for
log event files on the Properties tab for the domain. Log events describe
operations and error messages for core and application services, workflows
and sessions.

Log Manager runs on the master gateway node in a domain.

We can configure the maximum size of logs for automatic purge in megabytes.
PowerCenter 8.1 also provides enhancements to the Log Viewer and log events.

Unicode compliance

Effective in version 8.1, all fields in the Administration Console accept
Unicode characters. One can choose the UTF-8 character set as the repository
code page to store multiple languages.

Memory and CPU Resource Usage

You may notice an increase in memory and CPU resource usage on machines
running PowerCenter Services.

Domain Configuration Database

PowerCenter stores the domain configuration in a database.

License Usage

Effective in version 8.0, the Service Manager registers license information.

High Availability

High availability is the PowerCenter option that eliminates a single point
of failure in the PowerCenter environment and provides minimal service
interruption in the event of failure. High availability provides the
following functionality:
Resilience: Resilience is the ability for services to tolerate transient
failures, such as loss of connectivity to the database or network failure


 Few tips related to Informatica 8.x environment

Let us discuss some special scenarios that we might face in Informatica 8.x environments.

A. In Informatica 8.x, multiple integration services can be enabled under one node. If
there is a need to determine the process associated with an Integration Service or Repository
Service, it can be done as follows.
If there are multiple Integration Services enabled on a node, there are multiple pmserver
processes running on the same machine. In PowerCenter 8.x, it is not possible to differentiate
between the processes and correlate each to a particular Integration Service, unlike in 7.x where
every pmserver process is associated with a specific pmserver.cfg file. Likewise, if there are
multiple Repository Services enabled on a node, there are multiple pmrepagent processes
running on the same machine, and it is likewise not possible to differentiate between
the processes and correlate each to a particular Repository Service.
To do this in 8.x, do the following:

1. Log on to the Administration Console

2. Click on Logs > Display Settings.

3. Add Process to the list of columns to be displayed in the Log Viewer.

4. Refresh the log display.

5. Use the PID from this column to identify the process as follows:

On UNIX, run the following command, where pid is the process ID of the service
process:

ps -ef | grep pid

On Windows:

a. Run Task Manager.
b. Select the Processes tab.
c. Scroll to the value in the PID column that matches the PID displayed in the
PowerCenter Administration Console.

B. Sometimes, the PowerCenter Administration Console URL is inaccessible from some
machines even when the Informatica services are running. The following error is displayed on
the browser:

“The page cannot be displayed”

The reason for this is an invalid or missing configuration in the hosts file on the client
machine.
To resolve this error, do the following:

1. Edit the hosts file located in the windows/system32/drivers/etc folder on the machine
from which the Administration Console is being accessed.
2. Add the host IP address and the host name (for the host where the PowerCenter
services are installed).

Example ha420f3

3. Launch the Administration Console and access the login page by typing the URL
http://<host>:<port>/adminconsole in the browser address bar.

Ensure that the host name in the URL matches the host entry in the hosts file.

 HOW TO: Disable the XMLW_31213 error messages in the session log


Problem Description

The following error appears in the session log:

Error XMLW_31213 : Row rejected due to duplicate primary key


Data is written correctly to the target, but the above error appears multiple times in the
session log, which can grow to a very large size. Is there a way to stop this message from
being written to the session log?


In order to fix the issue, do the following:

1. Select the Integration Service in the Administration Console.
2. Go to Properties > Configuration Properties.
3. Click Edit.
4. Clear the XMLWarnDupRows option.

More Information

Setting the custom property XMLWarnDupRows to “NO” will not resolve this issue, as
the custom property has been replaced by the XMLWarnDupRows Integration Service
property.

To use the option on PowerCenter 8.5.1 install the latest HotFix.


PowerCenter Administrator Guide > “Creating and Configuring the Integration Service”
> “Configuring the Integration Service Properties” > “Configuration Properties” > “Table
9-6. Configuration Properties for an Integration Service”

 HOW TO: Achieve High Availability for Web Services Hub in the event of a failover


Using High Availability for the Web Services Hub, a client application can access the Web
Service even in the event of a failover of the Web Service to another node (when the URL
changes). To achieve this, use either of the following:

 Third party load balancer

 Sample third party load balancers provided by Informatica (used only for test
environments)
Third party load balancer

In the Administration Console, provide the HubLogicalAddress in the Advanced properties
of the Web Services Hub. For the HubLogicalAddress, provide the URL of the third party
load balancer that manages the Web Services Hub. This URL is published in the WSDL
for all Web Services that run on a Web Services Hub managed by the load balancer.

With a load balancer, the client application sends a request to the load balancer and the
load balancer routes the request to an available Web Services Hub. Any of the Web
Services Hub services can process requests from the client application. The load balancer
does not verify that the host names and port numbers given for the Web Services Hub
services are valid or that the services are running.

Before you send requests through the load balancer, ensure that the Web Services Hub
services are available.

Some of the third party load balancers that can be used (available on the Web) are Apache
tcpmon and Apache jmeter.

Sample third party load balancers provided by Informatica

Informatica provides sample third party load balancers (for use only in a test
environment). They can be used to understand the usage of a load balancer with the Web
Services Hub. This sample third party load balancer is present at

Before using load balancer, read the Readme.txt file which is present in the same path.
This guides you on:

 How the load balancer works

 How to set up the configuration file that is used by the load balancer


The Informatica sample third party load balancers can be used only in a test
environment, not in a production environment.




What is Normalization?

Normalization is the process of efficiently organizing data in a database. There are two
goals of the normalization process:

Eliminating redundant data.

Ensuring data dependencies make sense (storing only related data in a table).

First Normal Form

First Normal form (1 NF) sets the very basic rules for an organized database.

 Eliminate duplicative columns from the same table

 Create separate tables for each group of related data and identify each
row with a unique column or set of columns (the primary key)

Second Normal Form

Second Normal form (2 NF) further addresses the concept of removing duplicative data.

 Meet all the requirements of the first normal form.

 Remove subsets of data that apply to multiple rows of a table and place
them in separate tables.

 Create relationships between these new tables and their predecessors

through the use of foreign keys.

Third Normal Form

Third Normal Form (3NF) removes columns that are not dependent upon the primary key.

 Meet all the requirements of the second normal form.

 Remove columns that are not dependent upon the primary key.
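As a quick illustration of these rules (a hypothetical orders example, not from the text): a city column that depends on the customer rather than on the order key is a transitive dependency, and 3NF moves it into its own table.

```python
# Denormalized orders: customer_city depends on customer, not on
# order_id -- the transitive dependency that 3NF removes.
orders = [
    {"order_id": 1, "customer": "ACME", "customer_city": "Pune",    "amount": 100},
    {"order_id": 2, "customer": "ACME", "customer_city": "Pune",    "amount": 250},
    {"order_id": 3, "customer": "ZEN",  "customer_city": "Chennai", "amount": 75},
]

def normalize(rows):
    """Split into an orders table keyed by order_id and a customers
    table keyed by customer name, storing each city only once."""
    customers = {}
    slim_orders = []
    for r in rows:
        customers[r["customer"]] = {"customer": r["customer"],
                                    "city": r["customer_city"]}
        slim_orders.append({"order_id": r["order_id"],
                            "customer": r["customer"],  # foreign key
                            "amount": r["amount"]})
    return slim_orders, list(customers.values())

slim_orders, customers = normalize(orders)
print(len(customers))   # 2 -- each customer/city pair is now stored once
```

Re-joining slim_orders to customers on the customer key reproduces the original rows, which is the check that no information was lost in the split.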

 Handling Oracle Exceptions

There may be requirements wherein certain Oracle exceptions need to be treated as
warnings and certain exceptions need to be treated as fatal.

Normally, a fatal Oracle error may not be registered as a warning or row error and the
session may not fail; conversely, a non-fatal error may cause a PowerCenter session to
fail. This behavior can be changed with a few tweaks in:

A. Oracle Stored Procedure

B. The Oracle ErrorActionFile and

C. Server Settings

Let us see this with an example.

An Oracle Stored Procedure under certain conditions returns the exception

NO_DATA_FOUND. When this exception occurs, the session calling the Stored
Procedure does not fail.

Adding an entry for this error in the ora8err.act file and enabling the
OracleErrorActionFile option does not change this behavior (Both ora8err.act and
OracleErrorActionFile are discussed in later part of this blog).

When this exception (NO_DATA_FOUND) is raised in PL/SQL it is sent to the Oracle
client as an informational message not an error message and the Oracle client sends this
message to PowerCenter. Since the Oracle client does not return an error to PowerCenter
the session continues as normal and will not fail.

A. Modify the Stored Procedure to return a different exception or a custom exception. A

custom exception number (only between -20000 and -20999) can be sent using the
raise_application_error PL/SQL command as follows:

raise_application_error(-20991, '<stored procedure name> has raised an error', true);

Additionally add the following entry to the ora8err.act file:

20991, F

B. Editing the Oracle Error Action file can be done as follows:

1. Go to the server/bin directory under the Informatica Services installation directory

(8.x) or the Informatica Server installation directory (7.1.x).


For Infa 7.x

C:\Program Files\Informatica PowerCenter 7.1.3\Server\ora8err.act

For Infa 8.x


2. Open the ora8err.act file.

3. Change the value associated with the error.

“F” is fatal and stops the session.
“R” is a row error and writes the row to the reject file and continues to the next row.


To fail a session when the ORA-03114 error is encountered change the 03114 line in the
file to the following:

03114, F

To return a row error when the ORA-02292 error is encountered change the 02292 line to
the following:

02292, R

Note that the Oracle action file only applies to native Oracle connections in the session. If
the target is using the SQL*Loader external loader option, the message status will not be
modified by the settings in this file.

C. Once the file is modified, the following changes need to be made at the server level.

Infa 8.x

Set the OracleErrorActionFile Integration Service Custom Property to the name of the
file (ora8err.act by default) as follows:

1. Connect to the Administration Console.

2. Stop the Integration Service.

3. Select the Integration Service.

4. Under the Properties tab, click Edit in the Custom Properties section.

5. Under Name enter OracleErrorActionFile.

6. Enter ora8err.act for the parameter under Value.

7. Click OK.

8. Start the Integration Service.

PowerCenter 7.1.x

In PowerCenter 7.1.x do the following:


For the server running on UNIX:

1. Using a text editor open the PowerCenter server configuration file (pmserver.cfg).

2. Add the following entry to the end of the file:


3. Re-start the PowerCenter server (pmserver).


For the server running on Windows:

1. Click Start, click Run, type regedit, and click OK.

2. Go to the following registry key:


3. Select Edit > New > String Value and enter OracleErrorActionFile as the name of the
string value.
4. Select Edit > Modify.

5. Enter the directory and the file name of the Oracle error action file:



The default entry for PowerCenter 7.1.3 would be:

C:\Program Files\Informatica PowerCenter 7.1.3\Server\ora8err.act

And for PowerCenter8.1.1 it would be


6. Click OK.

 Informatica and Oracle hints in SQL overrides

Hints used in a SQL statement send instructions to the Oracle optimizer, which can
shorten query processing time. Can we make use of these hints in SQL overrides within
our Informatica mappings so as to improve query performance?

On a general note, any Informatica help material would suggest that you can enter any valid
SQL statement supported by the source database in the SQL override of a Source Qualifier
or a Lookup transformation, or at the session properties level.

While using them as part of a Source Qualifier has no complications, using them in a Lookup
SQL override gets a bit tricky. Use of a forward slash followed by an asterisk (“/*”) in a
lookup SQL override (generally used for comments in SQL and at times for Oracle hints)
results in session failure with an error like:

TE_7017 : Failed to Initialize Server Transformation lkp_transaction

2009-02-19 12:00:56 : DEBUG : (18785 | MAPPING) : (IS | Integration_Service_xxxx) :
node01_UAT-xxxx : DBG_21263 : Invalid lookup override


FULL(AC_Sales) PARALLEL(AC_Sales,12) */ min(OrderSeq) From AC_Sales)

This is because Informatica’s parser fails to recognize this special character when used in a
Lookup override. Starting with the PowerCenter 7.1.3 release, a parameter is available
which enables the use of the forward slash, and hence of hints.

 Infa 7.x
1. Using a text editor open the PowerCenter server configuration file (pmserver.cfg).
2. Add the following entry at the end of the file:
3. Re-start the PowerCenter server (pmserver).

 Infa 8.x
1. Connect to the Administration Console.
2. Stop the Integration Service.
3. Select the Integration Service.
4. Under the Properties tab, click Edit in the Custom Properties section.
5. Under Name enter LookupOverrideParsingSetting
6. Under Value enter 1.
7. Click OK.
8. And start the Integration Service.

 Starting with PowerCenter 8.5, this change could be done at the session task itself
as follows:

1. Edit the session.

2. Select Config Object tab.
3. Under Custom Properties add the attribute LookupOverrideParsingSetting and set the Value
to 1.
4. Save the session.

Tags: HINTS in Lookup SQL override, Oracle HINTS in Informatica


Star schema architecture is the simplest data warehouse design. The main feature of a star
schema is a table at the center, called the fact table and the dimension tables which allow
browsing of specific categories, summarizing, drill-downs and specifying criteria.
Typically, most of the fact tables in a star schema are in database third normal form,
while dimensional tables are de-normalized (second normal form).

Fact table

The fact table is not a typical relational database table as it is de-normalized on purpose -
to enhance query response times. The fact table typically contains records that are ready
to explore, usually with ad hoc queries. Records in the fact table are often referred to as
events, due to the time-variant nature of a data warehouse environment.
The primary key for the fact table is a composite of all the columns except numeric
values / scores (like QUANTITY, TURNOVER, exact invoice date and time).

Typical fact tables in a global enterprise data warehouse are (apart from these, there may be
some company- or business-specific fact tables):

sales fact table - contains all details regarding sales

orders fact table - in some cases the table can be split into open orders and historical
orders. Sometimes the values for historical orders are stored in a sales fact table.
budget fact table - usually grouped by month and loaded once at the end of a year.
forecast fact table - usually grouped by month and loaded daily, weekly or monthly.
inventory fact table - report stocks, usually refreshed daily

Dimension table

Nearly all of the information in a typical fact table is also present in one or more
dimension tables. The main purpose of maintaining Dimension Tables is to allow
browsing the categories quickly and easily.
The primary keys of each of the dimension tables are linked together to form the
composite primary key of the fact table. In a star schema design, there is only one de-
normalized table for a given dimension.

Typical dimension tables in a data warehouse are:

time dimension table

customers dimension table
products dimension table
key account managers (KAM) dimension table
sales office dimension table
Star schema example

An example of a star schema architecture is depicted below.
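The diagram itself does not survive in this copy; as a substitute, here is a minimal, hypothetical star schema in SQL (run through Python's sqlite3; table and column names are illustrative), showing the composite primary key built from the dimension foreign keys and an additive Sales Dollar measure:

```python
import sqlite3

# A minimal star schema: one central fact table whose composite primary
# key is made up of foreign keys to the surrounding dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    time_id      INTEGER REFERENCES dim_time(time_id),
    product_id   INTEGER REFERENCES dim_product(product_id),
    customer_id  INTEGER REFERENCES dim_customer(customer_id),
    sales_dollar REAL,                       -- the additive measure
    PRIMARY KEY (time_id, product_id, customer_id)
);
INSERT INTO dim_time VALUES (1, 'Jan'), (2, 'Feb');
INSERT INTO dim_product VALUES (10, 'Widget');
INSERT INTO dim_customer VALUES (100, 'ACME');
INSERT INTO fact_sales VALUES (1, 10, 100, 500.0), (2, 10, 100, 300.0);
""")

# "Browsing by a dimension": sum the additive measure per month.
rows = con.execute("""
    SELECT t.month, SUM(f.sales_dollar)
    FROM fact_sales f JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY t.month ORDER BY t.month
""").fetchall()
print(rows)   # [('Feb', 300.0), ('Jan', 500.0)]
```

Grouping the measure by any dimension attribute in this way is the summarizing and drill-down behavior the text attributes to the star layout.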

Snowflake schema architecture is a more complex variation of a star schema design. The
main difference is that dimensional tables in a snowflake schema are normalized, so they
have a typical relational database design.

Snowflake schemas are generally used when a dimensional table becomes very big and
when a star schema can’t represent the complexity of a data structure. For example if a
PRODUCT dimension table contains millions of rows, the use of snowflake schemas
should significantly improve performance by moving out some data to other table (with
BRANDS for instance).

The problem is that the more normalized the dimension table is, the more complicated
SQL joins must be issued to query them. This is because in order for a query to be
answered, many tables need to be joined and aggregates generated.

An example of a snowflake schema architecture is depicted below.


For each star schema or snowflake schema it is possible to construct a fact constellation
schema. This schema is more complex than the star or snowflake architecture because it
contains multiple fact tables. This allows dimension tables to be shared amongst many
fact tables.
That solution is very flexible; however, it may be hard to manage and support.

The main disadvantage of the fact constellation schema is a more complicated design
because many variants of aggregation must be considered.

In a fact constellation schema, different fact tables are explicitly assigned to the
dimensions that are relevant for the given facts. This may be useful in cases when some
facts are associated with a given dimension level and other facts with a deeper dimension
level.

Use of this model is reasonable when, for example, there is a sales fact table (with
details down to the exact date and invoice header id) and a sales forecast fact table
calculated based on month, client id and product id.

In that case using two different fact tables on a different level of grouping is realized
through a fact constellation model.

Data Integration Challenge – Understanding Lookup Process–

One of the basic ETL steps that we use in most ETL jobs during development is the
‘lookup’. We shall discuss further what a lookup process is, when to use it, how it works,
and some points to be considered while using a lookup process.

What is lookup process?

During the process of reading records from a source system and loading them into a target
table, if we query another table or file (called the ‘lookup table’ or ‘lookup file’) to retrieve
additional data, that is called a ‘lookup process’. The ‘lookup table or file’ can reside on
the target or the source system. Usually we pass one or more column values that have been
read from the source system to the lookup process in order to filter and get the required
data.
How ETL products implement lookup process?

There are three ways ETL products perform ‘lookup process’

 Direct Query: Run the required query against the table or file whenever the
‘lookup process’ is called up
 Join Query: Run a query joining the source and the lookup table/file before
starting to read the records from the source.
 Cached Query: Run a query to cache the data from the lookup table/file local to
the ETL server as a cache file. When the data flow from source then run the
required query against the cache file whenever the ‘lookup process’ is called up

Most of the leading products like Informatica, DataStage support all the three ways in
their product architecture. We shall see the pros and cons of this process and how these
work in part II.
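The first and third styles can be sketched as a toy model in Python (in-memory rows, not any product's API); the cached variant builds a lookup structure once, so each probe avoids re-scanning the lookup table:

```python
# Toy lookup table: customer_id -> region. In a real job this is a
# database table or file; the names here are illustrative.
lookup_rows = [(1, "EAST"), (2, "WEST"), (3, "NORTH")]
source_rows = [{"customer_id": 2, "amount": 10},
               {"customer_id": 3, "amount": 20}]

def direct_query(cust_id):
    """Direct Query style: one scan of the lookup table per source row."""
    for cid, region in lookup_rows:
        if cid == cust_id:
            return region
    return None

# Cached Query style: build the cache once, then each lookup is a probe.
cache = dict(lookup_rows)

for row in source_rows:
    region = cache.get(row["customer_id"])           # probe the cache
    assert region == direct_query(row["customer_id"])  # same answer
    row["region"] = region

print([r["region"] for r in source_rows])   # ['WEST', 'NORTH']
```

Both styles return the same values; the trade-off discussed in the text is purely about when and how often the lookup data is read.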

Transformation : Normalizer (nrm_) Explanation : The Normalizer transformation is an
active and connected transformation.


Use the Normalizer transformation with COBOL sources, which are often stored in a
denormalized format.
You can also use the Normalizer transformation with relational sources to create multiple
rows from a single row of data.


Objective : To create a mapping which converts a single row into multiple rows.

Mapping Flow : Source Definition (Flat File) > Source Qualifier > Expression (column
names) > Normalizer transformation (converts single row into multiple rows)> Target
Definition (flat file)

Designing : We designed this mapping using the INFORMATICA tool, version 7.1.1.

Description :

Source Definition


100 JOHN 2000 23
101 SMITH 5000 41
102 LUCKY 6000 32

Target Definition


100 ESAL 2000
100 EAGE 23
101 ESAL 5000
101 EAGE 41
102 ESAL 6000
102 EAGE 32
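The same single-row-to-multiple-rows pivot can be sketched in Python; the column names (employee number, name, salary, age) are assumptions, since the header rows of the definitions did not survive in this copy:

```python
# Column names (EMPNO, ENAME, ESAL, EAGE) are assumed; only the sample
# data rows survive in the source text.
source = [(100, "JOHN", 2000, 23),
          (101, "SMITH", 5000, 41),
          (102, "LUCKY", 6000, 32)]

def normalize_rows(rows):
    """Emit two target rows per source row, one per measure column --
    the effect the Normalizer transformation produces in this mapping."""
    out = []
    for empno, _ename, esal, eage in rows:
        out.append((empno, "ESAL", esal))
        out.append((empno, "EAGE", eage))
    return out

for rec in normalize_rows(source):
    print(*rec)   # e.g. "100 ESAL 2000", matching the target definition
```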


Exceptions in Informatica – 2
Let us see few more strange exceptions in Informatica

1. Sometimes the Session fails with the below error message.

“FATAL ERROR : Caught a fatal signal/exception
FATAL ERROR : Aborting the DTM process due to fatal signal/exception.”

There might be several reasons for this. One possible reason could be the way the
function SUBSTR is used in the mappings, like the length argument of the SUBSTR
function being specified incorrectly.


In this example MOBILE_NUMBER is a variable port and is 24 characters long.

When the field itself is 24 characters long, a SUBSTR that starts at position 2 and goes for
a length of 24 would end at the 25th character, beyond the end of the field.

To solve this, correct the length argument so that it does not go beyond the length of the
field, or omit the length argument to return the entire string from the start position.

In this example modify the expression as follows:
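The original and corrected expressions did not survive in this copy, so the sketch below is only a hypothetical illustration of the rule, using a Python mimic of a 1-based SUBSTR (Oracle's own SUBSTR silently truncates; the point is keeping start plus length within the field width):

```python
def substr(s, start, length=None):
    """A rough mimic of a 1-based SUBSTR. This version, like Oracle's,
    silently truncates at the end of the string; the lesson from the DTM
    crash is to keep the declared length within the field width."""
    i = start - 1
    return s[i:] if length is None else s[i:i + length]

mobile = "0" * 24             # a 24-character field, as in the text
bad = substr(mobile, 2, 24)   # asks for characters 2..25: past the end
good = substr(mobile, 2)      # omit the length: rest of the string
print(len(bad), len(good))    # 23 23 -- only 23 characters exist after pos 1
```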




2. The following error can occur at times when a session is run

“TE_11015 Error in xxx: No matching input port found for output port OUTPUT_PORT
TM_6006 Error initializing DTM for session…”

Where xxx is a Transformation Name.

This error occurs when there is corruption in the transformation.
To resolve this, recreate the transformation in the mapping having this error.

3. At times you get the below problems,

1. When opening the Designer, you get “Exception access violation” or “Unexpected
condition detected” errors.
2. Unable to see the navigator window, output window or the overview window in
designer even after toggling it on.

3. Toolbars or checkboxes are not showing up correctly.

These are all indications that the pmdesign.ini file might be corrupted. To solve this,
following steps need to be followed.

1. Close Informatica Designer

2. Rename the pmdesign.ini (in c:\winnt\system32 or c:\windows\system).
3. Re-open the designer.

When PowerMart opens the Designer, it will create a new pmdesign.ini if it doesn’t find
an existing one. Even reinstalling the PowerMart clients will not create this file if an
existing one is found.

Informatica Exceptions – 3

Here are few more Exceptions:

1. There are occasions where sessions fail with the following error in the Workflow
Monitor:
“First error code [36401], message [ERROR: Session task instance [session XXXX]:
Execution terminated unexpectedly.] “

where XXXX is the session name.

The server log/workflow log shows the following:

“LM_36401 Execution terminated unexpectedly.”

To determine the error do the following:

a. If the session fails before initialization and no session log is created look for errors in
Workflow log and pmrepagent log files.

b. If the session log is created and if the log shows errors like

“Caught a fatal signal/exception” or

“Unexpected condition detected at file [xxx] line yy”

then a core dump has been created on the server machine. In this case Informatica
Technical Support should be contacted with specific details. This error may also occur
when the PowerCenter server log becomes too large and the server is no longer able to
write to it. In this case a workflow and session log may not be completed. Deleting or
renaming the PowerCenter Server log (pmserver.log) file will resolve the issue.

2. Given below is not an exception but a scenario which most of us would have come
across.
Rounding problems occur with columns in the source defined as Numeric with precision
and scale, or lookups fail to match on the same columns. Floating point arithmetic is
always prone to rounding errors (e.g. the number 1562.99 may be represented internally
as 1562.988888889, very close but not exactly the same). This can also affect functions
that work with scale such as the Round() function. To resolve this do the following:

a. Select the Enable high precision option for the session.

b. Define all numeric ports as Decimal datatype with the exact precision and scale
desired. When high precision processing is enabled, the PowerCenter Server supports
numeric values up to 28 digits. However, the tradeoff is a performance hit (actual
performance really depends on how many decimal ports there are).
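The rounding effect is easy to reproduce; Python's decimal module plays the role of the exact-precision Decimal datatype the text recommends:

```python
from decimal import Decimal

# Binary floating point cannot represent most decimal fractions exactly,
# so sums drift and equality tests (lookups) silently fail.
print(0.1 + 0.2 == 0.3)                                   # False

# Fixed-point decimals compare exactly, so a lookup on the value matches.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True

# Functions that work with scale are affected too: the float nearest to
# 2.675 is slightly below it, so half-up rounding intuition fails.
print(round(2.675, 2))                                    # 2.67, not 2.68
```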

Exceptions in Informatica

There exists no product/tool without strange exceptions/errors; we will see some of those
here.

1. You get the below error when you do “Generate SQL” in a Source Qualifier and try to
validate it:

“Query should return exactly n field(s) to match field(s) projected from the Source
Qualifier”

where n is the number of fields projected from the Source Qualifier.

Possible reasons for this to occur are:

1. The order of ports may be wrong.
2. The number of ports in the transformation may be more or fewer than projected.
3. Sometimes you will have the correct number of ports in the correct order and still face
this error; in that case, make sure that the owner name and schema name are specified
correctly for the tables used in the Source Qualifier query.

2. The following error occurs at times when an Oracle table is used

/common/odl/oracle8/oradriver.cpp] line [xxx]”
Where xxx is some line number mostly 241, 291 or 416.

Possible reasons are

1. Use DataDirect Oracle ODBC driver instead of the driver “Oracle in

2. If the table has been imported using Oracle drivers which are not supported, then
columns with the Varchar2 data type are replaced by the String data type and Number
columns are imported with precision zero (0).

3. Recently I encountered the below error while trying to save a Mapping.

Unexpected Condition Detected

Warning: Unexpected condition at: statbar.cpp: 268
Contact Informatica Technical Support for assistance

This happens when there is not enough memory in the system. To resolve this we can
either:

1. Increase the virtual memory in the system, or

2. If you continue to receive the same error even after increasing the virtual
memory, in the Designer go to Tools > Options, select the General tab and clear the
“Save MX Data” option.




Dimension Table vs Fact Table

 A dimension table provides the context/descriptive information for a fact table; a
fact table provides the measurements of an enterprise.
 Structure of a dimension table: a surrogate key, one or more other fields that
compose the natural key (NK), and a set of attributes. A measurement is the amount
determined by observation.
 Structure of a fact table: foreign keys (FK), degenerate dimensions and
measurements.
 The size of a dimension table is smaller than that of a fact table.
 In a schema, more dimension tables are present than fact tables; fewer fact tables
are observed compared to dimension tables.
 In a dimension table, the surrogate key is used to prevent primary key (PK)
violations (to store historical data); in a fact table, the composite of foreign key
and degenerate dimension fields acts as the primary key.
 Dimension tables provide entry points to the data.
 Values of dimension fields are in numeric and text representation; values of fact
table measures are always in numeric or integer form.




Data Mart vs Data Warehouse

 A data mart is a scaled-down version of the data warehouse that addresses only one
subject, like the Sales department or HR department. A data warehouse is a database
management system that facilitates on-line analytical processing by allowing the data
to be viewed in different dimensions or perspectives to provide business intelligence.
 A data mart has one fact table with multiple dimension tables; a data warehouse has
more than one fact table and multiple dimension tables.
 [Sales Department] [HR Department] [Manufacturing Department] as separate data
marts, versus [Sales Department, HR Department, Manufacturing Department]
together in a data warehouse.
 Small organizations prefer data marts; bigger organizations prefer data warehouses.



Data Mining vs Web Mining

 Data mining involves using techniques to find underlying structure and relationships
in large amounts of data; Web mining involves the analysis of the Web server logs of
a Web site.
 Data mining products tend to fall into five categories: neural networks, knowledge
discovery, data visualization, fuzzy query analysis and case-based reasoning. The
Web server logs contain the entire collection of requests made by a potential or
current customer through their browser and the responses by the Web server.

OLTP (On-Line Transaction Processing) vs OLAP (On-Line Analytical Processing)

 OLTP continuously updates data; OLAP data is read-only.
 OLTP tables are in normalized form; OLAP tables are partially normalized/denormalized.
 OLTP accesses single records; OLAP reads multiple records for analysis purposes.
 OLTP holds current data; OLAP holds current and historical data.
 OLTP records are maintained using a primary key; OLAP records are based on a
surrogate key field.
 In OLTP you can delete a table or record; in OLAP you cannot delete records.
 OLTP has a complex data model; OLAP has a simplified data model.

 Data Integration Challenge – Understanding Lookup Process


In Part II we discussed ‘when to use’ and ‘when not to use’ each type of lookup
process: the direct query lookup, the join based lookup and the cache file based lookup.
Now we shall see the points to be considered for better performance of these
‘lookup’ types.
In the case of Direct Query the following points are to be considered
 Index on the lookup condition columns
 Selecting only the required columns

In the case of Join based lookup, the following points are to be considered
 Index on the columns that are used as part of Join conditions
 Selecting only the required columns

In the case of Cache file based lookup, let us first try to understand the process of how
these files are built and queried.
The key aspects of a Lookup Process are the
 SQL that pulls the data from lookup table
 Cache memory/files that holds the data
 Lookup Conditions that query the cache memory/file
 Output Columns that are returned back from the cache files

Cache file build process:

Based on the product, Informatica or DataStage, when a lookup process is being designed
we define the ‘lookup conditions’ or ‘key fields’ and also a list of fields that need to be
returned by the lookup query. Based on these definitions the required data is pulled from
the lookup table and the cache file is populated with the data. The cache file structure is
optimized for data retrieval, assuming that the cache file will be queried based on a
certain set of columns called ‘lookup conditions’ or ‘key fields’.

In the case of Informatica, the cache file is split into separate index and data files: the
index file has the fields that are part of the ‘lookup condition’ and the data file has the
fields that are to be returned. DataStage cache files are called hash files, which are
optimized based on the ‘key fields’.

Cache file query process:

Irrespective of the product of choice, the following are the steps involved internally
when a lookup process is invoked.

1. Get the Inputs for Lookup Query, Lookup Condition and Columns to be returned
2. Load the cache file to memory
3. Search the record(s) matching the Lookup condition values , in case of
Informatica this search happens on the ‘index file’
4. Pull the required columns matching the condition and return, in case of
Informatica with the result from ‘index file’ search, the data from the ‘data file’ is
located and retrieved
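The index-file/data-file split described above can be modelled as a sorted key list searched with binary search, where the matching position locates the returned columns in a separate data list (a toy model of the mechanism, not Informatica's actual file format):

```python
from bisect import bisect_left

# Build phase: pull the lookup data and sort it by the lookup condition.
records = sorted([(3, "NORTH"), (1, "EAST"), (2, "WEST")])
index_file = [key for key, _ in records]   # keys only, like the index file
data_file = [val for _, val in records]    # returned columns, like the data file

def cached_lookup(key):
    """Binary-search the 'index file'; a hit gives the position used to
    locate and retrieve the row from the 'data file'."""
    pos = bisect_left(index_file, key)
    if pos < len(index_file) and index_file[pos] == key:
        return data_file[pos]
    return None

print(cached_lookup(2), cached_lookup(9))   # WEST None
```

Building from pre-sorted input and probing with sorted source rows are exactly the tuning points listed in the table below the query-process description.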

In the search process, based on the memory availability there could be many disk hits and
page swapping.
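The build-and-query flow above can be illustrated with a toy in-memory cache: a dictionary keyed by the lookup-condition columns standing in for the index file, with the returned columns standing in for the data file. This is a minimal sketch, not product behaviour, and the table and field names (customer_id, name, region) are purely illustrative.

```python
# Toy sketch of a lookup cache: the dict key plays the role of the
# "index file" (lookup condition fields) and the dict value plays the
# role of the "data file" (fields to be returned).

def build_cache(rows, key_fields, return_fields):
    """Build the cache once from the lookup table rows."""
    cache = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)          # index part
        cache[key] = {f: row[f] for f in return_fields}  # data part
    return cache

def lookup(cache, condition_values):
    """Query the cache with the lookup condition values."""
    return cache.get(tuple(condition_values))  # None when no match

# Usage: a customer lookup keyed on customer_id (hypothetical data)
rows = [
    {"customer_id": 1, "name": "Asha", "region": "South"},
    {"customer_id": 2, "name": "Ravi", "region": "North"},
]
cache = build_cache(rows, ["customer_id"], ["name", "region"])
print(lookup(cache, [2]))  # {'name': 'Ravi', 'region': 'North'}
```

The key point the sketch captures is that the cache is built once from the lookup SQL and then queried many times, which is why both the build and the query sides are worth tuning.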

So in terms of performance tuning we could look at two levels:

1. how to optimize the cache file building process
2. how to optimize cache file query process

The following points help improve the performance of a cache file based lookup.

Optimize the cache file building process:
• While retrieving the records to build the cache file, sort the records by the lookup condition columns; this sorting speeds up the index (file) building process, because the search tree of the index file is built faster with fewer nodes
• Select only the required fields, thereby reducing the cache file size
• Reuse the same cache file for multiple requirements with the same or slightly varied lookup conditions

Optimize the cache file query process:
• Sort the records that come from the source by the lookup condition columns before querying the cache file; this reduces page swapping. If subsequent input source records arrive in a continuous sorted order, the hit rate of the required index data in memory is high and disk swapping is reduced
• Having a dedicated separate disk ensures reserved space for the lookup cache files and also improves the response of writing to and reading from the disk
• Avoid querying a recurring lookup condition, by sorting the incoming records by the lookup condition columns
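The last point, avoiding repeated queries for a recurring lookup condition, works because sorted input lets us reuse the previous result whenever the key repeats. A minimal sketch of that idea, with a memo of size one (the lookup table and keys here are made up):

```python
def lookup_sorted(sorted_keys, query_fn):
    """Query once per distinct key when the input arrives sorted by the
    lookup condition; consecutive repeated keys reuse the last result."""
    results = []
    last_key = object()  # sentinel that matches nothing
    last_val = None
    calls = 0
    for key in sorted_keys:
        if key != last_key:
            last_val = query_fn(key)  # hit the cache file only here
            calls += 1
            last_key = key
        results.append(last_val)
    return results, calls

# Five input records but only two distinct keys -> only two queries
lookup_table = {"A": 1, "B": 2}
out, calls = lookup_sorted(["A", "A", "A", "B", "B"], lookup_table.get)
print(out, calls)  # [1, 1, 1, 2, 2] 2
```

The same principle is what keeps the right index pages resident in memory in the real products: sorted input turns random access into sequential access.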

 Data Integration Challenge – Capturing Changes

When we receive data from source systems, the data file will not carry a flag indicating whether a record is new or has changed. We need to build a process to determine the changes and then push them to the target table.

There are two steps to it

1. Pull the incremental data from the source file or table

2. Process the pulled incremental data and determine the impact of it on the target
table as Insert or Update or Delete

Step 1: Pull the incremental data from the source file or table
If the source system has audit columns such as a date, then we can find the new records; otherwise we cannot identify the new records and have to consider the complete data.
For a source system’s file or table that has audit columns, we would follow the steps below:

1. While reading the source records for a day (session), find the maximum value of the date (audit field) and store it in a persistent variable or a temporary table
2. Use this persistent variable value as a filter the next day to pull the incremental data from the source table
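The watermark approach in the two steps above might look like this in outline. This is a sketch only; the audit column name (audit_date) and the in-memory persistence are assumptions, since a real job would store the watermark in a persistent variable or a control table.

```python
import datetime

def pull_incremental(rows, last_max_date):
    """Pull only rows newer than the stored watermark (step 2) and
    compute the new watermark to persist for the next run (step 1)."""
    new_rows = [r for r in rows if r["audit_date"] > last_max_date]
    new_max = max((r["audit_date"] for r in new_rows), default=last_max_date)
    return new_rows, new_max  # persist new_max for tomorrow's run

# Hypothetical source rows with an audit date column
rows = [
    {"id": 1, "audit_date": datetime.date(2009, 1, 1)},
    {"id": 2, "audit_date": datetime.date(2009, 1, 2)},
    {"id": 3, "audit_date": datetime.date(2009, 1, 3)},
]
inc, wm = pull_incremental(rows, datetime.date(2009, 1, 1))
print([r["id"] for r in inc], wm)  # [2, 3] 2009-01-03
```

Note the strict comparison: rows equal to the stored watermark are skipped, so a source that can deliver multiple rows with the same timestamp on day boundaries would need a >= filter plus duplicate handling instead.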

Step 2: Determine the impact of the record on target table as Insert/Update/ Delete
Following are the scenarios that we would face and the suggested approach

1. Data file has only incremental data from Step 1 or the source itself provide only
incremental data
o do a lookup on the target table and determine whether it’s a new record or
an existing record
o if an existing record then compare the required fields to determine whether
it’s an updated record
o have a process to find the aged records in the target table and do a clean up
for ‘deletes’
2. Data file has full complete data because no audit columns are present
o The data is of higher volume
 have a back up of the previously received file
 perform a comparison of the current file and prior file; create a
‘change file’ by determining the inserts, updates and deletes.
Ensure both the ‘current’ and ‘prior’ file are sorted by key fields
 have a process that reads the ‘change file’ and loads the data into
the target table
 based on the ‘change file’ volume, we could decide whether to do a
‘truncate & load’
o The data is of lower volume
 do a lookup on the target table and determine whether it’s a new
record or an existing record
 if an existing record then compare the required fields to determine
whether it’s an updated record
 have a process to find the aged records in the target table and do a
clean up or delete
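For the full-file, higher-volume scenario above, the current-versus-prior comparison that produces the ‘change file’ can be sketched as a keyed diff. The key and field names (id, amt) are illustrative, and for simplicity this sketch holds both snapshots in dictionaries; the text's approach of sorting both files by key and merging them streamwise is what you would do when the files do not fit in memory.

```python
def change_file(prior, current, key):
    """Compare prior and current snapshots keyed by `key` and
    classify each record as an insert, update, or delete."""
    prior_by_key = {r[key]: r for r in prior}
    curr_by_key = {r[key]: r for r in current}
    inserts = [r for k, r in curr_by_key.items() if k not in prior_by_key]
    deletes = [r for k, r in prior_by_key.items() if k not in curr_by_key]
    updates = [r for k, r in curr_by_key.items()
               if k in prior_by_key and r != prior_by_key[k]]
    return inserts, updates, deletes

# Hypothetical snapshots: record 1 disappeared, 2 changed, 3 is new
prior = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
current = [{"id": 2, "amt": 25}, {"id": 3, "amt": 30}]
ins, upd, dels = change_file(prior, current, "id")
print(ins, upd, dels)
```

A downstream process would then read these three sets and apply them to the target table, or trigger a truncate-and-load if the change volume is too high.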

 Data Integration Challenge – Storing Timestamps

Storing timestamps with each record, indicating its arrival or a change in its value, is a must in a data warehouse. We tend to take this for granted, adding timestamp fields to table structures while missing how much storage a timestamp field can occupy: in many databases, such as SQL Server and Oracle, a timestamp takes almost double the storage of an integer data type, and if we have two fields, one as insert timestamp and the other as update timestamp, the required storage doubles again. There are many instances where we could avoid using timestamps, especially when they are used primarily for determining the incremental records or are stored just for audit purposes.

How to effectively manage the data storage and also leverage the benefit of a
timestamp field?

One way of managing the storage of timestamp field is by introducing a process id field
and a process table. Following are the steps involved in applying this method in table
structures and as well as part of the ETL process.
Data Structure

1. Consider a table named PAYMENT with two fields of timestamp data type, INSERT_TIMESTAMP and UPDATE_TIMESTAMP, used for capturing the changes for every record present in the table
2. Create a table named PROCESS_TABLE with columns PROCESS_NAME Char(25), PROCESS_ID Integer and PROCESS_TIMESTAMP Timestamp
3. Now drop the fields of the TIMESTAMP data type from the table PAYMENT
4. Create two fields of integer data type in the table PAYMENT, INSERT_PROCESS_ID and UPDATE_PROCESS_ID
5. These newly created id fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID are logically linked with the table PROCESS_TABLE through its field PROCESS_ID
ETL Process
1. Let us consider an ETL process called ‘payment process’ that loads data into the PAYMENT table
2. Now create a pre-process that runs before the ‘payment process’; in the pre-process, build the logic by which a record is inserted with values like (‘payment process’, sequence number, current timestamp) into the PROCESS_TABLE. The sequence number could be defined as a database sequence function
3. Pass the currently generated PROCESS_ID of PROCESS_TABLE as ‘current_process_id’ from the pre-process step to the ‘payment process’ ETL process
4. In the ‘payment process’, if a record is to be inserted into the PAYMENT table then the current_process_id value is set to both the columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID; if a record is getting updated in the PAYMENT table then the current_process_id value is set only to the column UPDATE_PROCESS_ID
5. Now the timestamp values for the records inserted or updated in the table PAYMENT can be picked from the PROCESS_TABLE by joining its PROCESS_ID with the INSERT_PROCESS_ID and UPDATE_PROCESS_ID columns of the PAYMENT table
 The fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID occupy less space than the timestamp fields
 Both the columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID are index friendly
 It is easier to handle these process id fields when picking the records for determining incremental changes or for any audit reporting.
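The process-id scheme above can be sketched end to end with an in-memory SQLite database. The table and column names follow the text; the database sequence is simulated here with SQLite's autoincrement key, and the sample payment row is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE PROCESS_TABLE (
    PROCESS_ID INTEGER PRIMARY KEY AUTOINCREMENT,
    PROCESS_NAME CHAR(25),
    PROCESS_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP)""")
cur.execute("""CREATE TABLE PAYMENT (
    PAYMENT_ID INTEGER,
    AMOUNT INTEGER,
    INSERT_PROCESS_ID INTEGER,
    UPDATE_PROCESS_ID INTEGER)""")

# Pre-process: register this run and capture its process id
cur.execute("INSERT INTO PROCESS_TABLE (PROCESS_NAME) "
            "VALUES ('payment process')")
current_process_id = cur.lastrowid

# Payment process: an insert stamps both id columns
cur.execute("INSERT INTO PAYMENT VALUES (101, 500, ?, ?)",
            (current_process_id, current_process_id))

# The timestamp is recovered later by joining back to PROCESS_TABLE
cur.execute("""SELECT p.PAYMENT_ID, pt.PROCESS_TIMESTAMP
               FROM PAYMENT p
               JOIN PROCESS_TABLE pt
                 ON p.INSERT_PROCESS_ID = pt.PROCESS_ID""")
print(cur.fetchall())
```

Every row loaded in the same run shares one PROCESS_ID, so one PROCESS_TABLE row replaces millions of per-record timestamps, which is where the storage saving comes from.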

Reusable transformation :: Used to create reusable transformations. Reusable transformations can be used in multiple mappings.
Mapplet :: Used to create mapplets. A mapplet contains a set of transformations and allows you to reuse that transformation logic in multiple mappings.


ETL :: Extract, Transform, Load data. Cleansing data, validating data, applying business logic on the data. Developed by an ETL developer. Tools :: Informatica, Data Stage, Oracle Warehouse etc.
Reporting :: To generate reports. Convert ETL data into cubes, pie charts, graphs etc. Developed by a reporting developer; the end user accesses the data in business terms. Tools :: Business Objects, Hyperion, Cognos etc.



Normalization :: A gradual process of removing redundancies of attributes in a data structure. The condition of the data at the completion of each step is described as a "normal form." Normalization is usually considered a good way to design a database; however, it involves high CPU time to process a task or query. Fact tables are in normalized form.
Denormalization :: The reverse of normalization, wherein the emphasis is on increasing redundancies. This is done for performance enhancement and reduced query time. Dimensional tables are in denormalized form.


Dimension Table :: A table in a data warehouse whose entries describe data in a fact table. Dimension tables contain the data from which dimensions are created. A dimensional table is a collection of hierarchies and categories along which the user can drill down and drill up; it contains only the textual attributes. In a data model schema, more dimension tables are observed.
Fact Table :: A fact table in a data warehouse describes the transaction data. It contains characteristics and key figures. In a data model schema, fewer fact tables are observed.


Normalized schema :: Used for OLTP systems; traditional and old schema; normalized; difficult to understand and navigate; cannot solve extract and complex problems easily; poorly modelled.
Denormalized schema :: Used for OLAP systems; new generation schema; denormalized; easy to understand and navigate; extract and complex problems can be easily solved; very good model.

Oracle VS Teradata:
Both databases have their advantages and disadvantages.
There are a lot of factors to be taken into consideration before deciding which database to use.
If you are talking about OLTP systems, then Oracle is far better than Teradata.
Oracle is more flexible in terms of programming: you can write packages, procedures and functions.
Teradata is useful if you want to generate reports on a very huge database.
But recent versions of Oracle such as 10g are quite good and contain a lot of features to support data warehousing.

 Teradata is an MPP system which can process complex queries very fast. Another advantage is the uniform distribution of data through unique primary indexes without any overhead. Recently we had an evaluation with experts from both Oracle and Teradata for an OLAP system, and they were really impressed with the performance of Teradata over Oracle.
