
What is a factless fact table? Where have you used it in your project?

At the schema level, when a fact table records only the count of occurrences/events, and nothing is aggregated as a measure in the fact table, we call it a factless fact table: it has no measures, only events/occurrences.

For example: the number of policies closed this month.

Generally, we use a factless fact table when we want to capture events at the information level only, not for calculation; it simply records that an event happened over a period.

Fact Table
The centralized table in a star schema is called the FACT table. A fact table typically has
two types of columns: those that contain facts and those that are foreign keys to
dimension tables. The primary key of a fact table is usually a composite key made
up of all of its foreign keys.

In the example figure 1.6, "Sales Dollar" is a fact (measure) and it can be added across several
dimensions. Fact tables store different types of measures: additive, semi additive and non additive.

Measure Types
 Additive - Measures that can be added across all dimensions.
 Non Additive - Measures that cannot be added across any dimension.
 Semi Additive - Measures that can be added across some dimensions but not
others.

A fact table might contain either detail level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often instead called summary tables).

In the real world, it is possible to have a fact table that contains no measures or facts.
These tables are called factless fact tables.

Steps in designing Fact Table

 Identify a business process for analysis(like sales).
 Identify measures or facts (sales dollar).
 Identify dimensions for facts(product dimension, location dimension, time
dimension, organization dimension).
 List the columns that describe each dimension (region name, branch name).

 Determine the lowest level of summary in a fact table (sales dollar).

iGATE Internal

Example of a Fact Table with an Additive Measure in Star Schema: Figure 1.6

In the example figure 1.6, sales fact table is connected to dimensions location, product,
time and organization. Measure "Sales Dollar" in sales fact table can be added across all
dimensions independently or in a combined manner which is explained below.

 Sales Dollar value for a particular product

 Sales Dollar value for a product in a location
 Sales Dollar value for a product in a year within a location
 Sales Dollar value for a product in a year within a location sold or serviced by an
employee
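The aggregations listed above can be sketched with plain SQL against a tiny star schema. This is an illustrative example using SQLite with made-up table and column names; only the pattern (SUM of an additive measure, filtered or grouped by dimension attributes) matches the text.

```python
import sqlite3

# Tiny illustrative star schema: one fact table, three dimensions.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE product_dim  (product_id INTEGER PRIMARY KEY, product TEXT);
CREATE TABLE location_dim (location_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE time_dim     (time_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE sales_fact (
    product_id INTEGER, location_id INTEGER, time_id INTEGER,
    sales_dollar REAL   -- additive measure
);
INSERT INTO product_dim  VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO location_dim VALUES (1, 'Chennai'), (2, 'Mumbai');
INSERT INTO time_dim     VALUES (1, 2004), (2, 2005);
INSERT INTO sales_fact VALUES (1,1,1,100), (1,2,1,200), (1,1,2,50), (2,1,1,300);
""")

# Sales Dollar for a particular product, added across all other dimensions.
by_product = cur.execute(
    "SELECT SUM(sales_dollar) FROM sales_fact WHERE product_id = 1"
).fetchone()[0]

# Sales Dollar for a product in a year within a location (combined dimensions).
by_all = cur.execute("""
    SELECT SUM(f.sales_dollar)
    FROM sales_fact f
    JOIN time_dim t ON t.time_id = f.time_id
    WHERE f.product_id = 1 AND f.location_id = 1 AND t.year = 2004
""").fetchone()[0]
print(by_product, by_all)  # 350.0 100.0
```

Because "Sales Dollar" is additive, the same SUM is valid across any subset of dimensions.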

--

In the Snowflake schema example diagram shown below, there are 4 dimension tables, 4
lookup tables and 1 fact table. The reason is that the hierarchies (category, branch, state, and
month) are broken out of the dimension tables (PRODUCT, ORGANIZATION,
LOCATION, and TIME) respectively and stored separately. In OLAP, this snowflake
schema approach increases the number of joins, which degrades performance when retrieving data.
A few organizations normalize the dimension tables to save space; but since
dimension tables occupy comparatively little space, the snowflake schema approach is often avoided.
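The extra join a snowflake schema introduces can be seen in a short sketch. This is an illustrative SQLite example with assumed names: the product category hierarchy is normalized into its own lookup table, so reaching it from the fact table takes one more join than a star schema would need.

```python
import sqlite3

# Snowflaked product hierarchy: category lives in a separate lookup table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE category_lookup (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE product_dim (
    product_id  INTEGER PRIMARY KEY,
    product     TEXT,
    category_id INTEGER REFERENCES category_lookup(category_id)
);
CREATE TABLE sales_fact (product_id INTEGER, sales_dollar REAL);
INSERT INTO category_lookup VALUES (1, 'Electronics');
INSERT INTO product_dim VALUES (1, 'Widget', 1);
INSERT INTO sales_fact VALUES (1, 100), (1, 200);
""")

# A star schema would join fact -> product_dim only; the snowflake
# requires a second join, fact -> product_dim -> category_lookup.
rows = cur.execute("""
    SELECT c.category, SUM(f.sales_dollar)
    FROM sales_fact f
    JOIN product_dim p     ON p.product_id  = f.product_id
    JOIN category_lookup c ON c.category_id = p.category_id
    GROUP BY c.category
""").fetchall()
print(rows)  # [('Electronics', 300.0)]
```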

Example of Snowflake Schema: Figure 1.7

Star schema

General Information
In general, an organization exists to earn money by selling a product or by providing
services related to that product. An organization may operate from a single place or may have
several branches.

Consider the example of an organization selling products throughout the world: the four
major dimensions are product, location, time and organization. Dimension tables are
explained in detail under the section Dimensions. With this example, we will provide a
detailed explanation of the STAR SCHEMA.

What is Star Schema?

Star schema is a relational database schema for representing multidimensional data. It is
the simplest form of data warehouse schema and contains one or more dimension and
fact tables. It is called a star schema because the entity-relationship diagram between
dimensions and fact tables resembles a star, with one fact table connected to multiple
dimensions. The center of the star schema consists of a large fact table that points
towards the dimension tables. The advantages of a star schema are easy slicing of data,
improved performance and easy understanding of data.

Steps in designing Star Schema

 Identify a business process for analysis(like sales).
 Identify measures or facts (sales dollar).
 Identify dimensions for facts(product dimension, location dimension, time
dimension, organization dimension).
 List the columns that describe each dimension (region name, branch name).
 Determine the lowest level of summary in a fact table(sales dollar).

Important aspects of Star Schema & Snow Flake Schema

 In a star schema every dimension will have a primary key.
 In a star schema, a dimension table will not have any parent table.
 In a snowflake schema, a dimension table will have one or more parent
tables.
 In a star schema, hierarchies for the dimensions are stored in the dimension table
itself.
 In a snowflake schema, hierarchies are broken into separate tables. These
hierarchies help to drill down the data from the topmost level to the lowermost
level.

Glossary:

Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can
be used to define data aggregation; for example, in a time dimension, a hierarchy might
be used to aggregate data from the Month level to the Quarter level, from the Quarter
level to the Year level. A hierarchy can also be used to define a navigational drill path,
regardless of whether the levels in the hierarchy represent aggregated totals or not.

Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that
represents data at the Month, Quarter, and Year levels.

Fact Table
A table in a star schema that contains facts and is connected to dimensions. A fact table
typically has two types of columns: those that contain facts and those that are foreign
keys to dimension tables. The primary key of a fact table is usually a composite key that
is made up of all of its foreign keys.

A fact table might contain either detail level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often instead called summary tables). A fact table
usually contains facts with the same level of aggregation.

Example of Star Schema: Figure 1.6

In the example figure 1.6, sales fact table is connected to dimensions location, product,
time and organization. It shows that data can be sliced across all dimensions and again it
is possible for the data to be aggregated across multiple dimensions. "Sales Dollar" in
sales fact table can be calculated across all dimensions independently or in a combined
manner which is explained below.

 Sales Dollar value for a particular product

 Sales Dollar value for a product in a location
 Sales Dollar value for a product in a year within a location
 Sales Dollar value for a product in a year within a location sold or serviced by an
employee

--
Data Warehouse & Data Mart
A data warehouse is a relational/multidimensional database that is designed for query and
analysis rather than for transaction processing. A data warehouse usually contains historical
data derived from transaction data. It separates the analysis workload from the
transaction workload.

In addition to a relational/multidimensional database, a data warehouse environment often
consists of an ETL solution, an OLAP engine, client analysis tools, and other applications
that manage the process of gathering data and delivering it to business users.

There are three types of data warehouses:

1. Enterprise Data Warehouse - An enterprise data warehouse provides a central
database for decision support throughout the enterprise.
2. ODS (Operational Data Store) - This has a broad enterprise-wide scope, but unlike
the real enterprise data warehouse, data is refreshed in near real time and used for routine
business activity.
3. Data Mart - A data mart is a subset of the data warehouse that supports a particular region,
business unit or business function.

Data warehouses and data marts are built on dimensional data modeling where fact tables
are connected with dimension tables. This is most useful for users to access data since a
database can be visualized as a cube of several dimensions. A data warehouse provides an
opportunity for slicing and dicing that cube along each of its dimensions.

Data Mart: A data mart is a subset of data warehouse that is designed for a particular
line of business, such as sales, marketing, or finance. In a dependent data mart, data can
be derived from an enterprise-wide data warehouse. In an independent data mart, data can
be collected directly from sources.

--

Slowly Changing Dimensions

Dimensions that change over time are called slowly changing dimensions. For instance,
a product's price changes over time; people change their names for various reasons; country
and state names may change over time. These are a few examples of slowly changing
dimensions, since changes happen to them over a period of time.

Slowly changing dimensions are commonly categorized into three types, namely Type 1,
Type 2 and Type 3. The following section deals with how to capture and handle these
changes over time.

The "Product" table mentioned below contains a product named Product1, with Product
ID as the primary key. In the year 2004, the price of Product1 was $150, and over
time it changed from $150 to $350. With this information, let us explain
the three types of slowly changing dimensions.

Product ID(PK) | Year | Product Name | Product Price
1 | 2004 | Product1 | $150

Type 1: Overwriting the old values.

In the year 2005, if the price of the product changes to $250, then the old values of the
columns "Year" and "Product Price" are updated and replaced with the new
values. With Type 1, there is no way to find the 2004 value for Product1,
since the table now contains only the new price and year information.

Product

Product ID(PK) | Year | Product Name | Product Price
1 | 2005 | Product1 | $250
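A Type 1 change is a single in-place UPDATE. The sketch below mirrors the sample table using SQLite; the table and column names follow the example above but the code itself is illustrative.

```python
import sqlite3

# SCD Type 1 sketch: overwrite the old row in place, losing history.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE product (
    product_id INTEGER PRIMARY KEY, year INTEGER,
    product_name TEXT, product_price INTEGER)""")
cur.execute("INSERT INTO product VALUES (1, 2004, 'Product1', 150)")

# 2005: price changes to $250 -- Type 1 simply updates the existing row.
cur.execute("""UPDATE product SET year = 2005, product_price = 250
               WHERE product_id = 1""")
row = cur.execute("SELECT * FROM product").fetchone()
print(row)  # (1, 2005, 'Product1', 250) -- the 2004 price is gone
```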

Type 2: Creating an additional record.

With Type 2, the old values are not replaced; a new row containing the new
values is added to the product table. So at any point in time, the difference between
the old values and the new values can be retrieved and easily compared. This is
very useful for reporting purposes.

Product

Product ID(PK) | Year | Product Name | Product Price
1 | 2004 | Product1 | $150
1 | 2005 | Product1 | $250

The problem with the above data structure is that "Product ID" cannot store
duplicate values, since "Product ID" is the primary key. Also, the current
structure doesn't clearly specify the effective date and expiry date of each Product1 row,
i.e. when the price change happened. So it is better to change the current data
structure to overcome the primary key violation.

Product

Product ID(PK) | Effective DateTime(PK) | Year | Product Name | Product Price | Expiry DateTime
1 | 01-01-2004 12.00AM | 2004 | Product1 | $150 | 12-31-2004 11.59PM
1 | 01-01-2005 12.00AM | 2005 | Product1 | $250 |

In the changed Product table structure, "Product ID" and "Effective DateTime" together form
a composite primary key, so there is no violation of the primary key constraint.
The addition of the new columns "Effective DateTime" and "Expiry DateTime" provides
information about each row's effective and expiry dates, which adds clarity
and enhances the scope of this table. The Type 2 approach may need additional space in the
database, since an additional row has to be stored for every changed record. But since
dimensions are not that big in the real world, the additional space is negligible.
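The Type 2 maintenance step is: expire the current row, then insert a new one. The SQLite sketch below follows the changed structure above (composite key of Product ID and Effective DateTime); the date literals are illustrative.

```python
import sqlite3

# SCD Type 2 sketch: keep full history via effective/expiry datetimes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE product (
    product_id INTEGER, effective_datetime TEXT,
    year INTEGER, product_name TEXT, product_price INTEGER,
    expiry_datetime TEXT,
    PRIMARY KEY (product_id, effective_datetime))""")
cur.execute(
    "INSERT INTO product VALUES (1, '2004-01-01 00:00', 2004, 'Product1', 150, NULL)")

# Price change on 2005-01-01: expire the current row, then add a new one.
cur.execute("""UPDATE product SET expiry_datetime = '2004-12-31 23:59'
               WHERE product_id = 1 AND expiry_datetime IS NULL""")
cur.execute(
    "INSERT INTO product VALUES (1, '2005-01-01 00:00', 2005, 'Product1', 250, NULL)")

# Full history is preserved; the current row is the one with no expiry.
history = cur.execute(
    "SELECT year, product_price FROM product ORDER BY effective_datetime").fetchall()
print(history)  # [(2004, 150), (2005, 250)]
```

Filtering on `expiry_datetime IS NULL` always returns the current version of each product.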

Type 3: Creating new fields.

With Type 3, only the latest change to the values can be seen. The example below
illustrates how to add new columns to keep track of changes. From it, we
can see both the current price and the previous price of the product, Product1.

Product

Product ID(PK) | Current Year | Product Name | Current Product Price | Old Product Price | Old Year
1 | 2005 | Product1 | $250 | $150 | 2004

The problem with the Type 3 approach is that, over the years, if the product price
changes continuously, the complete history is not stored; only the latest change is
kept. For example, if Product1's price changes to $350 in 2006, we can no longer
see the 2004 price, since the old-value columns are overwritten with the 2005
product information.

Product

Product ID(PK) | Year | Product Name | Product Price | Old Product Price | Old Year
1 | 2006 | Product1 | $350 | $250 | 2005
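In code, a Type 3 change shifts the current values into the "old" columns and then overwrites them, which is exactly how the 2004 history is lost. An illustrative SQLite sketch following the sample columns:

```python
import sqlite3

# SCD Type 3 sketch: extra "old" columns hold only the previous value,
# so each new change overwrites the prior history.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE product (
    product_id INTEGER PRIMARY KEY, year INTEGER, product_name TEXT,
    product_price INTEGER, old_product_price INTEGER, old_year INTEGER)""")
cur.execute("INSERT INTO product VALUES (1, 2005, 'Product1', 250, 150, 2004)")

# 2006 change to $350: current values shift into the "old" columns
# (all right-hand sides see the pre-update row), and $150/2004 is lost.
cur.execute("""UPDATE product
               SET old_product_price = product_price, old_year = year,
                   product_price = 350, year = 2006
               WHERE product_id = 1""")
row = cur.execute("SELECT * FROM product").fetchone()
print(row)  # (1, 2006, 'Product1', 350, 250, 2005)
```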

--

Time Dimension
In a relational data model, for normalization purposes, year lookup, quarter lookup,
month lookup, and week lookup are not merged into a single table. In dimensional data
modeling (star schema), these tables would be merged into a single table called TIME
DIMENSION for performance and for slicing data.

This dimension helps to find sales on a daily, weekly, monthly and yearly basis.
We can do trend analysis by comparing this year's sales with the previous year's, or this
week's sales with the previous week's.
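Rows for such a merged time dimension are usually generated programmatically, one row per calendar date. The sketch below is illustrative; the column choices loosely mirror the sample tables that follow, but the function and field names are assumptions.

```python
from datetime import date, timedelta

# Generate time-dimension rows, one per calendar day, with the
# attributes a merged TIME DIMENSION typically carries.
def time_dimension_rows(start, end):
    rows, day = [], start
    while day <= end:
        rows.append({
            "cal_date":    day.isoformat(),
            "year":        day.year,
            "quarter_no":  (day.month - 1) // 3 + 1,
            "month_no":    day.month,
            "month_name":  day.strftime("%B"),
            "day_of_month": day.day,
            "day_of_week": day.isoweekday(),   # 1 = Monday ... 7 = Sunday
        })
        day += timedelta(days=1)
    return rows

rows = time_dimension_rows(date(2004, 1, 1), date(2004, 1, 3))
print(rows[0]["month_name"], rows[0]["quarter_no"])  # January 1
```

Loading these rows once into the dimension table lets every query slice by year, quarter, month or weekday without date arithmetic at query time.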

Year Lookup

Year Id Year Number DateTimeStamp

1 2004 1/1/2005 11:23:31 AM
2 2005 1/1/2005 11:23:31 AM

Quarter Lookup

Quarter Number Quarter Name DateTimeStamp

1 Q1 1/1/2005 11:23:31 AM
2 Q2 1/1/2005 11:23:31 AM
3 Q3 1/1/2005 11:23:31 AM
4 Q4 1/1/2005 11:23:31 AM

Month Lookup

Month Number Month Name DateTimeStamp

1 January 1/1/2005 11:23:31 AM
2 February 1/1/2005 11:23:31 AM
3 March 1/1/2005 11:23:31 AM
4 April 1/1/2005 11:23:31 AM

5 May 1/1/2005 11:23:31 AM
6 June 1/1/2005 11:23:31 AM
7 July 1/1/2005 11:23:31 AM
8 August 1/1/2005 11:23:31 AM
9 September 1/1/2005 11:23:31 AM
10 October 1/1/2005 11:23:31 AM
11 November 1/1/2005 11:23:31 AM
12 December 1/1/2005 11:23:31 AM

Week Lookup

Week Number Day of Week DateTimeStamp

1 Sunday 1/1/2005 11:23:31 AM
1 Monday 1/1/2005 11:23:31 AM
1 Tuesday 1/1/2005 11:23:31 AM
1 Wednesday 1/1/2005 11:23:31 AM
1 Thursday 1/1/2005 11:23:31 AM
1 Friday 1/1/2005 11:23:31 AM
1 Saturday 1/1/2005 11:23:31 AM
2 Sunday 1/1/2005 11:23:31 AM
2 Monday 1/1/2005 11:23:31 AM
2 Tuesday 1/1/2005 11:23:31 AM
2 Wednesday 1/1/2005 11:23:31 AM
2 Thursday 1/1/2005 11:23:31 AM
2 Friday 1/1/2005 11:23:31 AM
2 Saturday 1/1/2005 11:23:31 AM

Time Dimension

Time Dim Id | Year | Day Of Year | Quarter No | Month No | Month Name | Week No | Day of Month | Day of Week | Cal Date | DateTimeStamp
1 | 2004 | 1 | Q1 | 1 | January | 1 | 1 | 5 | 1/1/2004 | 1/1/2005 11:23:31 AM
2 | 2004 | 32 | Q1 | 2 | February | 1 | 5 | 1 | 2/1/2004 | 1/1/2005 11:23:31 AM
3 | 2005 | 1 | Q1 | 1 | January | 1 | 1 | 7 | 1/1/2005 | 1/1/2005 11:23:31 AM
4 | 2005 | 32 | Q1 | 2 | February | 1 | 5 | 3 | 2/1/2005 | 1/1/2005 11:23:31 AM

Organization Dimension
In a relational data model, for normalization purposes, corporate office lookup, region
lookup, branch lookup, and employee lookups are not merged as a single table. In a
dimensional data modeling(star schema), these tables would be merged as a single table
called ORGANIZATION DIMENSION for performance and slicing data.

This dimension helps us to find the products sold or serviced within the organization by
the employees. In any industry, we can calculate the sales on region basis, branch basis
and employee basis. Based on the performance, an organization can provide incentives to
employees and subsidies to the branches to increase further sales.

Example of Organization Dimension: Figure 1.10

Product Dimension
In a relational data model, for normalization purposes, product category lookup, product
sub-category lookup, product lookup, and product feature lookups are not merged
into a single table. In dimensional data modeling (star schema), these tables would be
merged into a single table called PRODUCT DIMENSION for performance and data
slicing requirements.

Example of Product Dimension: Figure 1.9

Dimension Table
A dimension table is one that describes the business entities of an enterprise, represented as
hierarchical, categorical information such as time, departments, locations, and products.
Dimension tables are sometimes called lookup or reference tables.

Location Dimension
In relational data modeling, for normalization purposes, country lookup, state lookup,
county lookup, and city lookups are not merged into a single table. In dimensional data
modeling (star schema), these tables would be merged into a single table called LOCATION
DIMENSION for performance and data slicing requirements. This location dimension
helps to compare the sales in one region with another. We may see good sales
profit in one region and a loss in another. If there is a loss, the reasons may be a
new competitor in that area, failure of our marketing strategy, etc.

Example of Location Dimension: Figure 1.8

Relational data modeling is used in OLTP systems, which are transaction oriented, and
dimensional data modeling is used in OLAP systems, which are analysis oriented. In a
data warehouse environment, the staging area is designed on OLTP concepts, since data has
to be normalized, cleansed and profiled before being loaded into a data warehouse or data mart.
In an OLTP environment, lookups are stored as independent detail tables, whereas these
independent tables are merged into a single dimension in an OLAP environment like a data
warehouse.

Relational vs Dimensional
Relational Data Modeling | Dimensional Data Modeling
Data is stored in RDBMS | Data is stored in RDBMS or multidimensional databases
Tables are units of storage | Cubes are units of storage
Data is normalized and used for OLTP. Optimized for OLTP processing | Data is denormalized and used in data warehouses and data marts. Optimized for OLAP
Several tables and chains of relationships among them | Few tables; fact tables are connected to dimension tables
Volatile and time variant | Non volatile and time invariant
SQL is used to manipulate data | MDX is used to manipulate data
Detailed level of transactional data | Summary of bulky transactional data (aggregates and measures) used in business decisions
Normal reports | User friendly, interactive, drag and drop multidimensional OLAP reports

Dimensional Data Modeling

Dimensional data modeling comprises one or more dimension tables and fact tables.
Good examples of dimensions are location, product, time, promotion, organization, etc.
Dimension tables store records related to that particular dimension, and no
facts (measures) are stored in these tables.

For example, the product dimension table will store information about products (product
category, product sub-category, product and product features) and the location dimension
table will store information about locations (country, state, county, city, zip). A
fact (measure) table contains measures (sales gross value, total units sold) and dimension
columns. These dimension columns are actually foreign keys from the respective
dimension tables.

Example of Dimensional Data Model: Figure 1.6

In the example figure 1.6, sales fact table is connected to dimensions location, product,
time and organization. It shows that data can be sliced across all dimensions and again it
is possible for the data to be aggregated across multiple dimensions. "Sales Dollar" in
sales fact table can be calculated across all dimensions independently or in a combined
manner which is explained below.

 Sales Dollar value for a particular product

 Sales Dollar value for a product in a location
 Sales Dollar value for a product in a year within a location
 Sales Dollar value for a product in a year within a location sold or serviced by an
employee

In dimensional data modeling, hierarchies for the dimensions are stored in the
dimension table itself. For example, the location dimension will have all of its
hierarchy levels from country, state and county down to city. There is no need for
individual hierarchical lookups like country lookup, state lookup, county lookup and city
lookup to be shown in the model.

Uses of Dimensional Data Modeling

Dimensional data modeling is used for calculating summarized data. For example, sales
data could be collected on a daily basis and then aggregated to the week level, the
week data could be aggregated to the month level, and so on. The data can then be
referred to as aggregate data. Aggregation is synonymous with summarization, and
aggregate data is synonymous with summary data. The performance of dimensional data
modeling can be significantly increased when materialized views are used. A materialized
view is a pre-computed table comprising aggregated or joined data from fact and possibly
dimension tables, also known as a summary or aggregate table.
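The rollup a materialized view maintains can be sketched as a plain summary table. SQLite has no materialized views, so this illustrative example simply pre-computes a monthly summary from daily rows; all names are assumptions.

```python
import sqlite3

# Summary (aggregate) table sketch: daily sales pre-aggregated to month
# level, the kind of rollup a materialized view would maintain for us.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE daily_sales (sale_date TEXT, sales_dollar REAL);
INSERT INTO daily_sales VALUES
  ('2004-01-05', 100), ('2004-01-20', 50), ('2004-02-01', 200);
-- Pre-compute the month-level aggregate once, query it many times.
CREATE TABLE monthly_sales_summary AS
  SELECT substr(sale_date, 1, 7) AS month, SUM(sales_dollar) AS sales_dollar
  FROM daily_sales GROUP BY month;
""")
rows = cur.execute(
    "SELECT month, sales_dollar FROM monthly_sales_summary ORDER BY month"
).fetchall()
print(rows)  # [('2004-01', 150.0), ('2004-02', 200.0)]
```

Queries at the month grain then scan the small summary table instead of re-aggregating the detail rows.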

--
Logical vs Physical Data Modeling
Logical Data Model | Physical Data Model
Represents business information and defines business rules | Represents the physical implementation of the model in a database
Entity | Table
Attribute | Column
Primary Key | Primary Key Constraint
Alternate Key | Unique Constraint or Unique Index
Inversion Key Entry | Non Unique Index
Rule | Check Constraint, Default Value
Relationship | Foreign Key
Definition | Comment

Data Modeler Role

» Interact with Business Analysts to get the functional requirements.
» Interact with end users and find out the reporting needs.
» Conduct interviews and brainstorming discussions with the project team to gather
additional requirements.
» Gather accurate data through data analysis and functional analysis.

Development of data model:

» Create standard abbreviation document for logical, physical and dimensional data
models.
» Create logical, physical and dimensional data models(data warehouse data modelling).
» Document logical, physical and dimensional data models (data warehouse data
modelling).

Reports:
» Generate reports from data model.

Review:
» Review the data model with functional and technical team.

Creation of database:
» Create SQL code from the data model and coordinate with DBAs to create the database.
» Check that data models and databases are in sync.

Support & Maintenance:
» Assist developers, ETL, BI team and end users to understand the data model.
» Maintain change log for each data model.

Steps to create a Data Model

These are general guidelines for creating a standard data model; in practice, a data
model may not be created in exactly the sequence shown below. Based on the
enterprise's requirements, some of the steps may be excluded, or others added.

Sometimes, data modeler may be asked to develop a data model based on the existing
database. In that situation, the data modeler has to reverse engineer the database and
create a data model.

1» Create High Level Conceptual Data Model.
2» Create Logical Data Model.
3» Select the target DBMS where the data modeling tool creates the physical schema.
4» Create a standard abbreviation document according to business standards.
5» Create domains.
6» Create entities and add definitions.
7» Create attributes and add definitions.
8» Based on the analysis, create surrogate keys, super types and sub types.
9» Assign datatypes to attributes. If a domain is already present, attach the attribute
to the domain.
10» Create primary or unique keys for attributes.
11» Create check constraints or defaults for attributes.
12» Create unique or bitmap indexes for attributes.
13» Create foreign key relationships between entities.
14» Create Physical Data Model.
15» Add database properties to the physical data model.
16» Create SQL scripts from the Physical Data Model and forward them to the DBA.
17» Maintain the Logical & Physical Data Models.
18» For each release (version of the data model), compare the present version with
the previous version. Similarly, compare the data model with the database to find
the differences.
19» Create a change log document for the differences between the current and
previous versions of the data model.

--
http://www.learndatamodeling.com/

Informatica:

Informatica is a widely used ETL tool for extracting source data and loading it into
the target after applying the required transformations. In the following section, we will
explain the usage of Informatica in a data warehouse environment with an example.
We are not going into the details of data warehouse design here; this tutorial simply
provides an overview of how INFORMATICA can be used as an ETL tool.

Note: The exchanges/companies mentioned here are for illustrative purposes only.

Bombay Stock Exchange (BSE) and National Stock Exchange (NSE) are two major stock
exchanges in India in which the shares of ABC Corporation and XYZ Private Limited are
traded Monday through Friday, except holidays. Assume that a software
company, "KLXY Limited", has taken on the project to integrate the data between the two
exchanges, BSE and NSE.

ETL Process - Roles & Responsibilities:

In order to complete this task of integrating the raw data received from NSE & BSE,
KLXY Limited allots responsibilities to data modelers, DBAs and ETL developers.
During this entire ETL process many IT professionals may be involved, but we
highlight the roles of only these three for easy understanding and better
clarity.

 Data Modelers analyze the data from these two sources(Record Layout 1 &
Record Layout 2), design Data Models, and then generate scripts to create
necessary tables and the corresponding records.
 DBAs create the databases and tables based on the scripts generated by the data
modelers.
 ETL developers map the extracted data from source systems and load it to target
systems after applying the required transformations.

Overall Process:

The complete process of data transformation from external sources to our target data
warehouse is explained using the following sections. Each section will be explained in
detail.

 Data from the external sources (source1 - .CSV (comma separated) file, source2 -
Oracle table)

 Source(s) table layout details
 Look up table details
 Target table layout details
 Defining Source table and target table in Informatica
 Implementing extraction mapping in Informatica (Mapping Designer)
 Workflow creation in Informatica (Workflow Manager)
 Verifying records through Informatica (Workflow Monitor)

 http://learnbi.com/informatica2.htm

ETL Testing:
Testing is an important phase in the project lifecycle. A structured, well-defined testing
methodology involving comprehensive unit testing and system testing not only ensures a
smooth transition to the production environment but also a system without defects.

The testing phase can be broadly classified into the following categories:

 Integration Testing
 System Testing
 Regression Testing
 Performance Testing
 Operational Qualification

Test Strategy:

A test strategy is an outline that describes the test plan. It is created to inform the project
team of the objective and high-level scope of the testing process. This includes the testing
objective, methods of testing, resources, estimated timelines, environment, etc.

The test strategy is created based on the high-level design document. A test strategy
needs to be created for each testing component; based on this strategy, the testing process
is detailed out in the test plan.

Test Planning:

Test planning is key to successfully implementing the testing of a system. The
deliverable is the actual "Test Plan". A software project test plan is a document that
describes the purpose, system overview, approach to testing, test planning, defect
tracking, test environment, test prerequisites and references.

A key prerequisite for preparing a successful test plan is having approved (functional and
non-functional) requirements. Without frozen requirements and specifications, the test
plan will lack validation for the project's testing efforts.

The process of preparing a test plan is a useful way to determine how testing of a
particular system can be carried out within the given timeline, provided the test plan
is thorough enough.

The test plan outlines and defines the strategy and approach taken to perform end-to-end
testing on a given project. The test plan describes the tasks, schedules, resources, and
tools for integrating and testing the software application. It is intended for use by project
personnel in understanding and carrying out prescribed test activities and in managing
these activities through successful completion.

 To define a testing approach, scope, out of scope and methodology that
encompasses integration testing, system testing, performance testing and
regression testing in one plan for the business and project team.

 To verify that the functional and non-functional requirements are met.
 To coordinate resources, environments into an integrated schedule.
 To provide a plan that outlines the contents of detailed test cases scenarios for
each of the four phases of testing.
 To determine a process for communicating issues resulting from the test phase.

 An introduction that includes a purpose, definitions & acronyms, assumptions &
dependencies, in scope, out of scope, roles & responsibilities and contacts. This
information is obtained from the requirements specification.
 System Overview will explain about the background and the system description.
 A test approach for all testing levels includes test Objectives for each level, Test
responsibilities, Levels of testing, various testing, Test coverage, Testing tools,
Test data and Test stop criteria.
 Test planning specifies Test schedule, Documentation deliverables, Test
communication and Critical and High risk functions.

The test plan, thus, summarizes and consolidates information that is necessary for the
efficient and effective conduct of testing. Design Specification, Requirement Document
and Project plan supporting the finalization of testing are located in separate documents
and are referenced in the test plan.

Test Estimation:

Effective software project estimation is one of the most challenging and important
activities in testing. However, it is an essential part of proper project
planning and control. Under-estimating a project leads to under-staffing it, running the
risk of low quality deliverables, and resulting in loss of credibility as deadlines are
missed. So it is imperative to do a proper estimation during the planning stage.

 Estimating the size of the system to be tested.

 Estimating the effort in person-hours (1 person-day = the number of working hours
in a day, i.e. 8 hours).

After receiving the requirements, the tester analyses the mappings that are
created/modified and studies the changes made. Based on the impact analysis, the
tester determines how much time is needed for the whole testing process,
which consists of mapping analysis, test case preparation, test execution, defect reporting,
regression testing and final documentation. This calculated time is entered in the
estimation time sheet.

Integration Testing:

Integration testing verifies the required functionality of a mapping (single ETL /
single session) in the environment specific to the testing team (test environment). This
testing should ensure that the correct number of rows (validated records) is transferred to
the target from the source.
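The row-count check described above can be sketched in a few lines. This is an illustrative example: the table names and the validity rule are assumptions, and a real ETL test would run the same comparison against the actual source and target databases.

```python
import sqlite3

# Minimal sketch of ETL row-count reconciliation: the count of valid
# source rows must equal the count of rows loaded into the target.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE src_trades (trade_id INTEGER, qty INTEGER);
CREATE TABLE tgt_trades (trade_id INTEGER, qty INTEGER);
INSERT INTO src_trades VALUES (1, 10), (2, -5), (3, 7);  -- row 2 is invalid
INSERT INTO tgt_trades VALUES (1, 10), (3, 7);           -- ETL rejected row 2
""")
src_valid = cur.execute(
    "SELECT COUNT(*) FROM src_trades WHERE qty > 0").fetchone()[0]
tgt_loaded = cur.execute("SELECT COUNT(*) FROM tgt_trades").fetchone()[0]
print(src_valid == tgt_loaded)  # True
```

The same pattern extends to checksums or column-level SUM comparisons when a pure row count is not sufficient.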

Integration testing is also used to verify initialization and incremental mapping
(session) functionality, along with the pre-session and post-session scripts for
dependencies and the usage/consumption of relative indicator files to control
dependencies across multiple work streams (modules). During integration testing,
error-handling processes and the proper functionality of mapping variables are also
verified.

Prerequisites:

 Implementation Checklist for move from development to test.
 All unit testing completed and summarized.
 Data available in the test environment.
 Migration to the test environment from the development environment.

System Testing:

This environment integrates the various components and runs them as a single unit. This
should include sequences of events that enable the different components to run as a
single unit and validate the data flow.

 Verify all the required functionality in the validation environment.

 Run end-to-end system test.
 Record initialization and incremental load statistics.
 Perform performance testing and mitigate issues across the entire system.
 Verify error-handling processes are working as designed.

Prerequisites:

 Finalized Implementation Checklist.

 All integration testing complete.
 Migration from the Test environment to the validation environment, as applicable.
 Production configuration and data available.

Regression Testing:

Regression Testing is performed after the developer fixes a reported defect. This testing
verifies that the identified defects are fixed and that fixing these defects does not
introduce any other new defects in the system / application. This testing will also be
performed when a Change Request is implemented on an existing production system.
After the Change Request (CR) is approved, the testing team takes the impact analysis as
input for designing the test cases for the CR.

Prerequisites:

 Finalized Implementation Checklist.

 All integration testing complete.

Performance Testing:

Performance testing determines the system performance under a particular workload /

Service Level Agreement (SLA). It ensures the system meets the performance criteria and
can detect bottlenecks.

PUSH DOWN Optimization

I'm going to work on an ETL project where the source is Oracle, the target is Teradata,
and the ETL tool to be used is Informatica.

There are two levels - one is the load into staging (staging is also Teradata) and the
second is the load from staging to the target. I query the Oracle source tables and load
into the staging area.
Which of these approaches is good:
1. Create a one-to-one mapping to do this, or
2. Use any of the load utilities offered by Teradata - like MLoad, TPump, etc. - through
Informatica.

Please tell me the Pros and Cons of these two approaches.

I've been told to use the first method (one-to-one mappings).

Please advise on the second level as well (from staging to target) - whether to use one-
to-one mappings or the Teradata utilities there too.
I'm really afraid because there is an automatic primary index getting created in Teradata
tables, and this leads to rejection of records in some cases.

Ans: For phase 1 you can use an Informatica mapping with a Teradata loader connection to load
data into the staging table. It's preferable to use Informatica with loader connections, which
will be faster for development, and it will only be a one-to-one mapping.

For phase 2, stage to target, you must go for Informatica and use joins in case there
are any. You'll have to write SQL override queries which would replace the SCD
transformations in Informatica. The query must distinguish the new records from update
records using a flag and do the insert or update operation according to the flag value.
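The flag-based insert/update logic described here can be sketched as follows (the column names cust_id and city are hypothetical):

```python
# Sketch of the flag-based SCD logic: each staging row is compared against
# the target by its natural key, and tagged INSERT when the key is new, or
# UPDATE when the key exists but a tracked attribute changed.
def tag_rows(staging_rows, target_rows):
    target_by_key = {row["cust_id"]: row for row in target_rows}
    tagged = []
    for row in staging_rows:
        existing = target_by_key.get(row["cust_id"])
        if existing is None:
            tagged.append({**row, "flag": "INSERT"})
        elif existing["city"] != row["city"]:
            tagged.append({**row, "flag": "UPDATE"})
        # unchanged rows get no flag and are skipped
    return tagged

staging = [{"cust_id": 1, "city": "Pune"}, {"cust_id": 2, "city": "Chennai"}]
target = [{"cust_id": 1, "city": "Mumbai"}]
print([r["flag"] for r in tag_rows(staging, target)])  # ['UPDATE', 'INSERT']
```

In the SQL-override approach described in the answer, the dictionary lookup above corresponds to a join on the natural key, and the flag column drives the downstream insert/update.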

There is an option in informatica (pushdown optimization), which must be used when

you go for complex override queries.

Let me know if you have more doubts.

As for the primary index, it is the lifeline of a Teradata table. You need to provide the
name of the primary index while creating a table, else it automatically takes the first
column as the primary index. Data gets loaded into TD using the PI, and
retrieval is done using the PI as well.
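The role of the primary index can be illustrated with a toy model (Teradata's actual row-hash algorithm is proprietary; Python's built-in hash stands in for it, and NUM_AMPS is an arbitrary choice):

```python
# Simplified sketch of why the PI matters in Teradata: the PI value is
# hashed to pick the AMP (unit of parallelism) that stores the row, and
# the same hash locates the row again on retrieval.
NUM_AMPS = 4

def amp_for(pi_value, num_amps=NUM_AMPS):
    return hash(pi_value) % num_amps

def store(table, pi_value, row, num_amps=NUM_AMPS):
    table.setdefault(amp_for(pi_value, num_amps), {})[pi_value] = row

def fetch(table, pi_value, num_amps=NUM_AMPS):
    # Single-AMP retrieval: only the owning AMP is touched.
    return table.get(amp_for(pi_value, num_amps), {}).get(pi_value)

table = {}
store(table, pi_value=12345, row={"name": "policy A"})
print(fetch(table, 12345))  # {'name': 'policy A'}
```

This is also why a poorly chosen (or accidentally defaulted) PI can cause skewed distribution or duplicate-row rejections, as mentioned above.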

Phase 1:

To make things clearer - you have suggested that I use a loader connection instead of a
relational connection.

As indicated by you, will the loader connection be faster than the relational
connection?

I hope this would help me in the one-time load / history load, as the historical data is
provided in the form of Oracle dumps.

But in the incremental loads, the source data is being pulled from the Oracle database's
views (obviously there will be performance issues - but the client's requirement is this
way, can't help it). Do you suggest that the loader connection will be faster than the
relational connection here as well?

Phase 2:
I'm not very clear on the explanation given by you here.

As highlighted by you, I'll be joining staging tables to load into targets.

I'll be using a lookup transformation to check whether the record is present or not.

Accordingly, I'll be inserting into or updating the target tables (by tagging them accordingly)
using an Update Strategy transformation.

Of course, I'll be maintaining history by giving an "end-date" to the already existing
record when an update is done in the database, and inserting a new one with sysdate as
the "start-date".

Basically I'm using the lookup and Update Strategy transformations to achieve my SCD
concept here.
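The start-date/end-date bookkeeping described above (a Type 2 SCD) can be sketched like this (dates are plain strings, and the open-ended high date 9999-12-31 is a common convention, not something mandated by the tool):

```python
# Sketch of Type-2 history maintenance: on a change, the current version
# is end-dated and a new version is inserted with the load date as its
# start date. Unchanged rows are left alone.
OPEN_END = "9999-12-31"

def apply_change(history, key, new_attrs, load_date):
    for row in history:
        if row["key"] == key and row["end_date"] == OPEN_END:
            if row["attrs"] == new_attrs:
                return history  # no change, nothing to do
            row["end_date"] = load_date  # close the current version
            break
    history.append({"key": key, "attrs": new_attrs,
                    "start_date": load_date, "end_date": OPEN_END})
    return history

hist = []
apply_change(hist, "C1", {"city": "Pune"}, "2010-01-01")
apply_change(hist, "C1", {"city": "Mumbai"}, "2010-06-01")
print(len(hist), hist[0]["end_date"])  # 2 versions; first end-dated 2010-06-01
```

In the mapping, the loop above corresponds to the lookup on the current (open) version, and the two branches correspond to the Update Strategy's update and insert paths.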

Is this the concept you have explained here?

New Question:

Please explain to me what this "pushdown optimization" in Informatica is.

The concept is the same, but the implementation process is different. You can do it the
way you have described, using Update Strategy and lookups, or the way I mentioned in the
previous post. The SCD can be implemented with override queries for faster development
instead of going for transformations like Lookup and Update Strategy. When using queries
for SCD, joins replace the lookups, and the Update Strategy is replaced by a flag column
which says whether the record is going to be an insert or an update, provided the
granularity of record change is 1 day.

About extracting from Oracle views to Teradata, you'll have to go for relational

Informatica pushdown optimization is a new concept introduced in the Informatica 8
version series.

Pushdown optimization is a concept where you try to make most of the calculations
on the database side rather than doing them at the Informatica level. For example, if you
need some kind of aggregation, you can push those computations to be done at the database
end. It basically utilizes the power of the database, and you can do it at the source side
as well as the target side.
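The idea can be illustrated with a toy contrast between engine-side aggregation and the SQL a pushed-down session would hand to the database (table and column names are made up):

```python
# Conceptual sketch of pushdown: instead of pulling rows and aggregating
# in the ETL engine, the aggregation is expressed as SQL and executed by
# the database itself.
def aggregate_in_engine(rows):
    # What the ETL engine would do row by row without pushdown.
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return totals

def pushdown_sql(table="sales", group_col="region", measure="amount"):
    # What a pushed-down session would send to the database instead.
    return f"SELECT {group_col}, SUM({measure}) FROM {table} GROUP BY {group_col}"

rows = [{"region": "N", "amount": 10}, {"region": "N", "amount": 5},
        {"region": "S", "amount": 7}]
print(aggregate_in_engine(rows))  # {'N': 15, 'S': 7}
print(pushdown_sql())
```

With pushdown, only the (much smaller) aggregated result crosses the wire, which is where the performance gain comes from.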

Take a look at pdf from wisdomforce page: Best Practices for high-speed data transfer

 http://dwhtechstudy.com/Docs/informatica/New_Features_Enhancements_PC8.pdf

= PUSH DOWN Optimization

We can push transformation logic to the source or target database when the Lookup
transformation contains a lookup override.
To perform source-side, target-side, or full pushdown optimization for a session
containing lookup overrides, configure the session for pushdown optimization and select
a pushdown option that allows us to create views. We can also use full pushdown
optimization when we use the target loading option to treat source rows as delete.

http://www.docstoc.com/docs/6205071/Powercenter-Informatica-80

http://www.informatica.com/INFA_Resources/br_powercenter_6659.pdf

http://www.informatica.com/INFA_Resources/ds_high_availability_6674.pdf

http://www.docstoc.com

New features of Informatica 8

The following summarizes the changes to PowerCenter domains.
*Object Permissions*

Effective in version 8.1.1, you can assign object permissions to users when
you add a user account, set user permissions, or edit an object.

*Gateway and Logging Configuration*

Effective in version 8.1, you configure the gateway node and location for
log event files on the Properties tab for the domain. Log events describe
operations and error messages for core and application services, workflows
and sessions.

Log Manager runs on the master gateway node in a domain.

We can configure the maximum size of logs for automatic purge in megabytes.

Powercenter 8.1 also provides enhancements to the Log Viewer and log event
formatting.

*Unicode compliance*

Effective in version 8.1, all fields in the Administration Console accept

Unicode characters. One can choose UTF-8 character set as the repository
code page to store multiple languages.

*Memory and CPU Resource Usage*

You may notice an increase in memory and CPU resource usage on machines
running PowerCenter Services.


Effective in version 8.0, the Service Manager registers license information.

*High Availability* HA

High availability is the PowerCenter option that eliminates a single point

of failure in the PowerCenter environment and provides minimal service
interruption in the event of failure. High availability provides the
following functionality:

*Resilience. *Resilience is the ability for services to tolerate transient

failures, such as loss of connectivity to the database or network failures.

*Failover. *Failover is the migration of a service process or task to

another node when the node running the service process becomes unavailable.

*Recovery. *Recovery is the automatic or manual completion of tasks after an

application service is interrupted.
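The three behaviors can be sketched with a toy retry-then-failover function (a bounded retry count standing in for the resilience timeout is a simplification, and the services here are plain callables):

```python
# Toy sketch of the HA behaviors above: a request tolerates transient
# failures by retrying (resilience), then falls back to a backup node
# (failover). Recovery would be the completion of interrupted work, not
# modeled here.
def call_with_ha(primary, backup, max_retries=3):
    for _ in range(max_retries):
        try:
            return primary()  # resilience: retry through transient failures
        except ConnectionError:
            continue
    return backup()  # failover: backup node takes over

calls = {"n": 0}
def flaky_primary():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network failure")
    return "primary ok"

print(call_with_ha(flaky_primary, lambda: "backup ok"))  # primary ok
```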


The Administration Console is a browser-based utility that enables you to perform
administrative tasks such as adding domain users, deleting nodes, and creating services.

*Repository Security*

Repository passwords are now encrypted using a stronger algorithm, and permissions can
be set on individual repository objects.

*Partitioning*

There is database partitioning available. Apart from this we can configure

dynamic partitioning based on nodes in a grid, the number of partitions in
the source table and the number of partitions option. The session creates
additional partitions and attributes at the run time.

*Recovery*

The recovery of workflows, sessions and tasks is more robust now. The state of
the workflow/session is now stored in the shared file system and not in
memory.

*FTP*

We have options to have partitioned FTP targets and Indirect FTP file
source(with file list).

*Performance*

Pushdown optimization

Increases performance by pushing transformation logic to the database:
it analyzes the transformations and issues SQL statements to sources and
targets, and only processes the transformation logic that it cannot push to the
database.

*Flat file performance*

We can create more efficient data conversion using the new version.

We can concurrently write to multiple files in a session with partitioned

targets.

One can specify a command for the source or target file in a session. The
command can be used to create a source file, e.g. 'cat a file'.

*Pmcmd/infacmd*

A new infacmd command line program is available to administer domains.

The infasetup program sets up domain and node properties.

*Mappings *

We can now build custom transformation enhancements via the API using C++ and
Java code.

We can assign partition threads to specific Custom transformations with the
partition enhancements.

A new Java transformation is added which provides a simple native programming
interface to define transformations in the Java language.

User-defined functions, similar to macros in Excel, are introduced. Some new functions
are also added, such as COMPRESS, DECOMPRESS, AES_ENCRYPT, AES_DECRYPT, IN,
GREATEST, LEAST, PMT, RATE, RAND etc.


http://etlpowercenter.blogspot.com/

@@@

Lesson Three: Creating a Pass-Through Mapping, Session and

Workflow

In this lesson, we create a very simple mapping and execute it through a session and a
workflow. The mapping is based on the source and target definitions from the last lesson.
Between the source and target definitions sits a Source Qualifier transformation.

Source Qualifier
- When you add a relational or a flat file source definition to a mapping, you need to
connect it to a Source Qualifier transformation. The Source Qualifier transformation
represents the rows that the PowerCenter Server reads when it runs a session.
- Source Qualifier transformation can perform the following tasks:
 Join data originating from the same source database
 Filter rows when the PowerCenter Server reads source data
 Specify sorted ports
 Select only distinct values from the source
 Create a custom query to issue a special SELECT statement for the PowerCenter
Server to read source data
The session and workflow for this lesson are very simple; most of the
logic is in the mapping. The session just links the mapping to the runtime
environment, and the workflow is the runtime instance that executes the session.

Creating Source Definitions:

In the Source Analyzer, to create a source definition you have to have a source - a database,
flat file or XML. So you can import from a source and edit it, but it seems you can't create
one on your own. For editing, you can add, delete and update columns, and add some metadata
to each table definition.

Creating Target Definitions:

In the Target Designer, to create a target definition, you can drag and drop from the source
definitions table list or create the table yourself. Nothing special, just like creating a
normal table - but remember to select the correct database type.

Creating Target Tables:
In the Target Designer, you can create the table in the database and have Informatica
generate the SQL statement:
1. Click Targets > Generate/Execute SQL.
2. In the File Name field, enter the SQL DDL file name.
3. Select the Create Table, Drop Table, Foreign Key and Primary Key options.
4. Click the Generate and Execute button.

So from what I did and saw, it's simple to create source and target
definitions - just like creating new tables in a database. One thing I don't
like: after importing a source definition from the database, the layout view
is not well arranged like other database logical views; you have to
rearrange it yourself. So many tools can do this job
automatically, I don't know why PowerCenter Designer can't.

Lesson One, Repository Users, Groups and Folder

This lesson is pretty simple; it just shows you how to connect to a repository. My
environment was already set up by my company's administrator, so I don't need to
worry about installing and setting up the repository server and service.

If you don't have an environment, you have to install the server and do the configuration
yourself. I took a look at that part; it just needs a very powerful machine. So I
just ignored it.

This lesson includes connecting to an Informatica repository (domain, server, port,
username and password), creating groups, creating a folder, and setting up permissions
on that created folder. All this is done in Informatica PowerCenter Repository
Manager. Nothing special; everybody can understand it easily.

This lesson also includes creating the source tables and data for the tutorial.
You need to have a database and execute the appropriate SQL file in PowerCenter
Designer. I used smpl_ms.sql for SQL Server. This SQL file includes the tables'
schema and data. Informatica PowerCenter connects to a database target or source
via ODBC. I installed MS SQL Server Express on my local box. It's free and
seems OK; it doesn't eat too many resources compared to Oracle. When you set up
ODBC, the server is "localhost\sqlexpress".

After the SQL executes you can see a few new tables with some data in the SQL
Server master database.

Informatica PowerCenter Architecture

Before the 6 lessons, I must understand the PowerCenter architecture. This is not copied
and pasted from the help; I tried to write it down in my own understanding. It's not a big
deal from a developer's view, but it may help a lot in a job interview.

Domain and Service Architecture

- Domain:
 A domain is a collection of nodes and services.
 It is the primary unit used to centralize administration.

- Node:
 Logical representation of a machine in a domain.
 One node in each domain serves as a gateway for the domain.
 All processes in PowerCenter run as services on a node.

- Services:
 Two types of services: Core services and Application services.
 Core services: support the domain and application services. E.g. Domain service,
Log service.
 Application services: represent PowerCenter server-based functionality. E.g.
Repository service, Integration service...

- Informatica Repository:
 Contains a set of metadata tables within the repository database that Informatica
applications and tools access.

- Informatica Client:
 Used to manage users, define sources and targets, build mappings and mapplets with
transformation logic, and create workflows to run the mapping logic. The
Informatica Client has four client applications: Repository Manager, Designer,
Workflow Manager and Workflow Monitor.


What's my plan
1. First, follow the tutorial from the PowerCenter Help.
This includes 6 lessons in total.

2. If I can get Informatica PowerCenter 8 Developer training, I'll follow the training agenda.
I'm trying to get an on-site training combining levels one and two together.

Agenda:

 Data Integration Concepts

o Data Integration

o Mapping and Transformations

 PowerCenter Components and User Interface

o PowerCenter Architecture

o Lab - Using the Designer and Workflow Manager

 Source Qualifier

o Source Qualifier Transformation

o Velocity Methodology

o Lab B - Load Product Staging Table

o Source Pipelines

 Expression, Filter, File Lists and Workflow Scheduler

o Expression Editor

o Filter Transformation

o File Lists

o Workflow Scheduler

 Joins, Features and Techniques I

o Joiner Transformation

o Shortcuts

o Lab A - Load Sales Transaction Staging Table

o Lab B - Features and Techniques I

 Lookups and Reusable Transformations

o Lookup Transformation

o Reusable Transformations

o Lab B - Load Date Staging Table

 Debugger

o Debugging Mappings

o Lab - Using the Debugger

 Sequence Generator

o Lookup Caching

 Sorter, Aggregator and Self-Join

o Sorter Transformation

o Aggregator Transformation

o Active and Passive Transformations

o Data Concatenation

o Self-Join

 Router, Update Strategy and Overrides

o Router Transformation

o Update Strategy Transformation

o Expression Default Values

o Source Qualifier Override

o Target Override

o Dynamic Lookup

o Error Logging

o Unconnected Lookup Transformation

o System Variables

 Mapplets

o Mapplets

o Lab - Load Product Daily Aggregate Table

 Mapping Design

o Designing Mappings

o Workshop

o Workflow Variables

o Lab - Load Product Weekly Aggregate Table

o PMCMD Utility

o Worklets

o Lab - Load Inventory Fact Table

 Workflow Design

o Designing Workflows

o Workshop (Optional)

Agenda:

 Architecture Overview and High Availability

o Architectural overview

o Domains, nodes, and services

o Configuring services

o High Availability

 Mapping and Session Techniques

o Mapping parameters and variables and parameter files

o File lists

o Data driven aggregation

o Incremental aggregation

o Denormalization

 Workflow Techniques

o Workflow Control and Restart

o Dynamic Scheduling

o Pseudo-looping techniques

 Workflow Recovery

o Workflow recovery options

o State of operation

o Recovery using the command line

 Transaction Control

o Database Transactions

o PowerCenter transaction control options

o Transformation scope

 Error Handling

o Error categories

o Error logging

o Creating object queries

o Migration

o Comparing objects

o Repository Reporting

o Repository reports

 Memory Allocation

o Optimizing transformation caches

o Auto-cache sizing

 Performance Tuning Methodology

o Session dynamics

o Measuring performance

o Testing for bottlenecks

o Optimization techniques

 Pipeline Partitioning

o Pipeline types

o Multi-partition sessions


Informatica Resources
I really can't find many useful websites except Informatica's own website,
but you have to be a partner or customer to access those sites, or have to buy documents.
The lucky thing is I found some Informatica PowerCenter 7 documents that are free in
Chinese.

Informatica Customer Portal

As an Informatica customer, you can access the Informatica Customer Portal site at
http://my.informatica.com. The site contains product information, user group information,
and community resources.

Informatica Web Site

You can access the Informatica corporate web site at http://www.informatica.com. The
site contains information about Informatica, its background, upcoming events, and sales
offices. You will also find product and partner information. The services area of the site
includes important information about technical support, training and education, and
implementation services.

You can access the Informatica Developer Network at http://devnet.informatica.com. The

Informatica Developer Network is a web-based forum for third-party software
developers. The site contains information about how to create, market, and support
customer-oriented add-on solutions based on interoperability interfaces for Informatica
products.

As an Informatica customer, you can access the Informatica Knowledge Base at

http://my.informatica.com. Use the Knowledge Base to search for documented solutions
to known technical issues about Informatica products. You can also find answers to
frequently asked questions, technical white papers, and technical tips.


Why I create this blog
I'm a J2EE developer now, and I'm interested in becoming an Informatica PowerCenter ETL
developer. When I google online I can't find many useful websites or articles. So I
thought maybe I'd create a blog to record my study trail, and later maybe create an
Informatica PowerCenter website.

First I'll follow the tutorial in the Informatica PowerCenter help as a starting point;
after that I don't know yet. I hope I can find something to continue with, or base it on a
real project in my job.

@@@@

http://blogs.hexaware.com/informatica_way/informatica-powercenter-8x-
key-concepts-5.html

Informatica PowerCenter 8x Key Concepts – 5

5. Repository Service

The application service that manages the repository database tables is the Repository
Service. The Repository Service manages connections to the PowerCenter repository from
PowerCenter client applications like Designer, Workflow Manager, Monitor,
Repository Manager, the console and the Integration Service. The Repository Service is
responsible for ensuring the consistency of metadata in the repository.

Use the PowerCenter Administration Console Navigator window to create a

Repository Service. The properties needed to create are,

Service Name – name of the service like rep_SalesPerformanceDev

Location – Domain and folder where the service is created
Node, Primary Node & Backup Nodes – Node on which the service process runs
CodePage – The Repository Service uses the character set encoded in the repository code
page when writing data to the repository
Database type & details – Type of database, username, pwd, connect string and
tablespacename
The above properties are sufficient to create a repository service; however, we can take a
look at the following features, which are important for better performance and maintenance.
General Properties

> OperatingMode: Values are Normal and Exclusive. Use Exclusive mode to perform
administrative tasks like enabling version control or promoting local to global repository
> EnableVersionControl: Creates a versioned repository

Node Assignments: The “high availability option” is a licensed feature which allows us to

choose Primary & Backup nodes for continuous running of the Repository Service. Under
a normal license you would see only one node to select from.
Database Properties
Database Properties

> DatabaseArrayOperationSize: Number of rows to fetch each time an array database

operation is issued, such as insert or fetch. Default is 100

> DatabasePoolSize:Maximum number of connections to the repository database that

the Repository Service can establish. If the Repository Service tries to establish more
connections than specified for DatabasePoolSize, it times out the connection attempt
after the number of seconds specified for DatabaseConnectionTimeout
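The pool-size/timeout interplay can be sketched with a toy pool (time is passed in rather than measured, so the sketch stays deterministic; the property names are the ones documented above):

```python
# Sketch of DatabasePoolSize / DatabaseConnectionTimeout: the pool hands
# out at most pool_size connections; an acquire beyond that gives up once
# the timeout elapses, otherwise the caller keeps waiting.
class PoolTimeout(Exception):
    pass

class Pool:
    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.in_use = 0

    def acquire(self, waited_seconds, timeout):
        if self.in_use < self.pool_size:
            self.in_use += 1
            return "connection"
        if waited_seconds >= timeout:
            raise PoolTimeout("DatabaseConnectionTimeout exceeded")
        return None  # pool exhausted; caller keeps waiting

pool = Pool(pool_size=2)
pool.acquire(0, timeout=60)
pool.acquire(0, timeout=60)
try:
    pool.acquire(waited_seconds=61, timeout=60)  # third connection times out
except PoolTimeout:
    print("connection attempt timed out")
```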


> Error Severity Level: Level of error messages written to the Repository Service log.
Specify one of the following message levels: Fatal, Error, Warning, Info, Trace & Debug

> EnableRepAgentCaching:Enables repository agent caching. Repository agent caching

provides optimal performance of the repository when you run workflows. When you
enable repository agent caching, the Repository Service process caches metadata
requested by the Integration Service. Default is Yes.
> RACacheCapacity:Number of objects that the cache can contain when repository agent
caching is enabled. You can increase the number of objects if there is available memory
on the machine running the Repository Service process. The value must be between 100
and 10,000,000,000. Default is 10,000
> AllowWritesWithRACaching: Allows you to modify metadata in the repository when
repository agent caching is enabled. When you allow writes, the Repository Service
process flushes the cache each time you save metadata through the PowerCenter Client
tools. You might want to disable writes to improve performance in a production
environment where the Integration Service makes all changes to repository metadata.
Default is Yes.
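The caching behavior these properties describe can be sketched as a small capacity-bounded cache that is flushed on writes (the "evict oldest first" policy here is an assumption for the sketch, not Informatica's documented policy):

```python
# Sketch of repository agent caching: metadata reads are cached up to a
# capacity (RACacheCapacity), and saving metadata flushes the cache so
# stale objects are not served (as with AllowWritesWithRACaching).
class RepoAgentCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}

    def get(self, object_id, fetch):
        if object_id not in self.cache:
            if len(self.cache) >= self.capacity:
                self.cache.pop(next(iter(self.cache)))  # evict oldest entry
            self.cache[object_id] = fetch(object_id)  # miss: go to repository
        return self.cache[object_id]

    def save_metadata(self):
        self.cache.clear()  # flush on write

cache = RepoAgentCache(capacity=2)
cache.get("map1", lambda oid: f"metadata for {oid}")
cache.save_metadata()
print(len(cache.cache))  # 0: flushed after save
```

This also shows why disabling writes in production can help: if the Integration Service makes all repository changes, the cache is never invalidated by client saves.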

Environment Variables

The database client code page on a node is usually controlled by an environment

variable. For example, Oracle uses NLS_LANG, and IBM DB2 uses DB2CODEPAGE. All
Integration Services and Repository Services that run on this node use the same
environment variable. You can configure a Repository Service process to use a different

iGATE Internal
value for the database client code page environment variable than the value set for the
node.

You might want to configure the code page environment variable for a Repository
Service process when the Repository Service process requires a different database client
code page than the Integration Service process running on the same node.

For example, the Integration Service reads from and writes to databases using the UTF-8
code page. The Integration Service requires that the code page environment variable be
set to UTF-8. However, you have a Shift-JIS repository that requires that the code page
environment variable be set to Shift-JIS. Set the environment variable on the node to
UTF-8. Then add the environment variable to the Repository Service process properties
and set the value to Shift-JIS.

 Informatica PowerCenter 8x Key Concepts – 1

We shall look at the fundamental components of the Informatica PowerCenter 8.x Suite,
the key components are

1. PowerCenter Domain
2. PowerCenter Repository
4. PowerCenter Client
5. Repository Service
6. Integration Service

PowerCenter Domain

A domain is the primary unit for management and administration of services in

PowerCenter. Node, Service Manager and Application Services are components of a
domain.

Node

Node is the logical representation of a machine in a domain. The machine in which the
PowerCenter is installed acts as a Domain and also as a primary node. We can add other
machines as nodes in the domain and configure the nodes to run application services such
as the Integration Service or Repository Service. All service requests from other nodes in
the domain go through the primary node also called as ‘master gateway’.

The Service Manager

The Service Manager runs on each node within a domain and is responsible for starting
and running the application services. The Service Manager performs the following
functions,

 Authentication. Authenticates user requests from the Administration Console,
PowerCenter Client, Metadata Manager, and Data Analyzer
 Domain configuration. Manages configuration details of the domain like machine
name, port
 Node configuration. Manages configuration details of a node metadata like
machine name, port
 Licensing. When an application service connects to the domain for the first time
the licensing registration is performed and for subsequent connections the
licensing information is verified
 Logging. Manages the event logs from each service, the messages could be
‘Fatal’, ‘Error’, ‘Warning’, ‘Info’
 User management. Manages users, groups, roles, and privileges

Application services

The services that essentially perform data movement, connect to different data sources
and manage data are called Application services, they are namely Repository Service,
Integration Service, Web Services Hub, SAPBW Service, Reporting Service and
Metadata Manager Service. The application services run on each node based on the way
we configure the node and the application service

Domain Configuration

Some of the configurations for a domain involves assigning host name, port numbers to
the nodes, setting up Resilience Timeout values, providing connection information of
metadata Database, SMTP details etc. All the Configuration information for a domain is
stored in a set of relational database tables within the repository. Some of the global
properties that are applicable for Application Services like ‘Maximum Restart Attempts’,
‘Dispatch Mode’ as ‘Round Robin’/’Metric Based’/’Adaptive’ etc are configured under
Domain Configuration

Informatica 7.x vs 8.x

Ans

Informatica 8.1 adds new transformations and supports different kinds of unstructured

data.
Introduced:

1. SQL transformation
2. Java transformation
3. Support for unstructured data like emails, Word docs, and PDFs.
4. In the Custom transformation we can build the transformation using Java or VC++.
5. The concept of flat file updation is also introduced in 8.x.


 Few tips related to Informatica 8.x environment

Let us discuss some special scenarios that we might face in Informatica 8.x environments.

A. In Informatica 8.x, multiple Integration Services can be enabled under one node. If
there is a need to determine the process associated with an Integration Service or Repository
Service, it can be done as follows.
If there are multiple Integration Services enabled in a node, there are multiple pmserver
processes running on the same machine. In PowerCenter 8.x, it is not possible to differentiate
between the processes and correlate it to a particular Integration Service, unlike in 7.x where
every pmserver process is associated with a specific pmserver.cfg file. Likewise, if there are
multiple Repository Services enabled in a node, there are multiple pmrepagent processes
running on the same machine. In PowerCenter 8.x, it is not possible to differentiate between
the processes and correlate it to a particular Repository Service.
To do these in 8.x do the following:

4. Refresh the log display.

5. Use the PID from this column to identify the process as follows:

UNIX:

Where pid is the process ID of the service process.

Windows:
a. Open the Windows Task Manager.
b. Select the Processes tab.
c. Scroll to the value in the PID column that is displayed in the PowerCenter Administration
Console.

B. Sometimes, the PowerCenter Administration Console URL is inaccessible from some
machines even when the Informatica services are running. The following error is displayed on
the browser:

“The page cannot be displayed”

The reason is an invalid or missing configuration in the hosts file on the client machine.

To resolve this error, do the following:

1. Edit the hosts file located in the windows/system32/drivers/etc folder on the machine
from which the Administration Console is being accessed.
2. Add the host IP address and the host name (for the host where the PowerCenter
services are installed).

Example
10.1.2.10 ha420f3

3. Launch the Administration Console and access the login page by typing the URL:

Note that the host name in the URL must match the host entry in the hosts file.

 HOW TO: Disable the XMLW_31213 error messages in the

PowerCenter session log

Problem Description

Error XMLW_31213: Row rejected due to duplicate primary key.

Data is written correctly to the target, but the above error appears multiple times in the
session log, which can grow to a very large size. Is there a way to stop this message from
being written to the session log?

Solution

1. Select Integration Service in the Administration Console

2. Go to Properties > Configuration Properties.
3. Click Edit
4. Clear the XMLWarnDupRows option.


Setting the custom property XMLWarnDupRows to “NO” will not resolve this issue, as the
custom property has been replaced by the XMLWarnDupRows Integration Service property.

To use the option on PowerCenter 8.5.1 install the latest HotFix.

Reference

PowerCenter Administrator Guide > “Creating and Configuring the Integration Service”
> “Configuring the Integration Service Properties” > “Configuration Properties” > “Table
9-6. Configuration Properties for an Integration Service”

 HOW TO: Achieve High Availability for Web Services Hub in

the event of a failover

Solution

Using High Availability for Web Services Hub, a client application can access the Web
Service even in the event of failover of the Web Service to another node (when the URL
changes). To achieve this, use either of the following:

 Sample third party load balancers provided by Informatica (used only for test
environment)
 An external load balancer

In either case, configure the HubLogicalAddress property
of the Web Services Hub. For the HubLogicalAddress, provide the URL of the third party
load balancer that manages the Web Services Hub. This URL is published in the WSDL
for all Web Services that run on a Web Services Hub managed by the load balancer.

With a load balancer, the client application sends a request to the load balancer and the
load balancer routes the request to an available Web Services Hub. Any of the Web
Services Hub services can process requests from the client application. The load balancer
does not verify that the host names and port numbers given for the Web Services Hub
services are valid or that the services are running.

Before you send requests through the load balancer, ensure that the Web Services Hub
services are available.

Some of the third party load balancers that can be used (available on the Web) are Apache
tcpmon and Apache JMeter.

Sample third party load balancers provided by Informatica

Informatica provides sample third party load balancers (used only for test environment).
This can be used to understand the usage of load balancer with web Services Hub. This
sample third party load balancer is present at

This guides you on:

 How the load balancer works

 How to set up the configuration file used by the load balancer

Note

Informatica's sample third party load balancers can be used only in a test environment, not
in a production environment.

http://blog.mydwbi.com/?p=126

http://www.dwhlabs.com/dwh_concepts/normalization.aspx

Normalization

What is Normalization?

Normalization is the process of efficiently organizing data in a database. There are two
goals of the normalization process:

Eliminating redundant data.

Ensuring data dependencies make sense (storing only related data in a table).

First Normal Form

First Normal form (1 NF) sets the very basic rules for an organized database.

 Eliminate duplicative columns from the same table

 Create separate tables for each group of related data and identify each
row with a unique column or set of columns (the primary key)

Second Normal Form

Second Normal form (2 NF) further addresses the concept of removing duplicative data.

 Meet all the requirements of the first normal form.

 Remove subsets of data that apply to multiple rows of a table and place
them in separate tables.

 Create relationships between these new tables and their predecessors

through the use of foreign keys.

Third Normal Form

Third Normal form (3 NF) removes columns that are not dependent upon the primary
key.

 Meet all the requirements of the second normal form.

 Remove columns that are not dependent upon the primary key.
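As an illustration of these rules, here is a minimal Python sketch (the employee/department data and field names are made up for this example, not from the original text): the department name depends only on dept_id, not on the employee key, so normalization moves it into its own table.

```python
# Hypothetical denormalized rows: the department name repeats for every
# employee and depends only on dept_id, not on the emp_id primary key.
denormalized = [
    {"emp_id": 1, "emp_name": "John",  "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 2, "emp_name": "Smith", "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 3, "emp_name": "Lucky", "dept_id": 20, "dept_name": "HR"},
]

# Normalize: move the dept_id -> dept_name dependency into its own table
# and keep only the foreign key in the employee table.
departments = {}
employees = []
for row in denormalized:
    departments[row["dept_id"]] = row["dept_name"]
    employees.append({"emp_id": row["emp_id"],
                      "emp_name": row["emp_name"],
                      "dept_id": row["dept_id"]})

print(departments)   # {10: 'Sales', 20: 'HR'} - redundancy eliminated
print(employees[0])  # {'emp_id': 1, 'emp_name': 'John', 'dept_id': 10}
```

Each department name is now stored once; the employee table carries only the dept_id foreign key.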

 Handling Oracle Exceptions

There may be requirements wherein certain Oracle exceptions need to be treated as
warnings and certain exceptions need to be treated as fatal.

Normally, a fatal Oracle error may not be registered as a warning or row error and the
session may not fail; conversely, a non-fatal error may cause a PowerCenter session to
fail. This behavior can be changed with a few tweaks in:

A. The Stored Procedure,

B. The Oracle ErrorActionFile and

C. Server Settings

An Oracle Stored Procedure under certain conditions returns the exception

NO_DATA_FOUND. When this exception occurs, the session calling the Stored
Procedure does not fail.

Adding an entry for this error in the ora8err.act file and enabling the
OracleErrorActionFile option does not change this behavior (both ora8err.act and
OracleErrorActionFile are discussed later in this blog).

When this exception (NO_DATA_FOUND) is raised in PL/SQL it is sent to the Oracle
client as an informational message not an error message and the Oracle client sends this
message to PowerCenter. Since the Oracle client does not return an error to PowerCenter
the session continues as normal and will not fail.

A. Modify the Stored Procedure to return a different exception or a custom exception. A
custom exception number (only between -20000 and -20999) can be raised using the
raise_application_error PL/SQL command. For example, if the procedure raises error
-20991, the corresponding error action file entry would be:

20991, F

B. Edit the Oracle error action file as follows:

1. Go to the server/bin directory under the Informatica Services installation directory
(8.x) or the Informatica Server installation directory (7.1.x).

E.g., for Infa 8.x:

C:\Informatica\PowerCenter8.1.1\server\bin

2. Open the error action file (ora8err.act by default) in a text editor.

3. Change the action value associated with the error:

“F” is fatal and stops the session.
“R” is a row error: the row is written to the reject file and the session continues with the next row.

Examples:

To fail a session when the ORA-03114 error is encountered change the 03114 line in the
file to the following:

03114, F

To return a row error when the ORA-02292 error is encountered change the 02292 line to
the following:

02292, R

Note that the Oracle action file only applies to native Oracle connections in the session. If
the target is using the SQL*Loader external loader option, the message status will not be
modified by the settings in this file.

C. Once the file is modified, the following changes need to be made at the server level.

Infa 8.x

Set the OracleErrorActionFile Integration Service Custom Property to the name of the
file (ora8err.act by default) as follows:

1. Connect to the Administration Console.

2. Stop the Integration Service.

3. Select the Integration Service.

4. Under the Properties tab, click Edit in the Custom Properties section.

5. Under Name enter OracleErrorActionFile.

6. Under Value enter the file name (ora8err.act by default).

7. Click OK.

8. Start the Integration Service.

PowerCenter 7.1.x

UNIX

For the server running on UNIX:

1. Using a text editor open the PowerCenter server configuration file (pmserver.cfg).

2. Add the following entry to the end of the file:

OracleErrorActionFile=ora8err.act

3. Re-start the PowerCenter server (pmserver).

Windows

For the server running on Windows:

1. Click Start, click Run, type regedit, and click OK.

2. Go to the following registry key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\
PowerMart\Parameters\Configuration

3. Select Edit > New > String Value and enter “OracleErrorActionFile” as the string value
name.

4. Select Edit > Modify and enter the directory and the file name of the Oracle error action
file:

\ora8err.act

Example:

The default entry for PowerCenter 7.1.3 would be:

C:\Program Files\Informatica PowerCenter 7.1.3\Server\ora8err.act

And for PowerCenter 8.1.1 it would be:

C:\Informatica\PowerCenter8.1.1\server\bin\ora8err.act

Click OK

 Informatica and Oracle hints in SQL overrides

Hints used in a SQL statement send instructions to the Oracle optimizer, which can reduce the
query processing time. Can we make use of these hints in SQL overrides within
our Informatica mappings so as to improve query performance?

On a general note any Informatica help material would suggest: you can enter any valid SQL
statement supported by the source database in a SQL override of a Source qualifier or a Lookup
transformation or at the session properties level.

While using them as part of a Source Qualifier has no complications, using them in a Lookup SQL
override gets a bit tricky. Use of a forward slash followed by an asterisk (“/*”) in a lookup SQL
override (generally used for commenting purposes in SQL, and at times for Oracle hints) results in
session failure with an error like:

TE_7017 : Failed to Initialize Server Transformation lkp_transaction

2009-02-19 12:00:56 : DEBUG : (18785 | MAPPING) : (IS | Integration_Service_xxxx) :
node01_UAT-xxxx : DBG_21263 : Invalid lookup override

SELECT SALES.SALESSEQ as SalesId, SALES.OrderID as ORDERID, SALES.OrderDATE as
ORDERDATE FROM SALES, AC_SALES WHERE AC_SALES.OrderSeq >= (Select /*+
FULL(AC_Sales) PARALLEL(AC_Sales,12) */ min(OrderSeq) From AC_Sales)

This is because Informatica’s parser fails to recognize this special character when used in a Lookup
override. A parameter, available starting with the PowerCenter 7.1.3 release,
enables the use of the forward slash and hence of hints.

 Infa 7.x
1. Using a text editor open the PowerCenter server configuration file (pmserver.cfg).
2. Add the following entry at the end of the file:
LookupOverrideParsingSetting=1
3. Re-start the PowerCenter server (pmserver).

 Infa 8.x
1. Connect to the Administration Console.
2. Stop the Integration Service.
3. Select the Integration Service.
4. Under the Properties tab, click Edit in the Custom Properties section.
5. Under Name enter LookupOverrideParsingSetting
6. Under Value enter 1.
7. Click OK.
8. And start the Integration Service.

 Starting with PowerCenter 8.5, this change could be done at the session task itself
as follows:

1. Edit the session.

2. Select Config Object tab.
3. Under Custom Properties add the attribute LookupOverrideParsingSetting and set the Value
to 1.
4. Save the session.

Tags: HINTS in Lookup SQL override, Oracle HINTS in Informatica

STAR SCHEMA

Star schema architecture is the simplest data warehouse design. The main feature of a star
schema is a table at the center, called the fact table and the dimension tables which allow
browsing of specific categories, summarizing, drill-downs and specifying criteria.
Typically, most of the fact tables in a star schema are in database third normal form,
while dimensional tables are de-normalized (second normal form).

Fact table

The fact table is not a typical relational database table as it is de-normalized on purpose -
to enhance query response times. The fact table typically contains records that are ready
to explore, usually with ad hoc queries. Records in the fact table are often referred to as
events, due to the time-variant nature of a data warehouse environment.
The primary key for the fact table is a composite of all the columns except numeric
values / scores (like QUANTITY, TURNOVER, exact invoice date and time).

Typical fact tables in a global enterprise data warehouse are (apart from these, there may be
some company- or business-specific fact tables):

sales fact table - contains all details regarding sales

orders fact table - in some cases the table can be split into open orders and historical
orders. Sometimes the values for historical orders are stored in a sales fact table.
budget fact table - usually grouped by month and loaded once at the end of a year.
forecast fact table - usually grouped by month and loaded daily, weekly or monthly.
inventory fact table - reports stocks, usually refreshed daily

Dimension table

Nearly all of the information in a typical fact table is also present in one or more
dimension tables. The main purpose of maintaining Dimension Tables is to allow
browsing the categories quickly and easily.
The primary keys of each of the dimension tables are linked together to form the
composite primary key of the fact table. In a star schema design, there is only one de-
normalized table for a given dimension.

time dimension table
customers dimension table
products dimension table
key account managers (KAM) dimension table
sales office dimension table

Star schema example

An example of a star schema architecture is depicted below.

SNOWFLAKE SCHEMA
Snowflake schema architecture is a more complex variation of a star schema design. The
main difference is that dimensional tables in a snowflake schema are normalized, so they
have a typical relational database design.

Snowflake schemas are generally used when a dimensional table becomes very big and
when a star schema cannot represent the complexity of the data structure. For example, if a
PRODUCT dimension table contains millions of rows, the use of a snowflake schema
should significantly improve performance by moving some data out to another table (with
BRANDS, for instance).

The problem is that the more normalized the dimension tables are, the more complicated the
SQL joins needed to query them become: in order for a query to be
answered, many tables need to be joined and aggregates generated.
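As a minimal sketch of this trade-off (using Python's built-in sqlite3; the PRODUCT/BRAND/SALES_FACT tables and their columns are made up for illustration), the snowflaked BRAND table forces a second join that a denormalized star-schema PRODUCT dimension would not need:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Snowflaked dimension: BRAND data is moved out of PRODUCT into its own table.
cur.execute("CREATE TABLE brand (brand_id INTEGER PRIMARY KEY, brand_name TEXT)")
cur.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, "
            "product_name TEXT, brand_id INTEGER REFERENCES brand)")
cur.execute("CREATE TABLE sales_fact (product_id INTEGER, quantity INTEGER)")

cur.execute("INSERT INTO brand VALUES (1, 'Acme')")
cur.executemany("INSERT INTO product VALUES (?, ?, ?)",
                [(100, "Widget", 1), (101, "Gadget", 1)])
cur.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                [(100, 5), (101, 3), (100, 2)])

# Querying sales by brand now needs two joins instead of one.
rows = cur.execute("""
    SELECT b.brand_name, SUM(f.quantity)
    FROM sales_fact f
    JOIN product p ON p.product_id = f.product_id
    JOIN brand   b ON b.brand_id   = p.brand_id
    GROUP BY b.brand_name
""").fetchall()
print(rows)  # [('Acme', 10)]
```

In a star schema, brand_name would sit directly in the PRODUCT dimension and the second join would disappear, at the cost of repeating the brand data on every product row.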

An example of a snowflake schema architecture is depicted below.

GALAXY SCHEMA

For each star schema or snowflake schema it is possible to construct a fact constellation
schema.
This schema is more complex than the star or snowflake architecture because it
contains multiple fact tables, which allows dimension tables to be shared amongst many
fact tables.
That solution is very flexible; however, it may be hard to manage and support.

The main disadvantage of the fact constellation schema is a more complicated design
because many variants of aggregation must be considered.

In a fact constellation schema, different fact tables are explicitly assigned to the
dimensions that are relevant for the given facts. This may be useful when some
facts are associated with a given dimension level and other facts with a deeper dimension
level.

Use of this model is reasonable when, for example, there is a sales fact table (with
details down to the exact date and invoice header id) and a sales forecast fact table
calculated based on month, client id and product id.

In that case, using two fact tables at different levels of grouping is realized
through a fact constellation model.

Data Integration Challenge – Understanding Lookup Process – I

One of the basic ETL steps that we use in most ETL jobs during
development is the ‘lookup’. We shall discuss what a lookup process is, when to use it, how
it works, and some points to be considered while using it.

What is a lookup process?

If we query another table or file (called the ‘lookup table’ or ‘lookup file’) to retrieve
additional data, it is called a ‘lookup process’. The ‘lookup table or file’ can reside on
the target or the source system. Usually we pass one or more column values that have been
read from the source system to the lookup process in order to filter and get the required
data.

There are three ways ETL products perform the ‘lookup process’:

 Direct Query: Run the required query against the table or file whenever the
‘lookup process’ is called.
 Join Query: Run a query joining the source and the lookup table/file before
starting to read the records from the source.
 Cached Query: Run a query to cache the data from the lookup table/file local to
the ETL server as a cache file. When the data flows from the source, run the
required query against the cache file whenever the ‘lookup process’ is called.

Most of the leading products like Informatica, DataStage support all the three ways in
their product architecture. We shall see the pros and cons of this process and how these
work in part II.
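The cached variant can be sketched as follows (a minimal Python illustration with made-up customer/order data; this only approximates the idea, not the products' actual implementation): the lookup table is read once into an in-memory cache keyed by the lookup condition, and each source row then probes the cache instead of re-querying the database.

```python
# Hypothetical lookup table: maps customer_id -> customer_name.
lookup_table = [(1, "Alice"), (2, "Bob")]

# Cached Query: build the cache once, keyed by the lookup condition column.
cache = {cust_id: name for cust_id, name in lookup_table}

source_rows = [{"order_id": 10, "customer_id": 2},
               {"order_id": 11, "customer_id": 1},
               {"order_id": 12, "customer_id": 9}]  # 9 has no match

# Each source row probes the cache; an unmatched key yields None
# (analogous to a lookup returning NULL when no row matches).
enriched = [dict(row, customer_name=cache.get(row["customer_id"]))
            for row in source_rows]
print(enriched)
```

The trade-off is the same as described above: the cache is built once up front, after which each probe avoids a round trip to the lookup table.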

Transformation: Normalizer (nrm_). Explanation: The Normalizer transformation is an active

and connected transformation.

DESIGNER:

Use the Normalizer transformation with COBOL sources, which are often stored in a
denormalized format.
You can also use the Normalizer transformation with relational sources to create multiple
rows from a single row of data.

MAPPING:

Objective : To create a mapping which converts a single row into multiple rows.

Mapping Flow : Source Definition (Flat File) > Source Qualifier > Expression (column
names) > Normalizer transformation (converts single row into multiple rows)> Target
Definition (flat file)

Designing : We designed this mapping using the INFORMATICA tool, version 7.1.1.

Description :

Source Definition

ENO ENAME ESAL EAGE

100 JOHN 2000 23
101 SMITH 5000 41
102 LUCKY 6000 32

Target Definition

ENO COL_NAM COL_VAL

100 ENAME JOHN
100 ESAL 2000
100 EAGE 23
101 ENAME SMITH
101 ESAL 5000
101 EAGE 41
102 ENAME LUCKY
102 ESAL 6000
102 EAGE 32

XML FILE (DESIGNER): m_norm_col_rows – single row into multiple rows
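The transformation's effect on the source and target definitions shown above can be sketched in Python (illustration only; the actual Normalizer is configured in the Designer, not written as code):

```python
source_rows = [
    {"ENO": 100, "ENAME": "JOHN",  "ESAL": 2000, "EAGE": 23},
    {"ENO": 101, "ENAME": "SMITH", "ESAL": 5000, "EAGE": 41},
    {"ENO": 102, "ENAME": "LUCKY", "ESAL": 6000, "EAGE": 32},
]

# Unpivot: emit one output row per non-key column, matching the target
# definition (ENO, COL_NAM, COL_VAL).
target_rows = []
for row in source_rows:
    for col in ("ENAME", "ESAL", "EAGE"):
        target_rows.append({"ENO": row["ENO"],
                            "COL_NAM": col,
                            "COL_VAL": row[col]})

for r in target_rows:
    print(r["ENO"], r["COL_NAM"], r["COL_VAL"])
```

Each of the three source rows becomes three target rows, giving the nine-row target shown in the example.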

Exceptions in Informatica – 2
Let us see a few more strange exceptions in Informatica.

1. Sometimes the Session fails with the below error message.

“FATAL ERROR : Caught a fatal signal/exception
FATAL ERROR : Aborting the DTM process due to fatal signal/exception.”

There might be several reasons for this. One possible reason could be the way the
function SUBSTR is used in the mappings, like the length argument of the SUBSTR
function being specified incorrectly.
Example:

IIF(SUBSTR(MOBILE_NUMBER, 1, 1) = ‘9′,
SUBSTR(MOBILE_NUMBER, 2, 24),
MOBILE_NUMBER)

In this example MOBILE_NUMBER is a variable port and is 24 characters long.

When the field itself is 24 characters long, the SUBSTR starts at position 2 and goes for a length
of 24, ending at the 25th character, which is beyond the field.

To solve this, correct the length option so that it does not go beyond the length of the
field or avoid using the length option to return the entire string starting with the start
value.
Example:

In this example modify the expression as follows:

IIF(SUBSTR(MOBILE_NUMBER, 1, 1) = ‘9′,
SUBSTR(MOBILE_NUMBER, 2, 23),
MOBILE_NUMBER)

OR

IIF(SUBSTR(MOBILE_NUMBER, 1, 1) = ‘9′,
SUBSTR(MOBILE_NUMBER, 2),
MOBILE_NUMBER).

2. The following error can occur at times when a session is run

“TE_11015 Error in xxx: No matching input port found for output port OUTPUT_PORT
TM_6006 Error initializing DTM for session…”

Where xxx is a Transformation Name.

This error occurs when there is corruption in the transformation.
To resolve this, recreate the transformation in the mapping having this error.

3. At times you get the below problems,

1. When opening designer, you get “Exception access violation”, “Unexpected condition
detected”.

2. Unable to see the navigator window, output window or the overview window in
designer even after toggling it on.

3. Toolbars or checkboxes are not showing up correctly.

These are all indications that the pmdesign.ini file might be corrupted. To solve this,
follow the steps below.

1. Close Informatica Designer

2. Rename the pmdesign.ini (in c:\winnt\system32 or c:\windows\system).
3. Re-open the designer.

When PowerMart opens the Designer, it will create a new pmdesign.ini if it doesn’t find
an existing one. Even reinstalling the PowerMart clients will not create this file if it finds
one.

Informatica Exceptions – 3

Here are few more Exceptions:

1. There are occasions where sessions fail with the following error in the Workflow
Monitor:

“First error code [36401], message [ERROR: Session task instance [session XXXX]:
Execution terminated unexpectedly.] “

“LM_36401 Execution terminated unexpectedly.”

To determine the error do the following:

a. If the session fails before initialization and no session log is created look for errors in
Workflow log and pmrepagent log files.

b. If the session log is created and if the log shows errors like

“Unexpected condition detected at file [xxx] line yy”

then a core dump has been created on the server machine. In this case Informatica
Technical Support should be contacted with specific details. This error may also occur
when the PowerCenter server log becomes too large and the server is no longer able to
write to it. In this case a workflow and session log may not be completed. Deleting or
renaming the PowerCenter Server log (pmserver.log) file will resolve the issue.

2. Given below is not an exception but a scenario which most of us would have come
across.

Rounding problem occurs with columns in the source defined as Numeric with Precision
and Scale or Lookups fail to match on the same columns. Floating point arithmetic is
always prone to rounding errors (e.g. the number 1562.99 may be represented internally
as 1562.988888889, very close but not exactly the same). This can also affect functions
that work with scale such as the Round() function. To resolve this do the following:

a. Select the Enable high precision option for the session.

b. Define all numeric ports as Decimal datatype with the exact precision and scale
desired. When high precision processing is enabled, the PowerCenter Server supports
numeric values up to 28 digits. However, the tradeoff is a performance hit (actual
performance depends on how many decimal ports there are).
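The effect can be reproduced with a short Python sketch (illustrative only; the same binary floating-point behaviour underlies any engine that uses doubles): 1562.99 has no exact binary representation, while a decimal type keeps the exact precision and scale.

```python
from decimal import Decimal

# The double nearest to 1562.99 is slightly off; printing more digits
# than the default repr exposes the representation error.
print(f"{1562.99:.15f}")  # 1562.990000000000009

# The document's example: 1562.99 is close to, but not the same as,
# a slightly different float, so equality comparisons are fragile.
print(1562.99 == 1562.988888889)  # False

# Decimal keeps the exact precision and scale, like a Decimal port
# with high precision enabled.
total = Decimal("1562.99") + Decimal("0.01")
print(total)  # 1563.00
```

Constructing Decimal from a string (not from a float) is what preserves the exact value, which mirrors defining the port with the exact precision and scale desired.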

Exceptions in Informatica

There exists no product/tool without strange exceptions/errors; we will see some of these
exceptions.
1. You get the below error when you do “Generate SQL” in Source Qualifier and try to
validate it.
“Query should return exactly n field(s) to match field(s) projected from the Source
Qualifier”
Where n is the number of fields projected from the Source Qualifier.

Possible reasons for this to occur are:

1. The order of ports may be wrong
2. The number of ports in the transformation may be more/less.
3. Sometimes you will have the correct number of ports in the correct order and still
face this error; in that case, make sure that the Owner name and Schema name
are specified correctly for the tables used in the Source Qualifier query.
E.g., TC_0002.EXP_AUTH@TIMEP

2. The following error occurs at times when an Oracle table is used

“[/export/home/build80/zeusbuild/vobs/powrmart
Where xxx is some line number mostly 241, 291 or 416.

1. Use DataDirect Oracle ODBC driver instead of the driver “Oracle in

2. If the table has been imported using the Oracle drivers which are not supported, then
columns with Varchar2 data type are replaced by String data type and Number columns
are imported with precision Zero(0).

Unexpected Condition Detected

Warning: Unexpected condition at: statbar.cpp: 268
Contact Informatica Technical Support for assistance

This happens when there is not enough memory in the system. To resolve it we can either:

1. Increase the virtual memory in the system, or

2. If we continue to receive the same error even after increasing the virtual
memory: in the Designer, go to Tools > Options, select the General tab and clear the
“Save MX Data” option.


Ans:

DIMENSION TABLE | FACT TABLE
Provides the context/descriptive information for the fact table measurements. | Provides the measurements of an enterprise; a measurement is the amount determined by observation.
Structure: surrogate key, one or more other fields that compose the natural key (NK) and a set of attributes. | Structure: foreign keys (FK), degenerate dimensions and measurements.
The size of a dimension table is smaller than that of a fact table. | The size of a fact table is larger than that of a dimension table.
In a schema, more dimension tables are present than fact tables. | In a schema, fewer fact tables are observed compared to dimension tables.
A surrogate key is used to prevent primary key (PK) violations (to store historical data). | The composite of foreign key and degenerate dimension fields acts as the primary key.
Provides entry points to data. | -
Values of fields are in numeric and text representation. | Values of the fields are always in numeric or integer form.


Ans:

DATA MART | DATA WAREHOUSE
A scaled-down version of the data warehouse that addresses only one subject, like the Sales department or the HR department. | A database management system that facilitates on-line analytical processing by allowing the data to be viewed in different dimensions or perspectives.
One fact table with multiple dimension tables. | More than one fact table and multiple dimension tables.
[Sales Department] [HR Department] [Manufacturing Department] | [Sales Department, HR Department, Manufacturing Department]
Small organizations prefer data marts. | Bigger organizations prefer data warehouses.

Ans:

DATA MINING | WEB MINING
Data mining involves using techniques to find underlying structure and relationships in large amounts of data. | Web mining involves the analysis of the Web server logs of a Web site.
Data mining products tend to fall into five categories: neural networks, knowledge discovery, data visualization, fuzzy query analysis and case-based reasoning. | The Web server logs contain the entire collection of requests made by a potential or current customer through their browser and the responses by the Web server.

OLTP VS OLAP

Ans:
OLTP | OLAP
On Line Transaction Processing | On Line Analytical Processing
Tables are in normalized form | Partially normalized / denormalized tables
Single record access | Multiple records accessed for analysis purposes
Holds current data | Holds current and historical data
Records are maintained using a primary key field | Records are based on a surrogate key field
Tables or records can be deleted | Records cannot be deleted
Complex data model | Simplified data model

 Data Integration Challenge – Understanding Lookup Process

– III

In Part II we discussed ‘when to use’ and ‘when not to use’ each type of lookup
process: the Direct Query lookup, the Join based lookup and the Cache file based lookup.
Now we shall see the points to be considered for better performance of these
‘lookup’ types.
In the case of Direct Query the following points are to be considered
 Index on the lookup condition columns
 Selecting only the required columns

In the case of Join based lookup, the following points are to be considered
 Index on the columns that are used as part of Join conditions
 Selecting only the required columns

In the case of Cache file based lookup, let us first try to understand the process of how
these files are built and queried.
The key aspects of a Lookup Process are the
 SQL that pulls the data from lookup table
 Cache memory/files that holds the data
 Lookup Conditions that query the cache memory/file
 Output Columns that are returned back from the cache files

Cache file build process:

Whether the product is Informatica or DataStage, when a lookup process is designed
we define the ‘lookup conditions’ or the ‘key fields’ and also define the list of fields
that need to be returned by the lookup query. Based on these definitions the required
data is pulled from the lookup table and the cache file is populated with the data. The cache
file structure is optimized for data retrieval, assuming that the cache file will be queried
on a certain set of columns called the ‘lookup conditions’ or ‘key fields’.

In the case of Informatica, the cache file is of separate index and data file, the index file
has the fields that are part of the ‘lookup condition’ and the data file has the fields that are
to be returned. Datastage cache files are called Hash files which are optimized based on
the ‘key fields’.

Cache file query process:

Irrespective of the product of choice, the following are the steps involved internally
when a lookup process is invoked.

Process:
1. Get the Inputs for Lookup Query, Lookup Condition and Columns to be returned
2. Load the cache file to memory
3. Search the record(s) matching the Lookup condition values , in case of
Informatica this search happens on the ‘index file’
4. Pull the required columns matching the condition and return, in case of
Informatica with the result from ‘index file’ search, the data from the ‘data file’ is
located and retrieved

In the search process, based on the memory availability there could be many disk hits and
page swapping.

So in terms of performance tuning we could look at two levels:

1. how to optimize the cache file building process
2. how to optimize the cache file query process

The following lists the points to be considered for the better performance of a cache
file based lookup.

Optimize cache file building process:
• While retrieving the records to build the cache file, sort the records by the lookup
condition; this sorting speeds up the index (file) building process, because the search
tree of the index file is built faster with less node realignment.
• Select only the required columns, which reduces the cache file size.
• Reuse the same cache file for multiple requirements with the same or slightly varied
lookup conditions.

Optimize cache file query process:
• Sort the records that come from the source to query the cache file by the lookup
condition columns; this ensures less page swapping and fewer disk hits. If the
subsequent input source records come in a continuous sorted order then the hits of
the required index data in memory are high and disk swapping is reduced.
• Having a dedicated separate disk ensures a reserved space for the lookup cache files
and also improves the response of writing to the disk and reading from the disk.
• Avoid querying recurring lookup conditions by sorting the incoming records by the
lookup condition.
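The index-file/data-file arrangement described above can be sketched in Python (hypothetical data; real cache files are binary and page-based, which this dict/list pair only approximates):

```python
# Hypothetical lookup data: condition column -> columns to be returned.
lookup_rows = [(101, "Widget", 9.99), (102, "Gadget", 4.50), (103, "Gizmo", 7.25)]

# "Index file": only the lookup-condition values, optimized for search;
# "data file": the columns to be returned, addressed by position.
index_file = {}
data_file = []
for pos, (key, name, price) in enumerate(lookup_rows):
    index_file[key] = pos
    data_file.append((name, price))

def lookup(key):
    # Search the index first; use its result to locate the data record,
    # mirroring the two-step index-then-data retrieval described above.
    pos = index_file.get(key)
    return data_file[pos] if pos is not None else None

# Sorting the incoming source keys keeps probes clustered; in a real
# page-based cache this reduces page swapping.
for key in sorted([103, 101, 103, 102]):
    print(key, lookup(key))
```

The separation means the (small) index structure is searched first, and the (wider) data record is only fetched once a match is found.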

 Data Integration Challenge – Capturing Changes

When we receive data from source systems, the data file will not carry a flag
indicating whether the record provided is new or has changed. We need to build a
process to determine the changes and then push them to the target table.

1. Pull the incremental data from the source file or table

2. Process the pulled incremental data and determine the impact of it on the target
table as Insert or Update or Delete

Step 1: Pull the incremental data from the source file or table
If the source system has audit columns, like a date, then we can find the new records;
otherwise we cannot identify the new records and have to consider the complete data.
For a source system file or table that has audit columns, we would follow the steps below:

1. While reading the source records for a day (session), find the maximum value of
the date (audit field) and store it in a persistent variable or a temporary table
2. Use this persistent variable value as a filter the next day to pull the incremental
data from the source table
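These two steps can be sketched as follows (Python, with made-up rows and dates; in a real job the persisted value would live in a mapping variable or a control table):

```python
import datetime

# Hypothetical source rows with an audit column (last_modified).
source = [
    {"id": 1, "last_modified": datetime.date(2009, 1, 5)},
    {"id": 2, "last_modified": datetime.date(2009, 1, 6)},
    {"id": 3, "last_modified": datetime.date(2009, 1, 7)},
]

# Value persisted from the previous run (session variable / temp table).
last_max = datetime.date(2009, 1, 5)

# Filter with the persisted value, then store the new maximum for the
# next run.
incremental = [r for r in source if r["last_modified"] > last_max]
last_max = max(r["last_modified"] for r in incremental)

print([r["id"] for r in incremental])  # [2, 3]
print(last_max)                        # 2009-01-07
```

Only rows modified after the persisted high-water mark are pulled, and the mark advances so the next session starts where this one left off.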

Step 2: Determine the impact of the record on target table as Insert/Update/ Delete
Following are the scenarios that we would face and the suggested approach

1. Data file has only incremental data from Step 1 or the source itself provide only
incremental data
o do a lookup on the target table and determine whether it’s a new record or
an existing record
o if an existing record then compare the required fields to determine whether
it’s an updated record
o have a process to find the aged records in the target table and do a clean up
for ‘deletes’
2. Data file has the full, complete data because no audit columns are present
o The data is of higher volume
 have a back up of the previously received file
 perform a comparison of the current file and prior file; create a
‘change file’ by determining the inserts, updates and deletes.
Ensure both the ‘current’ and ‘prior’ file are sorted by key fields
 have a process that reads the ‘change file’ and loads the data into
the target table
 based on the ‘change file’ volume, we could decide whether to do a
o The data is of lower volume
 do a lookup on the target table and determine whether it’s a new
record or an existing record
 if an existing record then compare the required fields to determine
whether it’s an updated record
 have a process to find the aged records in the target table and do a
clean up or delete
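The file-comparison approach described for the higher-volume case can be sketched as a small change-file builder. The key field name and record shapes are assumptions for illustration; a real implementation would stream sorted files rather than hold them in memory.

```python
def build_change_file(prior, current, key="id"):
    """Classify each record as an insert, update, or delete.

    `prior` and `current` are lists of dicts representing the previous
    and latest full extracts, each identified by a key field.
    """
    prior_by_key = {r[key]: r for r in prior}
    current_by_key = {r[key]: r for r in current}

    # In current but not prior -> insert; in prior but not current -> delete.
    inserts = [r for k, r in current_by_key.items() if k not in prior_by_key]
    deletes = [r for k, r in prior_by_key.items() if k not in current_by_key]
    # Present in both but with different field values -> update.
    updates = [r for k, r in current_by_key.items()
               if k in prior_by_key and r != prior_by_key[k]]
    return {"insert": inserts, "update": updates, "delete": deletes}

prior = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}, {"id": 4, "amt": 40}]
current = [{"id": 1, "amt": 10}, {"id": 2, "amt": 25}, {"id": 3, "amt": 30}]
changes = build_change_file(prior, current)
```

A downstream process then reads only the change file, which is typically far smaller than the full extract.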

 Data Integration Challenge – Storing Timestamps

Storing timestamps along with a record, to indicate its arrival or a change in its value,
is a must in a data warehouse. We tend to take this for granted, adding timestamp fields to
table structures while missing how much storage space a timestamp field
can occupy: in many databases such as SQL Server and Oracle, a timestamp takes almost double
the storage of an integer data type, and if we have two fields, one for the
insert timestamp and another for the update timestamp, the storage required
doubles again. There are many instances where we could avoid using timestamps,
especially when they are used primarily for determining the
incremental records or are stored just for audit purposes.

How to effectively manage the data storage and also leverage the benefit of a
timestamp field?

One way of managing the storage of timestamp field is by introducing a process id field
and a process table. Following are the steps involved in applying this method in table
structures and as well as part of the ETL process.
Data Structure

1. Consider a table named PAYMENT with two fields of timestamp data type,
INSERT_TIMESTAMP and UPDATE_TIMESTAMP, used for capturing the
changes for every record present in the table
2. Create a table named PROCESS_TABLE with columns PROCESS_NAME
Char(25), PROCESS_ID Integer and PROCESS_TIMESTAMP Timestamp
3. Now drop the fields of the TIMESTAMP data type from the table PAYMENT
4. Create two fields of integer data type in the table PAYMENT,
INSERT_PROCESS_ID and UPDATE_PROCESS_ID
5. These newly created id fields INSERT_PROCESS_ID and
UPDATE_PROCESS_ID would be logically linked with the table
PROCESS_TABLE through its field PROCESS_ID
ETL Process
1. Let us consider an ETL process called 'payment process' that loads data into the
table PAYMENT
2. Now create a pre-process which runs before the 'payment process'; in the
pre-process, build the logic by which a record is inserted with values like
('payment process', SEQUENCE number, current timestamp) into the
PROCESS_TABLE table. The PROCESS_ID in the PROCESS_TABLE table
could be defined as a database sequence function.
3. Pass the currently generated PROCESS_ID of PROCESS_TABLE as
'current_process_id' from the pre-process step to the 'payment process' ETL process
4. In the 'payment process', if a record is to be inserted into the PAYMENT table,
the current_process_id value is set on both the columns INSERT_PROCESS_ID
and UPDATE_PROCESS_ID; if a record is being updated in the PAYMENT
table, the current_process_id value is set only on the column
UPDATE_PROCESS_ID
5. The timestamp values for the records inserted or updated in the table
PAYMENT can now be picked from the PROCESS_TABLE by joining the
PROCESS_ID with the INSERT_PROCESS_ID and UPDATE_PROCESS_ID
columns of the PAYMENT table
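The whole scheme can be exercised end to end with an embedded database. This is a runnable sketch using SQLite: the table and column names follow the text, but the database sequence is emulated with an autoincrement key, which is an assumption for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE PROCESS_TABLE (
    PROCESS_ID INTEGER PRIMARY KEY AUTOINCREMENT,  -- stands in for a sequence
    PROCESS_NAME CHAR(25),
    PROCESS_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP)""")
cur.execute("""CREATE TABLE PAYMENT (
    PAYMENT_ID INTEGER,
    AMOUNT INTEGER,
    INSERT_PROCESS_ID INTEGER,
    UPDATE_PROCESS_ID INTEGER)""")

# Pre-process: register this run and capture its process id.
cur.execute("INSERT INTO PROCESS_TABLE (PROCESS_NAME) VALUES ('payment process')")
current_process_id = cur.lastrowid

# Insert: both id columns get the current process id.
cur.execute("INSERT INTO PAYMENT VALUES (1, 100, ?, ?)",
            (current_process_id, current_process_id))

# A later run updates the record: only UPDATE_PROCESS_ID changes.
cur.execute("INSERT INTO PROCESS_TABLE (PROCESS_NAME) VALUES ('payment process')")
next_process_id = cur.lastrowid
cur.execute("UPDATE PAYMENT SET AMOUNT = 120, UPDATE_PROCESS_ID = ? "
            "WHERE PAYMENT_ID = 1", (next_process_id,))

# Recover the insert/update timestamps by joining back to PROCESS_TABLE.
row = cur.execute("""SELECT p.PAYMENT_ID, ins.PROCESS_TIMESTAMP, upd.PROCESS_TIMESTAMP
                     FROM PAYMENT p
                     JOIN PROCESS_TABLE ins ON ins.PROCESS_ID = p.INSERT_PROCESS_ID
                     JOIN PROCESS_TABLE upd ON upd.PROCESS_ID = p.UPDATE_PROCESS_ID
                  """).fetchone()
```

The PAYMENT row itself stores only two small integers, yet the full timestamps remain recoverable through the join.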
Benefits
 The fields INSERT_PROCESS_ID and UPDATE_PROCESS_ID occupy less
space than the timestamp fields
 Both the columns INSERT_PROCESS_ID and UPDATE_PROCESS_ID are
index friendly
 It's easier to handle these process-id fields when picking the records for
determining the incremental changes or for any audit reporting.
TRANSFORMATION DEVELOPER VS MAPPLET DESIGNER

Ans
TRANSFORMATION DEVELOPER:
 Used to create reusable transformations
 A reusable transformation can be used in multiple mappings

MAPPLET DESIGNER:
 Used to create mapplets
 A mapplet contains a set of transformations and allows you to reuse that transformation logic in multiple mappings

ETLTOOL VS REPORTING TOOL

Ans
ETL TOOL:
 ETL - Extract, Transform, Load data
 Cleansing data, validating data and applying business logic on the data
 Developed by an ETL developer
 Tools: Informatica, DataStage, Oracle Warehouse Builder etc.

REPORTING TOOL:
 Reporting - to generate reports
 Converts the warehoused data into cubes, pie charts, graphs etc.
 Developed by a reporting developer; end users access the data in the form of business reports
 Tools: Business Objects, Hyperion, Cognos etc.

NORMALIZATION VS DENORMALIZATION

Ans
NORMALIZATION:
 Normalization is a gradual process of removing redundancies of attributes in a data structure. The condition of the data at the completion of each step is described as a "normal form."
 Normalization is usually considered a good way to design a database; however, it involves high CPU time to process a query.
 The fact table is in normalized form.

DENORMALIZATION:
 Denormalization is the reverse of normalization, wherein the emphasis is on increasing redundancies. This is done for performance enhancement and reduced query time.
 Dimension tables are in denormalized form.

FACT TABLE VS DIMENSION TABLE

Ans
FACT TABLE:
 A fact table in a data warehouse describes the transaction data; it contains characteristics and key figures (measures).
 In a data model schema, fewer fact tables are observed.

DIMENSION TABLE:
 A table whose entries describe the data in a fact table. Dimension tables contain the data from which dimensions are created; a dimension table is a collection of hierarchies and categories along which the user can drill down and drill up, and it contains only textual attributes.
 In a data model schema, more dimension tables are observed.

RDBMS SCHEMA VS DWH SCHEMA

Ans
RDBMS SCHEMA:
* Used for OLTP systems
* Traditional and old schema
* Normalized
* Difficult to understand and navigate
* Cannot easily solve extract and complex problems
* Poorly modelled

DWH SCHEMA:
* Used for OLAP systems
* New-generation schema
* Denormalized
* Easy to understand and navigate
* Extract and complex problems can be easily solved
* Very good model

There are a lot of factors to be taken into consideration before deciding which database is
better.
If you are talking about OLTP systems then Oracle is far better than Teradata.
Oracle is more flexible in terms of programming; you can write
packages, procedures and functions.
Teradata is useful if you want to generate reports on a very huge database.
But recent versions of Oracle, such as 10g, are quite good and contain a lot of features to
support data warehousing.

 Teradata is an MPP system which can process complex queries very
fast. Another advantage is the uniform distribution of data through unique primary
indexes without any overhead. Recently we had an evaluation with experts from both
Oracle and Teradata for an OLAP system, and they were really impressed with the