
DataStage Best Practices

SCD II implementation in
DataStage 7.X







A Division of Tata Sons Limited
11th Floor, Air India Building, Nariman Point, Mumbai 400021, INDIA

TCS-Internal
Contents
Slowly Changing Dimensions - Overview
Type 1 Slowly Changing Dimension
Type 2 Slowly Changing Dimension
Type 3 Slowly Changing Dimension
Implementing SCD II in DataStage 7.X
Steps to be followed for implementing SCD II


Slowly Changing Dimensions - Overview
The "Slowly Changing Dimension" problem is a common one in data warehousing. In a nutshell, it
applies to cases where an attribute of a record varies over time. Consider the following example:
Christina is a customer of ABC Inc. She first lived in Chicago, Illinois, so the original entry in the
customer lookup table is the following record:
Customer Key  Name       State
1001          Christina  Illinois
In January 2003, she moved to Los Angeles, California. How should ABC Inc. now modify its
customer table to reflect this change? This is the "Slowly Changing Dimension" problem.
There are in general three ways to solve this type of problem, categorized as follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is
treated essentially as two people.
Type 3: The original record is modified to reflect the change.
We next look at each of these scenarios and at how the data model and the data look for each of
them. Finally, we compare and contrast the three alternatives.

Type 1 Slowly Changing Dimension:
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original
information. In other words, no history is kept.
In our example, recall we originally have the following table:
Customer Key  Name       State
1001          Christina  Illinois
After Christina moved from Illinois to California, the new information replaces the original record,
and we have the following table:
Customer Key  Name       State
1001          Christina  California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need
to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For
example, in this case, the company would not be able to know that Christina lived in Illinois before.
Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse
to keep track of historical changes.
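As a rough illustration of the Type 1 behaviour described above, here is a minimal Python sketch. It is not DataStage code: the in-memory dictionary and the helper name apply_type1 are assumptions made only for this example; the customer values come from the example in the text.

```python
# Customer dimension keyed by customer key (values from the example above).
# The dict layout and the helper name apply_type1 are illustrative assumptions.
customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}


def apply_type1(dim, customer_key, new_state):
    """Overwrite the attribute in place; the previous value is not kept."""
    dim[customer_key]["state"] = new_state


apply_type1(customer_dim, 1001, "California")
print(customer_dim)
```

After the call, the only state stored for customer 1001 is California, so the earlier Illinois value cannot be recovered from the table.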

Type 2 Slowly Changing Dimension:
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new
information. Therefore, both the original and the new record will be present. The new record gets its
own primary key.
In our example, recall we originally have the following table:
Customer Key  Name       State
1001          Christina  Illinois
After Christina moved from Illinois to California, we add the new information as a new row into the
table:
Customer Key  Name       State
1001          Christina  Illinois
1005          Christina  California
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table
is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to
track historical changes.
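The Type 2 behaviour can be sketched in the same way. This is a minimal Python illustration, not DataStage code: the list-of-rows layout and the helper name apply_type2 are assumptions for this example, and the keys 1001 and 1005 are the ones used in the table above.

```python
# Customer dimension as a list of rows (values from the example above).
# The row layout and the helper name apply_type2 are illustrative assumptions.
customer_dim = [{"customer_key": 1001, "name": "Christina", "state": "Illinois"}]


def apply_type2(dim, new_key, name, new_state):
    """Append a new row with its own key; existing rows stay untouched."""
    dim.append({"customer_key": new_key, "name": name, "state": new_state})


apply_type2(customer_dim, 1005, "Christina", "California")
states = [row["state"] for row in customer_dim]
print(states)
```

Both the Illinois row and the California row remain in the table, which is what preserves the history and also what makes the table grow.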

Type 3 Slowly Changing Dimension:
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute
of interest, one indicating the original value, and one indicating the current value. There will also be
a column that indicates when the current value becomes active.
In our example, recall we originally have the following table:
Customer Key  Name       State
1001          Christina  Illinois
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
Customer Key
Name
Original State
Current State
Effective Date
After Christina moved from Illinois to California, the original information gets updated, and we have
the following table (assuming the effective date of change is January 15, 2003):
Customer Key  Name       Original State  Current State  Effective Date
1001          Christina  Illinois        California     15-JAN-2003
Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For
example, if Christina later moves to Texas on December 15, 2003, the California information will be
lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type 3 slowly changing dimension should only be used when it is necessary for the data
warehouse to track historical changes, and when such changes will only occur a finite number
of times.
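The Type 3 limitation described above can be seen in a short Python sketch. It is only an illustration under stated assumptions: the field names, the helper name apply_type3, and the choice to start with the current state equal to the original state and no effective date are not from the source; the states and dates are the ones used in the example.

```python
from datetime import date

# One row with the Type 3 columns listed above. Starting with current_state equal
# to original_state and effective_date empty is an assumption for this sketch.
row = {
    "customer_key": 1001,
    "name": "Christina",
    "original_state": "Illinois",
    "current_state": "Illinois",
    "effective_date": None,
}


def apply_type3(record, new_state, effective_date):
    """Update only the current value and its effective date; keep the original value."""
    record["current_state"] = new_state
    record["effective_date"] = effective_date


apply_type3(row, "California", date(2003, 1, 15))
after_first_move = dict(row)   # snapshot after the move to California

apply_type3(row, "Texas", date(2003, 12, 15))
print(after_first_move)
print(row)
```

After the second move the row holds Illinois as the original state and Texas as the current state; California no longer appears anywhere, which is the loss of history noted under Disadvantages.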


Implementing SCD II in DataStage 7.X:



SCD II can be achieved in DataStage with the help of two fields: the effective date (EFCT_DT)
and the expiry date (EXPR_DT). Through these two fields we can keep track of the active records
and the expired records.
To identify the active records, we assign them a standard high-date EXPR_DT. In our example we
use 2999-12-31 as the EXPR_DT for active records, so the active records are found by querying
the table for records with EXPR_DT=2999-12-31.
The expired records are found by querying the table for records with EXPR_DT<>2999-12-31.
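The two queries above can be tried out with a small Python sketch using the standard sqlite3 module. This is only an illustration: the table name CUSTOMER_DIM, the CUST_KEY, NAME and STATE columns, and the sample dates for the Illinois row are assumptions made for this example; EFCT_DT, EXPR_DT and the 2999-12-31 high date are from the text.

```python
import sqlite3

HIGH_DATE = "2999-12-31"  # EXPR_DT value that marks an active record

conn = sqlite3.connect(":memory:")
# Hypothetical dimension table; only EFCT_DT and EXPR_DT are named in the text.
conn.execute(
    """
    CREATE TABLE CUSTOMER_DIM (
        CUST_KEY INTEGER,
        NAME     TEXT,
        STATE    TEXT,
        EFCT_DT  TEXT,
        EXPR_DT  TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO CUSTOMER_DIM VALUES (?, ?, ?, ?, ?)",
    [
        # Sample dates for the expired row are made up for illustration.
        (1001, "Christina", "Illinois", "2000-01-01", "2003-01-14"),
        (1005, "Christina", "California", "2003-01-15", HIGH_DATE),
    ],
)

# Active records: EXPR_DT equals the high date.
active = conn.execute(
    "SELECT CUST_KEY, STATE FROM CUSTOMER_DIM WHERE EXPR_DT = ?", (HIGH_DATE,)
).fetchall()

# Expired records: EXPR_DT differs from the high date.
expired = conn.execute(
    "SELECT CUST_KEY, STATE FROM CUSTOMER_DIM WHERE EXPR_DT <> ?", (HIGH_DATE,)
).fetchall()

print("active:", active)
print("expired:", expired)
conn.close()
```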
The following block diagram shows the layout to implement SCD II in DataStage:

[Figure: block diagram of the SCD II job. Box labels recovered from the diagram:
- Incoming records, which include updated records, insert records and records identical to the
table records.
- Do whatever processing you want for the incoming data.
- Pass all the active records from the table, i.e. all the records with EXPR_DT=2999-12-31.
- Change capture stage to capture the changed records from the input.
- Filter the edit, insert and update records. Records with change code 2 and 3 are passed into
the red arrow flow and records with change code 1 and 3 are passed into the upper arrow flow.
- Look up the target table and get the records to be updated or expired.
- Update or expire the records in the target table.
- Insert the new records into the table.
- Target table.]

Steps to be followed for implementing SCD II:

Read the incoming records through any input stage, such as a sequential file, dataset or table.
Do the required processing for the incoming data.
After the above processing step, pass the data into the Change Capture stage.
The Change Capture stage has two input links: one is the before dataset and the other is the
after dataset. For our job, the before dataset should be the active records present in the
table, that is, all the records with EXPR_DT=2999-12-31. The after dataset is the incoming data
passed into Change Capture after all the necessary processing.
The Change Capture stage compares the before dataset and the after dataset and assigns one of
4 change codes to each record. The 4 change codes are as follows:
o 0 - Copy code (the after record is a copy of the before record)
o 1 - Insert code (a new record has been inserted in the after set that did not exist in the
before set)
o 2 - Delete code (a record in the before set has been deleted from the after set)
o 3 - Edit code (the after record is an edited version of the before record)
The copy records are not passed on from the Change Capture stage, since we need only the
edited and inserted records for the SCD II implementation.
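To make the four change codes concrete, here is a minimal Python sketch that mimics the comparison. It is not the DataStage Change Capture stage itself, only a plain-Python approximation under stated assumptions: the function name change_capture, the drop_copies flag, the business keys (C001 and so on) and all sample rows other than Christina's are hypothetical.

```python
COPY, INSERT, DELETE, EDIT = 0, 1, 2, 3  # change codes as listed in the text


def change_capture(before, after, drop_copies=True):
    """Compare two {key: values} mappings; return (key, values, change_code) tuples."""
    result = []
    for key, values in after.items():
        if key not in before:
            result.append((key, values, INSERT))   # new in the after set
        elif before[key] != values:
            result.append((key, values, EDIT))     # same key, changed values
        elif not drop_copies:
            result.append((key, values, COPY))     # unchanged record
    for key, values in before.items():
        if key not in after:
            result.append((key, values, DELETE))   # missing from the after set
    return result


# Hypothetical business keys and sample rows, for illustration only.
before = {
    "C001": {"name": "Christina", "state": "Illinois"},
    "C002": {"name": "Alex", "state": "Texas"},
    "C003": {"name": "Maria", "state": "Ohio"},
}
after = {
    "C001": {"name": "Christina", "state": "California"},
    "C002": {"name": "Alex", "state": "Texas"},
    "C004": {"name": "Sam", "state": "Nevada"},
}

codes = {key: code for key, _, code in change_capture(before, after)}
print(codes)
```

With drop_copies left at its default, the unchanged record C002 does not appear in the output, which mirrors the note above that copy records are not passed on.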
Use a Filter stage to separate the records that need to be expired from those that need to be
inserted. Route the records with change_code=1 or 3 into the insert link, and the records with
change_code=3 into the update/expiry link.
The records with change_code=3 are edited records, so the original records corresponding to
them must be made inactive (expired). A record becomes inactive when its EXPR_DT is changed
from 2999-12-31 to a valid date. For example, you can set the EXPR_DT to the day before the
date on which you are loading the data into the table. We will assume that we are loading the
data on 2008-08-15, so the EXPR_DT for the inactive records becomes 2008-08-14. The date
2008-08-15 can then be used as the EFCT_DT for the records to be inserted.
To get the original records that need to be expired, look up the target table for all the
records with change_code=3 that were filtered out separately. Get the original record along
with its EFCT_DT, then update that record's EXPR_DT to 2008-08-14 in the table. The original
records are now inactive (expired).
The new updated records (change_code=3) need to be in the table along with the new insert
records (change_code=1). These records come from the Filter stage and are inserted into the
table with EFCT_DT set to the date of loading, i.e. 2008-08-15, and EXPR_DT=2999-12-31.
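The steps above can be pulled together in a minimal Python sketch of one SCD II load. It is an illustration under stated assumptions, not the DataStage job: the function name scd2_load, the CUST_ID and STATE columns, the in-memory list standing in for the target table, and the sample rows are all hypothetical. The load date 2008-08-15, the expiry date 2008-08-14 and the high date 2999-12-31 are the ones used in the text. Records deleted from the input (change code 2) are left untouched here, since the steps above route only codes 1 and 3.

```python
from datetime import date, timedelta

HIGH_DATE = date(2999, 12, 31)  # EXPR_DT marker for active rows, as in the text


def scd2_load(table, incoming, load_date):
    """Apply one SCD II load to `table` (a list of row dicts), in place."""
    expiry = load_date - timedelta(days=1)  # the day before the load date
    # "Before" set: only the active rows, indexed by the (hypothetical) business key.
    active = {r["CUST_ID"]: r for r in table if r["EXPR_DT"] == HIGH_DATE}
    for cust_id, state in incoming.items():
        current = active.get(cust_id)
        if current is not None and current["STATE"] == state:
            continue  # copy (code 0): nothing to do
        if current is not None:
            current["EXPR_DT"] = expiry  # edit (code 3): expire the old version
        # Insert (code 1) and edit (code 3) both add a new active row.
        table.append(
            {"CUST_ID": cust_id, "STATE": state, "EFCT_DT": load_date, "EXPR_DT": HIGH_DATE}
        )


# Sample target table and incoming data, made up for illustration.
table = [
    {"CUST_ID": "C001", "STATE": "Illinois", "EFCT_DT": date(2008, 1, 1), "EXPR_DT": HIGH_DATE},
    {"CUST_ID": "C002", "STATE": "Texas", "EFCT_DT": date(2008, 1, 1), "EXPR_DT": HIGH_DATE},
]
incoming = {"C001": "California", "C002": "Texas", "C003": "Nevada"}

scd2_load(table, incoming, date(2008, 8, 15))
for r in table:
    print(r["CUST_ID"], r["STATE"], r["EFCT_DT"], r["EXPR_DT"])
```

In this run the Illinois row for C001 is expired with EXPR_DT 2008-08-14, a new California row is added with EFCT_DT 2008-08-15, the unchanged C002 row is left alone, and C003 is added as a new active row.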
