
ETL Software Development

ITETL0001
Process and ETL Toolkit

TOOLKIT – Part 1

Dell Confidential
ETL

• ETL - EXTRACT -> TRANSFORM -> LOAD


– Extract
• Data is extracted from the source system.
• Typically done by the source system owner, not by our team.
• The end result is flat files stored on our central NFS server.
– Transform
• Transformation is the application of business rules and other modifications to facilitate reporting.
• This is the main purpose of the ETL team.
• Ab Initio is the proprietary software package we use for transformation.
– Load
• Data is loaded into the reporting database (typically Teradata, abbreviated TD).
• Other manipulation or summarization (rollups) might be done within the database.

Application of Ab Initio

• Transformation of disparate sources


• Aggregation
• Referential integrity checking
• Transformations for business rules
• Database loading
• Aggregation for mart tables
• Extraction

GDE and Co-Operating System

• GDE (1.13.7)
• Co-Operating System (2.13)
– Sun Solaris
– Red Hat Linux
– IBM AIX
– Windows NT
– HP-UX
– Compaq Tru64 UNIX
– IBM DYNIX/ptx
– IBM OS/390
– Silicon Graphics IRIX
– NCR MP-RAS

Databases

• Oracle
• Sybase
• Teradata
• MS SQL Server 7

Co-Operating System Services

• Parallel and distributed application execution.


• Transactional semantics at the application level (not in the database)
• Checkpointing
• Monitoring and debugging
• Parallel file management
• Parameter-driven components

Lab 1: Setting up Linux directories

• Create your directory under /usr/dell/abinitio/training on abidev01, using your NT account:

cd /usr/dell/abinitio/training

mkdir your_name

• Create a data directory under /serial/data/ddw/serial/training/:

cd /serial/data/ddw/serial/training

mkdir your_name

cd your_name

mkdir lab1

Lab 1: Logon via GDE

Create a login account from GDE

• host=abidev01

• login=your NT account

• host_directory=/usr/dell/abinitio/training/your_name

• location=/usr/dell/abinitio/abinitio

Lab 1: Create an Empty Sandbox

• Create an empty sandbox


Projects->create sandbox
Directory=/usr/dell/abinitio/training/your_name/lab1

• Go to the sandbox directory in Linux and run ls -l. The sandbox contains these directories:

xfr
run
mp
dml
db

** The goal is a development environment in which a graph, or a set of graphs, can be migrated to any other environment or location without changes.
** Save the graph as build_customer.mp in your mp directory.

Lab 1: Collection Files

• Collection files are located in /usr/dell/abinitio/training/collection


• customer_header.dat
cust_number
cust_name
birth_year
birth_month
birth_day

• customer_detail.dat
cust_number
cust_address
cust_zip
address_type

Lab 1: Edit Sandbox

• Edit the sandbox to define the sandbox variables


Project->Edit Sandbox

Add the AI_ prefix to all the sandbox directories.

Define AI_SERIAL_OUT_DATA=/serial/data/ddw/serial/training

Define COLL_HOME=/usr/dell/abinitio/training/collection

The sandbox directory parameters are:

$AI_RUN – run directory; usually contains the deployed shell scripts

$AI_DML – record format files

$AI_XFR – transform files

$AI_MP – graphs

$AI_DB – database config files
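The portability goal behind these parameters can be illustrated outside Ab Initio. The following is a minimal Python sketch, not GDE behaviour: it assumes each AI_ parameter points at the matching subdirectory of the lab sandbox, and the `resolve` helper is a hypothetical name introduced only for illustration.

```python
from string import Template

# Assumption: each AI_ parameter maps to the matching sandbox subdirectory.
# "your_name" is the placeholder used throughout this lab.
SANDBOX = "/usr/dell/abinitio/training/your_name/lab1"

params = {
    "AI_RUN": f"{SANDBOX}/run",
    "AI_DML": f"{SANDBOX}/dml",
    "AI_XFR": f"{SANDBOX}/xfr",
    "AI_MP": f"{SANDBOX}/mp",
    "AI_DB": f"{SANDBOX}/db",
    "AI_SERIAL_OUT_DATA": "/serial/data/ddw/serial/training",
    "COLL_HOME": "/usr/dell/abinitio/training/collection",
}


def resolve(path, parameters):
    """Expand $NAME references in a path (hypothetical helper)."""
    return Template(path).substitute(parameters)


header_url = resolve("$COLL_HOME/customer_header.dat", params)
graph_path = resolve("$AI_MP/build_customer.mp", params)
print(header_url)
print(graph_path)
```

Because the graph only ever refers to `$COLL_HOME` or `$AI_MP`, moving it to another environment means changing the parameter values, not the graph.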

Lab 1: Transform Requirements

• Lab 1 transformation requirements:

1. Customer header records should be de-duplicated on customer number.

2. All leading zeros should be removed from the customer number.

3. The final file should be a left outer join of customer header and customer detail.

4. In the final file, the date fields should be consolidated into a single field in YYYYMMDD format.

5. The final file should have all the columns from header and detail.

Lab 1: DML for Customer Header

Add an Input File component with its URL pointing to $COLL_HOME/customer_header.dat and its dml port set to the DML below:

record

string("~|") cust_number = NULL("");

string("~|") cust_name = NULL("");

string("~|") birth_year = NULL("");

string("~|") birth_month = NULL("");

string("\n") birth_day = NULL("");

end;
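To make the record layout concrete, here is a small Python sketch of how one line of such a file could be read. It assumes that `string("~|")` means a field terminated by the two-character sequence `~|`, and that `NULL("")` means an empty field is treated as NULL. The sample line and the `parse_delimited` helper are made up for illustration.

```python
HEADER_FIELDS = ["cust_number", "cust_name", "birth_year", "birth_month", "birth_day"]


def parse_delimited(line, field_names, delimiter="~|"):
    """Split one newline-terminated record into a dict.

    Empty strings become None, mirroring the NULL("") defaults in the DML
    (an assumption about how those defaults are meant to be read).
    """
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(field_names):
        raise ValueError(f"expected {len(field_names)} fields, got {len(values)}")
    return {
        name: (value if value != "" else None)
        for name, value in zip(field_names, values)
    }


# Made-up sample data, not taken from the real collection file.
sample = "000123~|Jane Doe~|1975~|3~|9\n"
record = parse_delimited(sample, HEADER_FIELDS)
print(record)
```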

Lab 1: DML for Customer Detail

Add an Input File component with its URL pointing to $COLL_HOME/customer_detail.dat and its dml port set to the DML below:

record

string("~|") cust_number = NULL("");

string("~|") cust_address = NULL("");

string("~|") cust_zip = NULL("");

string("~|") address_type = NULL("");

string(1) newline;

end;

??? Why is the newline defined explicitly in the customer detail DML but not in the customer header DML?
Lab 1: Reformat

• Add a Reformat for customer header whose output has the date consolidated as YYYYMMDD

OUTPUT DML:

record

string("~|") cust_number = NULL("");

string("~|") cust_name = NULL("");

string("\n") date_of_birth = NULL("");

end;

• Add a Reformat for customer detail to remove the leading zeros from the customer number.
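The two Reformat rules can be sketched as plain functions. This is Python for illustration, not Ab Initio transform code. Two choices here are assumptions rather than lab requirements: a date with any missing part becomes None, and an all-zero customer number is kept as "0".

```python
def consolidate_date(year, month, day):
    """Return the date as YYYYMMDD, zero-padding month and day.

    Assumption: if any part is missing, the consolidated date is None.
    """
    if not (year and month and day):
        return None
    return f"{int(year):04d}{int(month):02d}{int(day):02d}"


def strip_leading_zeros(cust_number):
    """Remove leading zeros from a customer number held as a string.

    Assumption: a number made only of zeros is kept as "0".
    """
    stripped = cust_number.lstrip("0")
    return stripped if stripped else "0"


print(consolidate_date("1975", "3", "9"))
print(strip_leading_zeros("000123"))
```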
Lab 1: Sort and Dedup

• Sort and Dedup customer header on customer number

• Sort customer detail on customer number

The Sort and Dedup keys should be the same. Why? How will this behave in a multifile system?
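One way to reason about the key question is a sort followed by a de-duplication pass that only compares neighbouring records. The Python sketch below is an illustration under that assumption, not the Ab Initio components themselves; keeping the first record of each group is also an assumption.

```python
from itertools import groupby
from operator import itemgetter


def sort_and_dedup(records, key="cust_number"):
    """Sort on the key, then keep the first record of each key group.

    groupby only merges adjacent records, so the de-duplication key has to
    match the sort key for duplicates to sit next to each other.
    """
    ordered = sorted(records, key=itemgetter(key))  # stable sort
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


# Made-up sample records.
headers = [
    {"cust_number": "2", "cust_name": "B"},
    {"cust_number": "1", "cust_name": "A"},
    {"cust_number": "2", "cust_name": "C"},
]
deduped = sort_and_dedup(headers)
print(deduped)
```

The same reasoning suggests what to check for partitioned data: duplicates can only be removed if records with the same key end up in the same partition before the de-duplication step runs.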

Lab 1: Join

• Do a left outer join on customer number for the header and detail records.
The final file should have all the columns from header and detail:
record
string("~|") cust_number = NULL("");
string("~|") cust_name = NULL("");
string("~|") date_of_birth = NULL("");
string("~|") cust_address = NULL("");
string("~|") cust_zip = NULL("");
string("~|") address_type = NULL("");
string(1) newline;
end;
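The intended result of the join can be sketched in Python. This is a minimal illustration, not the Join component: every header record appears in the output, and the detail columns are None when no detail record matches. The sample data is made up, and representing missing detail values as None is an assumption.

```python
DETAIL_ONLY_FIELDS = ("cust_address", "cust_zip", "address_type")


def left_outer_join(headers, details, key="cust_number"):
    """Join header and detail records, keeping unmatched header records."""
    details_by_key = {}
    for detail in details:
        details_by_key.setdefault(detail[key], []).append(detail)

    empty_detail = {name: None for name in DETAIL_ONLY_FIELDS}
    joined = []
    for header in headers:
        # A header with no detail still produces one output row.
        for detail in details_by_key.get(header[key]) or [empty_detail]:
            row = dict(header)
            row.update({name: detail[name] for name in DETAIL_ONLY_FIELDS})
            joined.append(row)
    return joined


# Made-up sample records.
headers = [
    {"cust_number": "1", "cust_name": "A", "date_of_birth": "19750309"},
    {"cust_number": "2", "cust_name": "B", "date_of_birth": "19801225"},
]
details = [
    {"cust_number": "1", "cust_address": "1 Main St", "cust_zip": "78701", "address_type": "H"},
]
rows = left_outer_join(headers, details)
for row in rows:
    print(row)
```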

Lab 1: More Thoughts on Join

• How will you do an inner join?

• How will you do a full outer join?

• How will you do a right outer join?

• How can you override the join keys?

• How can you manipulate the output through join transform?

• Why is “\n” hard coded?
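As a starting point for the first three questions, the join types differ only in which unmatched records are kept. The Python sketch below illustrates that idea with two hypothetical flags, `keep_left` and `keep_right`; these names are assumptions for illustration and are not Ab Initio parameter names.

```python
def join(left, right, key, keep_left=False, keep_right=False):
    """Return (left_record, right_record) pairs; None marks a missing side.

    inner:       keep_left=False, keep_right=False
    left outer:  keep_left=True,  keep_right=False
    right outer: keep_left=False, keep_right=True
    full outer:  keep_left=True,  keep_right=True
    """
    right_by_key = {}
    for right_rec in right:
        right_by_key.setdefault(right_rec[key], []).append(right_rec)

    matched_keys = set()
    pairs = []
    for left_rec in left:
        matches = right_by_key.get(left_rec[key], [])
        if matches:
            matched_keys.add(left_rec[key])
            pairs.extend((left_rec, right_rec) for right_rec in matches)
        elif keep_left:
            pairs.append((left_rec, None))
    if keep_right:
        pairs.extend(
            (None, right_rec) for right_rec in right if right_rec[key] not in matched_keys
        )
    return pairs


# Made-up sample records: key 2 matches, key 1 and key 3 do not.
left = [{"k": 1}, {"k": 2}]
right = [{"k": 2}, {"k": 3}]
print(len(join(left, right, "k")))
print(len(join(left, right, "k", keep_left=True, keep_right=True)))
```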

Lab 1: Loading in TD

CREATE MULTISET TABLE test1.lab1_your_name,
    NO FALLBACK,
    NO BEFORE JOURNAL,
    NO AFTER JOURNAL
(
    CUST_NUMBER   INTEGER NOT NULL,
    CUST_NAME     VARCHAR(18) CHARACTER SET LATIN NOT CASESPECIFIC,
    DATE_OF_BIRTH DATE FORMAT 'yyyy-mm-dd',
    CUST_ADDRESS  VARCHAR(40) CHARACTER SET LATIN NOT CASESPECIFIC,
    CUST_ZIP      INTEGER,
    ADDRESS_TYPE  CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC
)
PRIMARY INDEX XNUP_lab1_your_name (cust_number);

Lab 1: Loading in TD

UTILITIES:

• td_dml_generator: generates the Ab Initio DML for a Teradata table. Copy the utility from /usr/dell/abinitio/training/utils/td_dml_generator.ksh into your local utils directory.

• getpasswd: returns the password for an Oracle username on a particular instance. Configuration Management has to add an entry for the username, and you need permission to access the password.

• gettdpasswd: does the same as getpasswd, for Teradata.

Example: gettdpasswd ddwdev us_svc_tag_etl

Lab 1: Loading in TD

Accessing the DDW developed TD components

To access Teradata we use our own Ab Initio components. To use them, add their folder to the Component Organizer in GDE.

• Right Click in Component Organizer.

• Select New->Top Level Folder

• In the box, select HOST and give the path /usr/dell/abinitio/Teradata/components

Lab 1: Loading in TD

Load the final file into the Teradata table test1.lab1_your_name:

• Generate the DML from the table using td_dml_generator.ksh and name it r_customer.dml

• Add a Reformat before loading to change the delimiter to Ç and the date format to YYYY-MM-DD

• Use DDW_INSERT to load the data

• Add TD_LOGON in the sandbox

• Execute .profile from the settings or the start script

• Save it in a separate graph called upload_customer.mp
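The pre-load Reformat does two things: it rewrites the date and it switches the field delimiter. A minimal Python sketch of both steps follows. It assumes NULL fields are written as empty strings and leaves the record terminator out, since the lab does not specify either; the helper names are hypothetical.

```python
from datetime import datetime

LOAD_FIELDS = [
    "cust_number", "cust_name", "date_of_birth",
    "cust_address", "cust_zip", "address_type",
]


def reformat_date(yyyymmdd):
    """Convert YYYYMMDD to YYYY-MM-DD; empty or missing dates become ''."""
    if not yyyymmdd:
        return ""
    return datetime.strptime(yyyymmdd, "%Y%m%d").strftime("%Y-%m-%d")


def to_load_line(record, field_order=LOAD_FIELDS, delimiter="Ç"):
    """Build one delimited line for loading (assumption: None is written as '')."""
    values = dict(record)
    values["date_of_birth"] = reformat_date(values.get("date_of_birth"))
    return delimiter.join(
        "" if values.get(name) is None else str(values[name]) for name in field_order
    )


# Made-up sample record.
row = {
    "cust_number": "123", "cust_name": "Jane Doe", "date_of_birth": "19750309",
    "cust_address": "1 Main St", "cust_zip": "78701", "address_type": "H",
}
line = to_load_line(row)
print(line)
```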

Lab 1: Order of parameter evaluation

1. The host setup script is run
2. Common project parameters
3. Project/sandbox parameters
4. Graph parameters
5. Graph start script parameter
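The practical effect of this ordering can be sketched in Python. The sketch assumes that a value defined at a later stage may refer to values from earlier stages and may override them; that is an assumption about the evaluation model, not documented behaviour. The parameter names TRAINING_ROOT, DATA_ROOT, and OUTPUT_FILE are hypothetical.

```python
from string import Template


def evaluate_layers(layers):
    """Evaluate parameter layers in order, expanding $NAME references
    against everything resolved so far (assumed model)."""
    resolved = {}
    for _layer_name, definitions in layers:
        for name, raw_value in definitions.items():
            resolved[name] = Template(raw_value).substitute(resolved)
    return resolved


# Hypothetical parameters, listed in the evaluation order from the slide.
layers = [
    ("host setup script", {"TRAINING_ROOT": "/usr/dell/abinitio/training"}),
    ("common project parameters", {"DATA_ROOT": "/serial/data/ddw/serial"}),
    ("project/sandbox parameters", {
        "COLL_HOME": "$TRAINING_ROOT/collection",
        "AI_SERIAL_OUT_DATA": "$DATA_ROOT/training",
    }),
    ("graph parameters", {"OUTPUT_FILE": "$AI_SERIAL_OUT_DATA/your_name/lab1/customer.dat"}),
    ("graph start script", {"OUTPUT_FILE": "$AI_SERIAL_OUT_DATA/your_name/lab1/customer_rerun.dat"}),
]

resolved = evaluate_layers(layers)
print(resolved["COLL_HOME"])
print(resolved["OUTPUT_FILE"])
```

Under this model a reference to a parameter that is only defined at a later stage fails, which is one reason the order matters when deciding where to define a value.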
