
Effective Usage of Flat Files with Minimum Size

Ramadevi Chepuru January, 2008

About the Author: Ramadevi Chepuru is a senior Informatica and data warehouse programmer. She is a certified Oracle DBA professional and has more than 10 years of data warehouse and Informatica ETL experience. Ramadevi has multidisciplinary skills in Information Technology and in Organic Chemistry. She holds a doctorate (Ph.D.) in Organic Chemistry and has published a number of papers and patents internationally. She has also written white papers in Information Technology. Currently, she is working for Tek Systems at an AT&T client as a Senior Technical Architect. Ramadevi has written this white paper primarily for Informatica ETL developers and data warehouse programmers.

Introduction: In a data warehouse environment, the most common tasks for an ETL programmer are generating a flat file output and loading data from a flat file source. ETL programmers can easily generate a staged target file using an Informatica session and compress it afterwards to minimize file size. However, this default process of generating an uncompressed target file is not always suitable when handling large amounts of source data, especially if you don't have enough space on the UNIX file system. This paper provides an alternative: creating the compressed target file directly from an Informatica session.

Likewise, loading flat file source data into a target with a default Informatica session requires uncompressed source flat files. This default process is not practical if you need to handle large source files and don't have enough space on the UNIX file system to uncompress them before loading. Again, this paper provides an alternative: using the compressed source file directly to load the target through an Informatica session.

Environment / Compatibility: This strategy was tested with Informatica 7.1.1 and 8.1.4, deployed on a UNIX server (Sun Solaris, Version 8). The source and target databases are on Oracle 9i. The same 12.5 million source records were used to test the different scenarios described in this paper.

Process Overview: This paper describes how to generate a compressed target file directly from Informatica. The same logic is then extended to generate multiple compressed target files using Informatica session partitioning to improve load performance. The paper also describes in detail how to use a compressed source file directly through Informatica, and extends that logic to multiple compressed source files combined with session partitioning, again to improve load performance. The overall methodology is based on a combination of well-known UNIX commands and Informatica session features. These concepts are highly useful for Informatica ETL programmers and data warehouse programmers.
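The UNIX side of this methodology, a named pipe read by a background compressor, can be tried in isolation before involving Informatica. The following is a minimal sketch under stated assumptions, not the author's script: the scratch directory and the name demo_target.out are hypothetical, a shell loop stands in for the Informatica session, the POSIX mkfifo command is used in place of mknod, and gzip is used as a fallback when compress is not installed.

```shell
#!/bin/sh
# Sketch only: paths and names below are hypothetical, not from the paper.
set -eu

WORK_DIR=$(mktemp -d)             # scratch directory for the demo
PIPE="$WORK_DIR/demo_target.out"  # name the writer treats as a plain file

# Assumption: use compress(1) as in the paper when installed, else gzip(1).
if command -v compress >/dev/null 2>&1; then
    PACK="compress -c"; UNPACK="uncompress -c"; EXT=Z
else
    PACK="gzip -c"; UNPACK="gzip -dc"; EXT=gz
fi

mkfifo "$PIPE"                    # POSIX equivalent of "mknod <name> p"

# Background reader: blocks until a writer opens the pipe, then compresses
# everything it receives until the writer closes the pipe.
$PACK < "$PIPE" > "$PIPE.$EXT" &
PACK_PID=$!

# Stand-in for the Informatica session writing its target file.
i=1
while [ "$i" -le 1000 ]; do
    echo "record $i"
    i=$((i + 1))
done > "$PIPE"

wait "$PACK_PID"                  # compressor exits at end-of-file

# Verify that every record survived the round trip.
LINES_OUT=$($UNPACK < "$PIPE.$EXT" | wc -l | tr -d ' ')
echo "records recovered: $LINES_OUT"
```

The writer never learns that its "file" is a pipe, which is why the session definition itself does not need to change; only the pre-session script does.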

Generate Compressed Target Flat Files Directly:

Scenario 1: A straightforward Informatica workflow was run to generate a staged target output file on the UNIX server by reading source data from an Oracle table. This process took nearly 12 min. to load 12.5 million source records and created a target flat file of 1385 MB.

Scenario 2: To minimize the target flat file size, a command task was added to the workflow as a pre-session task. The command task calls a shell script on the UNIX server. The script creates a named pipe with the same name as the target file specified in the Informatica session, and starts a background compress process that reads from the pipe and writes a compressed target flat file. This workflow ran in 13 min. to load the same 12.5 million source records, but directly created a compressed target file of 144 MB. While the workflow runs, one can see the named pipe and the growing compressed target flat file. At the end of the workflow, the background compress process exits, leaving the named pipe and the compressed target flat file on the UNIX server. The shell script used in the command task is as follows:

    #!/bin/ksh
    #######################################################
    # Author: Ramadevi Chepuru
    # Date: January, 2008
    # Purpose: To create pipe file and background compress process
    #######################################################
    TGT_DIR=/sites/app/qatport/temp/TgtFiles
    # Remove leftovers from a previous run (-f: no error if absent)
    rm -f $TGT_DIR/FlatFile_Target.out
    rm -f $TGT_DIR/FlatFile_Target.out.Z
    # Create a pipe file
    /usr/sbin/mknod $TGT_DIR/FlatFile_Target.out p
    # Start background compress process
    nohup compress < $TGT_DIR/FlatFile_Target.out > $TGT_DIR/FlatFile_Target.out.Z &
    exit 0

Scenario 3: In this scenario, the methodology was extended to generate multiple compressed target files directly, in combination with the Informatica session partitioning feature. This gives the advantage of both minimum file size and better load performance. Because the source table is partitioned, the session was configured with key range partitioning at the source qualifier partition point and pass-through partitioning at the target partition point. The workflow has a command task as a pre-session task, which starts a shell script similar to the one in Scenario 2 with a small modification: it creates 3 pipe files (with the same names as the target files specified in the session) and starts 3 corresponding background compress processes. While the workflow runs, one can see the three named pipe files and their corresponding growing compressed files.
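The paper describes the three-pipe variant of the Scenario 2 script but does not list it. One possible shape is sketched below under stated assumptions: the target names FlatFile_Target1.out to FlatFile_Target3.out are hypothetical and would have to match the names configured in the session, the directory defaults to a scratch location so the sketch can run anywhere, mkfifo replaces mknod, and gzip is a fallback when compress is not installed.

```shell
#!/bin/sh
# Sketch of a Scenario 3 style pre-session script. File names are
# hypothetical; the paper does not list the three target file names.
set -eu

# Assumption: default to a scratch directory so the sketch runs anywhere;
# on a real server this would be the session's target file directory.
TGT_DIR=${TGT_DIR:-$(mktemp -d)}

# Assumption: compress(1) as in the paper when present, else gzip(1).
if command -v compress >/dev/null 2>&1; then
    PACK=compress; EXT=Z
else
    PACK=gzip; EXT=gz
fi

for n in 1 2 3; do
    tgt="$TGT_DIR/FlatFile_Target$n.out"
    rm -f "$tgt" "$tgt.$EXT"      # clear leftovers from a previous run
    mkfifo "$tgt"                 # one named pipe per session partition
    # One background compressor per pipe; each one exits when the
    # writer of its pipe closes the file.
    nohup $PACK < "$tgt" > "$tgt.$EXT" 2>/dev/null &
done
```

The script returns immediately. Each compressor stays blocked until its partition starts writing, so the order in which the partitions open their target files does not matter.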

The workflow took nearly 7 min. (vs. the previous 12 min.) to create 3 compressed target flat files with a total size of 143 MB.

Utilize Compressed Source Flat Files Directly:

Scenario 4: A straightforward Informatica workflow took nearly 12 min. to load data from a staged source flat file of 1385 MB into the target Oracle table.

Scenario 5: In this scenario, a compressed source file of 144 MB was used directly to load data into the target table. To do this, a command task was added to the workflow as a pre-session task. The command task starts a shell script on the UNIX server. The script creates a named pipe with the same name as the source file specified in the Informatica session, and then starts a background uncompress process that reads the original compressed source file and writes to the pipe. At the end of the workflow, the background uncompress process exits, but the named pipe and the compressed source flat file remain on the server. This workflow ran in 12 min. to load the same number of source records as in Scenario 4. Note that this scenario started with a compressed source file of 144 MB, vs. the 1385 MB uncompressed source file in Scenario 4. The script used in the command task is as follows:

    #!/bin/ksh
    #######################################################
    # Author: Ramadevi Chepuru
    # Date: January, 2008
    # Purpose: To create pipe file and background uncompress process
    #######################################################
    SRC_DIR=/sites/app/qatport/temp/SrcFiles
    # Remove a leftover pipe from a previous run (-f: no error if absent)
    rm -f $SRC_DIR/FlatFile_Source.out
    # Create a pipe file
    /usr/sbin/mknod $SRC_DIR/FlatFile_Source.out p
    # Start background uncompress process
    nohup uncompress < $SRC_DIR/FlatFile_Source.out.Z > $SRC_DIR/FlatFile_Source.out &
    exit 0

Scenario 6: In this scenario, the methodology was extended to use multiple compressed source files directly, together with Informatica partitioning, to load the target Oracle table with improved load performance. The total size of the three compressed source files is 143 MB. The Informatica session is configured with pass-through partitioning at the source qualifier partition point and round-robin partitioning at the target table partition point. This workflow also has a command task before the session task, which starts a shell script similar to the one in Scenario 5 with a small modification: it creates 3 pipe files (with the same names as the source files specified in the session) and starts 3 background uncompress processes. The workflow ran in about 7 min. to load 12.5 million source records using the three compressed source files. At the end of the workflow, the background uncompress processes exit, and one can see the three named pipe files and their corresponding compressed files on the server.
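The three-source variant of the Scenario 5 script is likewise described but not listed in the paper. The sketch below is one way it might look, under stated assumptions: the names FlatFile_Source1.out to FlatFile_Source3.out and the sample rows are hypothetical, the directory defaults to a scratch location, mkfifo replaces mknod, gzip is a fallback when compress is not installed, and a wc -l loop stands in for the partitioned Informatica session.

```shell
#!/bin/sh
# Sketch of a Scenario 6 style pre-session script plus a stand-in reader.
# File names and sample data are hypothetical.
set -eu

SRC_DIR=${SRC_DIR:-$(mktemp -d)}  # assumption: scratch dir for the demo

# Assumption: compress(1)/uncompress(1) as in the paper, else gzip(1).
if command -v compress >/dev/null 2>&1; then
    PACK="compress -c"; UNPACK="uncompress -c"; EXT=Z
else
    PACK="gzip -c"; UNPACK="gzip -dc"; EXT=gz
fi

# Demo only: fabricate three compressed source files of 500 rows each.
for n in 1 2 3; do
    awk -v p="$n" 'BEGIN { for (i = 1; i <= 500; i++) print "part" p ",row" i }' \
        | $PACK > "$SRC_DIR/FlatFile_Source$n.out.$EXT"
done

# Pre-session part: one named pipe and one background uncompress per file.
for n in 1 2 3; do
    src="$SRC_DIR/FlatFile_Source$n.out"
    rm -f "$src"                  # clear a leftover pipe from a prior run
    mkfifo "$src"
    nohup $UNPACK < "$src.$EXT" > "$src" 2>/dev/null &
done

# Stand-in for the session: read each pipe as if it were a flat file.
TOTAL=0
for n in 1 2 3; do
    rows=$(wc -l < "$SRC_DIR/FlatFile_Source$n.out" | tr -d ' ')
    TOTAL=$((TOTAL + rows))
done
wait                              # all uncompress processes have exited
echo "rows read: $TOTAL"
```

At no point does an uncompressed copy of the data exist on disk; the rows flow from each compressed file through its pipe to the reader, which is the space saving the scenario relies on.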

Conclusion: This paper shows how to use compressed source files, and how to generate compressed target files, directly from Informatica sessions by combining well-known UNIX commands with Informatica session features.