
http://datastageinfosoft.blogspot.ca/2010/09/datastage-81-interview-questions.html

ASCENTIAL DATASTAGE 7.5

1. What are other performance tunings you have done in your last project to increase the performance of slowly running jobs?

1) Minimize the usage of the Transformer (instead use Copy, Modify, Filter, Row Generator).
2) Use SQL code while extracting the data; handle the nulls and minimize the warnings.
3) Reduce the number of lookups in a job design; use not more than 20 stages in a job.
4) Use an IPC stage between two passive stages to reduce processing time.
5) Drop indexes before loading data and recreate them after loading the data into the tables.
6) There is no hard limit on the number of stages (such as 20 or 30), but we can break the job into small jobs and use Dataset stages to store the intermediate data.
7) Check the write cache of the hash file. If the same hash file is used for lookup as well as target, disable this option. If the hash file is used only for lookup, then enable "Preload to memory"; this will improve the performance. Also check the order of execution of the routines.
8) Don't use more than 7 lookups in the same Transformer; introduce a new Transformer if it exceeds 7 lookups.
9) Use the "Preload to memory" option on the hash file output.
10) Use "Write to cache" on the hash file input.
11) Write into the error tables only after all the Transformer stages.
12) Reduce the width of the input record - remove the columns that you would not use.
13) Cache the hash files you are reading from and writing into. Make sure your cache is big enough to hold the hash files (use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files); this also minimizes overflow on the hash file.
14) If possible, break the input into multiple threads and run multiple instances of the job.

15) Stage the data coming from ODBC/OCI/DB2UDB stages (or any database) on the server using hash/sequential files, for optimum performance and for data recovery in case the job aborts.

16) Tune the OCI stage's 'Array Size' and 'Rows per Transaction' numeric values for faster inserts, updates and selects.
17) Tune the 'Project Tunables' in the Administrator for better performance.
18) Sort the data as much as possible in the database and reduce the use of DS sorts; use sorted data for the Aggregator.
19) Remove unused data from the source as early as possible in the job.
20) Work with the DBA to create appropriate indexes on tables for better performance of DS queries.
21) Convert some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
22) If an input file has an excessive number of rows and can be split up, use standard logic to run jobs in parallel.
23) Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.
24) Try to have the constraints in the 'Selection' criteria of the jobs themselves. This will eliminate unnecessary records before joins are made.
25) Tuning should occur on a job-by-job basis.
26) Using a constraint to filter a record set is much slower than performing a SELECT ... WHERE.
27) Make every attempt to use the bulk loader for your particular database; bulk loaders are generally faster than ODBC or OCI.

28) Before writing a routine or a transform, make sure that the functionality required is not already available in one of the standard routines supplied in the SDK or DS utilities categories.
29) Use the power of the DBMS.
30) Try not to use a Sort stage when you can use an ORDER BY clause in the database.

2. How can I extract data from DB2 (on IBM iSeries) to the data warehouse via DataStage as the ETL tool? I mean, do I first need to use ODBC to create connectivity and use an adapter for the extraction and transformation of data?

You would need to install ODBC drivers to connect to the DB2 instance (they do not come with the regular drivers that we try to install; use the CD provided for the DB2 installation, which has the ODBC drivers to connect to DB2) and then try it out.

3. What is DS Designer used for - did you use it?

You use the Designer to build jobs by creating a visual design that models the flow and transformation of data from the data source to the target warehouse. The Designer graphical interface lets you select stage icons, drop them onto the Designer work area, and add links.

4. How to improve the performance of a hash file?

You can improve the performance of a hashed file by:
1. Preloading the hash file into memory - this can be done by enabling the preload option in the hash file output stage.
2. Write caching options - data is written into a cache before being flushed to disk. You can enable this to ensure that hash file rows are written in order onto the cache before being flushed to disk, instead of in the order in which individual rows are written.
3. Pre-allocating - estimating the approximate size of the hash file so that the file need not be split too often after write operations.

5. How can we pass parameters to a job by using a file?

You can do this by reading the parameter values from a UNIX file and then calling the execution of the DataStage job; the job has the parameters defined, and their values are passed in from UNIX, as in the sketch below.
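A minimal shell sketch of this approach, assuming a parameter file of name=value pairs; the file path, project name and job name (MyProject, MyJob) are placeholders, and the parameter names in the file must match those defined in the job:

    #!/bin/sh
    # Build "-param name=value" arguments from a UNIX file and run the job.
    PARAM_FILE=/data/params/myjob.params     # placeholder path

    PARAM_ARGS=""
    while IFS='=' read -r name value
    do
        PARAM_ARGS="$PARAM_ARGS -param $name=$value"
    done < "$PARAM_FILE"

    # dsjob lives in $DSHOME/bin on the DataStage server
    $DSHOME/bin/dsjob -run -jobstatus $PARAM_ARGS MyProject MyJob

Values containing spaces would need extra quoting; this only sketches the idea.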

6. What is a project? Specify its various components.

You always enter DataStage through a DataStage project; when you start a DataStage client you are prompted to connect to a project. A project contains DataStage jobs plus built-in and user-defined components.

7. How can you implement slowly changing dimensions in DataStage? Explain.

8. What are built-in components and user-defined components?

Built-in components are predefined components used in a job. User-defined components are customized components created using the DataStage Manager or DataStage Designer.

9. Can you join a flat file and a database in DataStage? How?

Yes, we can do it in an indirect way. First create a job which populates the data from the database into a sequential file, say Seq_First1. Take the flat file which you have and use a Merge stage to join the two files. You have various join types in the Merge stage, such as pure inner join, left outer join and right outer join; you can use whichever suits your requirements.

10. Can anyone tell me how to extract data from more than one heterogeneous source - for example, a sequential file, Sybase and Oracle - in a single job?

Yes, you can extract data from heterogeneous sources in DataStage using the Transformer stage; it is simple, you just bring a link from each source into the Transformer stage.

11. Will DataStage consider the second constraint in the Transformer if the first constraint is satisfied (if link ordering is given)?

Answer: Yes.

12. How do we use NLS in DataStage? What are the advantages of NLS? Where can we use it? Explain briefly.

By using NLS we can: - process data in a wide range of languages - use local formats for dates, times and money - sort the data according to local rules.

If NLS is installed, various extra features appear in the product. For server jobs, NLS is implemented in the DataStage Server engine; for parallel jobs, NLS is implemented using the ICU library.

13. If a Datastage job aborts after say 1000 records, how to continue the job from 1000th record after fixing the error?

By specifying checkpointing in the job sequence properties; if we restart the sequence, it will skip up to the failed point and continue from there. This option is available from the 7.5 edition.

14. How to kill a job in DataStage?

By killing the respective process ID.
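A hedged sketch of doing this from the UNIX command line; the project and job names are placeholders, and the DSD.RUN pattern is only the typical name of a server-job phantom process, so verify it on your own engine. Trying a clean stop first is safer than killing the process outright:

    # Ask the engine to stop the job cleanly first (names are placeholders)
    $DSHOME/bin/dsjob -stop MyProject MyJob

    # If the job is hung, find its phantom process and kill it
    ps -ef | grep "DSD.RUN" | grep "MyJob" | grep -v grep
    kill -9 <pid_from_the_listing_above>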

15. What is an environment variable? What is the use of this?

Basically, an environment variable is a predefined variable that we can use while creating a DS job. We can set it either at project level or at job level; once we set a specific variable, it is available to the project/job. We can also define new environment variables; for that we go to the DS Administrator.

16. What are the third-party tools used with DataStage?

AutoSys, TNG and Event Coordinator are some of the ones I know of and have worked with.

What is APT_CONFIG in DataStage?

APT_CONFIG is just an environment variable used to identify the *.apt configuration file, which holds the node information and the configuration of the SMP/MPP server.

17. If you're running 4-way parallel and you have 10 stages on the canvas, how many processes does DataStage create?

The answer is 40: you have 10 stages and each stage can be partitioned and run on 4 nodes, which makes the total number of processes generated 40.

18. Did you parameterize the job or hard-code the values in the jobs?

Always parameterize the job; the values come either from job properties or from Parameter Manager, a third-party tool. There is no way you should hard-code parameters in your jobs. The most often parameterized variables in a job are the DB DSN name, username and password.

19. What is the default number of nodes for DataStage Parallel Edition?

Actually the Number of Nodes depends on the number of processors in your system. If your system is supporting two processors we will get two nodes by default.

20. Is it possible to run parallel jobs in server jobs?

No, it is not possible to run parallel jobs in server jobs, but server jobs can be executed in parallel jobs.

21. Is it possible for two users to access the same job at the same time in DataStage? No, it is not possible for two users to access the same job at the same time; DS will produce the error "Job is accessed by other user".

22. Do you know about MetaStage?

MetaStage is used to handle the metadata, which is very useful for data lineage and data analysis later on. Metadata defines the type of data we are handling; these data definitions are stored in the repository and can be accessed with the use of MetaStage.

23. What is merge and how can it be done? Please explain with a simple example taking two tables.

Merge is used to join two tables. It takes the key columns and sorts them in ascending or descending order. Let us consider two tables, Emp and Dept. If we want to join these two tables, we have DeptNo as a common key, so we can give that column name as the key, sort DeptNo in ascending order, and join the two tables.

24. What is the difference between the Merge stage and the Join stage?

Merge/Join stage differences: 1) the Merge stage has reject links; 2) it can take multiple update links; 3) if you use it for comparison, the first matching record is the output, because it uses the update links to extend the primary details coming from the master link.

25. What are the enhancements made in DataStage 7.5 compared with 7.0?

Many new stages were introduced compared to DataStage 7.0. In server jobs we have the Stored Procedure stage and the Command stage, and a generate-report option in the File tab. In job sequences, many activities such as Start Loop, End Loop, Terminate Loop and User Variables were introduced. In parallel jobs, the Surrogate Key and Stored Procedure stages were introduced.

26. How can we join an Oracle source and a sequential file? A Join or a Lookup stage can be used to join an Oracle source and a sequential file.

27. What is the purpose of the Exception activity in DataStage 7.5?

The stages following the Exception activity are executed whenever an unknown error occurs while running the job sequencer.

28. What are modulus and splitting in a dynamic hashed file?

The modulus size can be increased by contacting your UNIX admin.

29. What is DS Manager used for - did you use it?

The Manager is a graphical tool that enables you to view and manage the contents of the Datastage Repository.

30. What is the difference between Datastage and informatica?

The main difference is the vendor; each one has pluses from its architecture. DataStage takes a top-down approach. Based on the business needs we have to choose the product.

31. What are static hash files and dynamic hash files? Hashed files have a default size established by their modulus and separation when you create them, and this can be static or dynamic. Overflow space is only used when data grows over the reserved size for one of the groups (sectors) within the file. There are as many groups as specified by the modulus.

32. What is the exact difference between Join, Merge and Lookup Stage?

The three stages differ mainly in the memory they use. DataStage doesn't know how large your data is, so it cannot make an informed choice about whether to combine data using a Join stage or a Lookup stage.

Here's how to decide which to use:

If the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over the join processing is very fast and never involves paging or other I/O.

Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links - as many as there are update input links.

33. How do you eliminate duplicate rows?

Duplicates can be eliminated by loading the corresponding data into a hash file; specify the columns on which you want to eliminate duplicates as the keys of the hash file.

34. What does the separation option in a static hash file mean?

The different hashing algorithms are designed to distribute records evenly among the groups of the file based on characters and their position in the record IDs. When a hashed file is created, separation and modulus respectively specify the group buffer size and the number of buffers allocated for the file. When a static hash file is created, DataStage creates a file that contains the number of groups specified by the modulus: Size of hash file = modulus (number of groups) * separation (buffer size).

35. How can we implement a lookup in DataStage server jobs?

The DB2 stage (or a hashed file) can be used for lookups; in the Enterprise Edition, the Lookup stage can be used for doing lookups.

36. What is the importance of a surrogate key in data warehousing?

The concept of a surrogate key comes into play when there is a slowly changing dimension in a table; in such a condition there is a need for a key by which we can identify the changes made in the dimensions.

These are system-generated keys; mainly they are just sequences of numbers, but they can be alphanumeric values as well.

These slowly changing dimensions can be of three types namely SCD1, SCD2, and SCD3.

37. How do we do the automation of DS jobs?

We can call a DataStage batch job from the command prompt using 'dsjob', and we can also pass all the parameters from the command prompt; we then call this shell script from any of the schedulers available in the market. The second option is to schedule the jobs using the DataStage Director.
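For example, the shell script handed to the scheduler might contain little more than the following dsjob call (project, job and parameter names are placeholders):

    # Run the job and pass parameter values from the command prompt
    $DSHOME/bin/dsjob -run -jobstatus \
        -param RunDate="$(date +%Y-%m-%d)" \
        -param SrcDir=/data/incoming \
        MyProject LoadCustomerDim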

38. What is the Hash File stage and what is it used for?


The Hashed File stage stores data hashed on a key, typically as the reference for lookups in server jobs; we can also use it to avoid/remove duplicate rows by specifying the hash key on a particular field.

39. What is version control? Version Control stores different versions of DS jobs, runs different versions of the same job, reverts to a previous version of a job, and also views version histories.

40. How to find the number of rows in a sequential file? Using Row Count System variable

41. Suppose there are a million records: would you use OCI? If not, what stage do you prefer? Use ORABULK.

42. How to run a job from the command prompt in UNIX?

Use the dsjob command, e.g. dsjob -run -jobstatus projectname jobname.

How to find errors in a job sequence?

Using the DataStage Director we can find the errors in a job sequence.

How good are you with PL/SQL?

We do not write PL/SQL in DataStage; SQL knowledge is enough.

If I add a new environment variable in Windows, how can I access it in DataStage?

You can view all the environment variables in the Designer: check Job Properties, where you can add and access environment variables.

How do you pass parameters to the job sequence if the job is running at night?

Two ways: 1. Set the default values of the parameters in the job sequencer and map these parameters to the job. 2. Run the job in the sequencer using the dsjob utility, where we can specify the values to be taken for each parameter.

What are the transaction size and array size in the OCI stage? How can these be used?

Transaction Size - this field exists for backward compatibility, but it is ignored for release 3.0 and later of the plug-in; the transaction size for new jobs is now handled by Rows per transaction on the Transaction Handling tab of the Input page. Rows per transaction - the number of rows written before a commit is executed for the transaction; the default value is 0, that is, all the rows are written before being committed to the data table. Array Size - the number of rows written to or read from the database at a time; the default value is 1, that is, each row is written in a separate statement.

What is the difference between the DRS (Dynamic Relational Stage) and the ODBC stage?

The DRS stage should be faster than the ODBC stage as it uses native database connectivity; you will need to install and configure the required database clients on your DataStage server for it to work. The Dynamic Relational Stage was leveraged for PeopleSoft, to have a job run on any of the supported databases; it supports ODBC connections too (read more in the plug-in documentation). ODBC uses the ODBC driver for a particular database, while DRS is a stage that tries to make it seamless to switch from one database to another; it uses the native connectivity for the chosen target.

How do you track performance statistics and enhance them?

Through the Monitor we can view the performance statistics.
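Continuing the command-prompt questions above, a couple of other dsjob calls are handy for checking a run without opening the Director; project and job names are placeholders and the exact output format varies by release:

    # Report the current status and run times of the job
    dsjob -jobinfo MyProject MyJob

    # Summarise the log (info, warning and error entries) for the job
    dsjob -logsum MyProject MyJob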

What is the meaning of "Try to have the constraints in the 'Selection' criteria of the jobs itself; this will eliminate the unnecessary records even getting in before joins are made"?

This means: try to improve performance by avoiding the use of constraints wherever possible and instead filter while selecting the data itself, using a WHERE clause. This improves performance.

How to drop the index before loading data into the target and how to rebuild it in DataStage? This can be achieved with the "Direct Load" option of the SQL*Loader utility.
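Another common way to achieve the same effect is to call the database from before-job and after-job subroutines via ExecSH; a hedged Oracle sketch, where the connect string and index name are placeholders:

    # Before-job script: make the index unusable so inserts are not slowed down
    echo "ALTER INDEX idx_customer_key UNUSABLE;" | sqlplus -s scott/tiger@dwdb

    # After-job script: rebuild the index once the load has finished
    echo "ALTER INDEX idx_customer_key REBUILD;" | sqlplus -s scott/tiger@dwdb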

There are three different types of user-created stages available. What are they?


These are the three different types: i) Custom, ii) Build, iii) Wrapped.

How will you call an external function or subroutine from DataStage?

There is a DataStage option to call external programs: ExecSH.

What is DS Administrator used for - did you use it?

The Administrator enables you to set up DataStage users, control the purging of the Repository and, if National Language Support (NLS) is enabled, install and manage maps and locales.

What is the maximum capacity of a hash file in DataStage? Take a look at the uvconfig file:

- 64BIT_FILES - this sets the default mode used to create static hashed and dynamic files.
- A value of 0 results in the creation of 32-bit files; 32-bit files have a maximum file size of 2 gigabytes.
- A value of 1 results in the creation of 64-bit files (only valid on 64-bit capable platforms); the maximum file size for 64-bit files is system dependent.
- The default behavior may be overridden by keywords on certain commands. The default setting is 64BIT_FILES 0.

What is the difference between symmetric multiprocessing and massively parallel processing? Symmetric Multiprocessing (SMP) - some hardware resources may be shared by processors; the processors communicate via shared memory and have a single operating system. Cluster or Massively Parallel Processing (MPP) - known as shared-nothing, in which each processor has exclusive access to its hardware resources; cluster systems can be physically dispersed. The processors have their own operating systems and communicate via a high-speed network.

What is the order of execution done internally in the Transformer, with the stage editor having input links on the left-hand side and output links on the right? Stage variables, then constraints, then column derivations or expressions.

What are stage variables, derivations and constraints? Stage variable - an intermediate processing variable that retains its value during a read and doesn't pass the value into a target column. Derivation - an expression that specifies the value to be passed on to the target column. Constraint - a condition that is either true or false and specifies the flow of data through a link.

How to implement a type 2 slowly changing dimension in DataStage? Give an example. Slowly changing dimensions are a common problem in data warehousing. For example: there is a customer called Lisa in a company ABC and she lives in New York; later she moves to Florida, and the company must modify her address. In general, there are three ways to solve this problem. Type 1: the new record replaces the original record; no trace of the old record remains. Type 2: a new record is added to the customer dimension table; the customer is therefore treated essentially as two different people. Type 3: the original record is modified to reflect the change. In Type 1, the new value overwrites the existing one, which means no history is maintained; the history of where the person stayed previously is lost; it is simple to use. In Type 2, a new record is added, so both the original and the new record are present and the new record gets its own primary key; the advantage of type 2 is that historical information is maintained, but the size of the dimension table grows and storage and performance can become a concern. Type 2 should only be used if it is necessary for the data warehouse to track historical changes. In Type 3 there are two columns, one to indicate the original value and the other to indicate the current value; for example, a new column is added which shows the original address as New York and the current address as Florida. This helps keep some part of the history and the table size is not increased, but one problem is that when the customer moves from Florida to Texas the New York information is lost. So Type 3 should only be used if the changes will occur only a finite number of times.

Functionality of Link Partitioner and Link Collector?

Server jobs mainly execute in a sequential fashion; the IPC stage as well as the Link Partitioner and Link Collector simulate a parallel mode of execution over server jobs on a single CPU. The Link Partitioner receives data on a single input link and diverts the data to a maximum of 64 output links, and the data is processed by stages having the same metadata. The Link Collector collects the data from up to 64 input links, merges it into a single data flow and loads it to the target. Both are active stages, and the design and mode of execution of server jobs has to be decided by the designer.

What happens if the job fails at night? The job sequence aborts. What is job control? How can it be used? Explain with steps.

JCL stands for Job Control Language; in DataStage, job control is used to run a number of jobs at a time, with or without loops. Steps: click Edit in the menu bar and select 'Job Properties', then enter the parameters, e.g.:

Parameter   Prompt    Type
STEP_ID     STEP_ID   string
Source      SRC       string
DSN         DSN       string
Username    unm       string
Password    pwd       string

After editing the above, go to the Job Control tab, select the jobs from the list box and run the job.

What is the difference between Datastage and Datastage TX?

It's a tricky question to answer, but one thing I can tell you is that DataStage TX is not an ETL tool and is not a new version of DataStage 7.5; TX is used for ODS sources - this much I know.

If the size of the hash file exceeds 2 GB, what happens? Does it overwrite the current rows? Yes, it overwrites the file.

Do you know about INTEGRITY/QUALITY stage?

Integrity/QualityStage is a data integration tool from Ascential which is used to standardize and integrate data from different sources.


How much would be the size of the database in DataStage? What is the difference between in-process and inter-process?

In-process: you can improve the performance of most DataStage jobs by turning in-process row buffering on and recompiling the job; this allows connected active stages to pass data via buffers rather than row by row. Note: you cannot use in-process row buffering if your job uses COMMON blocks in transform functions to pass data between stages; this is not recommended practice, and it is advisable to redesign your job to use row buffering rather than COMMON blocks. Inter-process: use this if you are running server jobs on an SMP parallel system; it enables the job to run using a separate process for each active stage, which will run simultaneously on separate processors. The same COMMON block restriction applies.

How can you do an incremental load in DataStage?

Incremental load means the daily load. Whenever you select data from the source, select the records which were loaded or updated between the timestamp of the last successful load and today's load start date and time; for this you have to pass parameters for those two dates. Store the last run date and time in a file, read it through the job parameters, and set the second argument to the current date and time.

How do you remove duplicates without using the Remove Duplicates stage?

In the target, make the column the key column and run the job.

What are XML files, how do you read data from XML files and what stages are to be used? In the palette there are Real Time stages such as XML Input, XML Output and XML Transformer.

Where are the flat files actually stored? What is the path?

Flat files store the data, and the path can be given in the General tab of the Sequential File stage.

What is data set? And what is file set?

File set: it allows you to read data from or write data to a file set. The stage can have a single input link, a single output link and a single rejects link, and it executes only in parallel mode. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file, and you need to distribute files among nodes to prevent overruns. Data sets are used to stage data in parallel jobs, much like ODBC is used in server jobs.

What is the meaning of file extender in DataStage server jobs? Can we run a DataStage job from one job to another job?

File extender means adding columns or records to an already existing file. In DataStage we can run one job from another job.

How do you merge two files in DS?

Either use the copy command as a before-job subroutine if the metadata of the two files is the same, or create a job to concatenate the two files into one if the metadata is different.
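On UNIX, the "copy command" approach is just a concatenation in a before-job ExecSH call, assuming both files really do share the same layout (paths are placeholders):

    # Concatenate two files with identical metadata into the job's input file
    cat /data/in/file1.dat /data/in/file2.dat > /data/in/merged.dat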

What is the default cache size? How do you change the cache size if needed? The default read cache size is 128 MB. We can increase it by going into the DataStage Administrator, selecting the Tunables tab and specifying the cache size.


What about System variables?

DataStage provides a set of variables containing useful system information that you can access from a transform or routine; system variables are read-only.

@DATE - the internal date when the program started (see the Date function).
@DAY - the day of the month extracted from the value in @DATE.
@FALSE - the compiler replaces the value with 0.
@FM - a field mark, Char(254).
@IM - an item mark, Char(255).
@INROWNUM - input row counter, for use in constraints and derivations in Transformer stages.
@OUTROWNUM - output row counter (per link), for use in derivations in Transformer stages.
@LOGNAME - the user login name.
@MONTH - the current month extracted from the value in @DATE.
@NULL - the null value.
@NULL.STR - the internal representation of the null value, Char(128).
@PATH - the pathname of the current DataStage project.
@SCHEMA - the schema name of the current DataStage project.
@SM - a subvalue mark (a delimiter used in UniVerse files), Char(252).
@SYSTEM.RETURN.CODE - status codes returned by system processes or commands.
@TIME - the internal time when the program started (see the Time function).
@TM - a text mark (a delimiter used in UniVerse files), Char(251).
@TRUE - the compiler replaces the value with 1.
@USERNO - the user number.
@VM - a value mark (a delimiter used in UniVerse files), Char(253).
@WHO - the name of the current DataStage project directory.
@YEAR - the current year extracted from @DATE.
REJECTED - can be used in the constraint expression of a Transformer stage output link; REJECTED is initially TRUE, but is set to FALSE whenever an output link is successfully written.

Where does a UNIX script of DataStage execute - on the client machine or on the server? DataStage jobs are executed on the server machine only; nothing is stored on the client machine.

What is DS Director used for - did you use it?

The DataStage Director is the GUI used to monitor, run, validate and schedule DataStage server jobs.

What's the difference between DataStage developers and DataStage designers? What are the skills required for each?

A DataStage developer is one who codes the jobs; a DataStage designer is one who designs the job, i.e. he deals with the blueprints and designs the jobs and the stages that are required in developing the code.

What other ETL tools have you worked with? Ab Initio, DataStage EE (Parallel Edition), Oracle ETL; there are about 7 ETL tools in the market.

What will you do in a situation where somebody wants to send you a file, and you must use that file as an input or reference and then run the job?

A. Under Windows: use the 'Wait For File' activity in the sequencer and then run the job; maybe you can schedule the sequencer around the time the file is expected to arrive. B. Under UNIX: poll for the file; once the file has arrived, start the job or sequencer depending on the file.
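A simple polling sketch for the UNIX case; the file path, timeout, project and job names are all placeholders:

    #!/bin/sh
    # Wait for an expected file to arrive, then start the job.
    FILE=/data/incoming/trigger.dat
    WAITED=0

    while [ ! -f "$FILE" ]
    do
        sleep 60                            # check once a minute
        WAITED=`expr $WAITED + 60`
        if [ "$WAITED" -ge 7200 ]; then     # give up after two hours
            echo "File $FILE never arrived" >&2
            exit 1
        fi
    done

    $DSHOME/bin/dsjob -run -jobstatus MyProject MyJob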

What are the command line functions that import and export the DS jobs?

A. dsimport.exe- imports the Datastage components. B. dsexport.exe- exports the Datastage components.


Dimensional modeling is again subdivided into two types: a) star schema - simple and much faster, denormalized form; b) snowflake schema - complex with more granularity, more normalized form.

What is the Sequencer stage in a job sequence? What are the conditions?

A Sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The Sequencer operates in two modes: ALL mode, in which all of the inputs to the Sequencer must be TRUE for any of the Sequencer outputs to fire; and ANY mode, in which output triggers can be fired if any of the Sequencer inputs are TRUE.

What are the Repository Tables in Datastage and what are they?

A data warehouse is a repository (centralized as well as distributed) of data, able to answer any ad hoc, analytical, historical or complex queries. Metadata is data about data; examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries and navigation services. Metadata includes things like the name, length, valid values and description of a data element; it is stored in a data dictionary and repository, and it insulates the data warehouse from changes in the schema of operational systems. In DataStage, under a stage's Interface tab, the Input, Output and Transfer pages have four tabs, the last of which is Build; under that you can find the TABLE NAME.

What is the difference between an operational data store (ODS) and a data warehouse?

A data warehouse is a decision-support database for organizational needs; it is a subject-oriented, non-volatile, integrated, time-variant collection of data. An ODS (operational data store) is an integrated collection of related information; it contains at most around 90 days of information.


How many jobs have you created in your last project?

100+ jobs every 6 months if you are in development; if you are in testing, around 40 jobs every 6 months, although it need not be the same number for everybody.

1. What about system variables? 2. How can we create containers? 3. How can we improve the performance of DataStage? 4. What are the job parameters? 5. What is the difference between a routine, a transform and a function? 6. What are all the third-party tools used in DataStage? 7. How can we implement a lookup in DataStage server jobs? 8. How can we implement slowly changing dimensions in DataStage? 9. How can we join one Oracle source and a sequential file? 10. What are the Iconv and Oconv functions? 11. What is the difference between a hash file and a sequential file? 12. What is the maximum number of characters we can give for a job name in DataStage?

How do you pass a filename as a parameter for a job? 1. Go to DataStage Administrator -> Projects -> Properties -> Environment -> User Defined. Here you can see a grid where you can enter your parameter name and the corresponding path of the file. 2. Go to the Stage tab of the job, select the NLS tab, click on "Use Job Parameter" and select the parameter name which you gave above; the selected parameter name appears in the text box beside the "Use Job Parameter" button. Copy the parameter name from the text box and use it in your job; keep the project default in the text box.


How to remove duplicates in a server job? 1) Use a hashed file stage, or 2) if you use the sort command in UNIX (in a before-job subroutine), you can reject duplicate records using the -u option (see the sketch below), or 3) use a Sort stage.
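For option 2, the before-job command is essentially a one-liner; paths are placeholders, and -u keeps a single copy of each identical line (key-based de-duplication would need sort's -t and -k options):

    # Before-job subroutine (ExecSH): de-duplicate the input file with UNIX sort
    sort -u /data/in/customers.dat > /data/in/customers_dedup.dat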

What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?

AUTOSYS": Thru autosys u can automate the job by invoking the shell script written to schedule the datastage jobs.

Is it possible to call one job from another job in server jobs?

I think we can call a job from another job. In fact 'calling' doesn't sound quite right, because you attach/add the other job through the job properties; you can attach zero or more jobs. The steps are: Edit --> Job Properties --> Job Control, then click Add Job and select the desired job.

How do you clean the DataStage repository? Remove log files periodically.

If data is partitioned in your job on key 1 and you then aggregate on key 2, what issues could arise? The data will be partitioned on both the keys; hardly will it take any more time for execution.

What is job control? How is it developed? Explain with steps.


Job control means controlling DataStage jobs from some other DataStage job. Example: consider two jobs, XXX and YYY; job YYY can be executed from job XXX by using DataStage macros in routines. To execute one job from another, the following steps are needed in the routine: 1. Attach the job using the DSAttachJob function. 2. Run the other job using the DSRunJob function. 3. Stop the job using the DSStopJob function.

Containers: usage and types?

A container is a collection of stages used for the purpose of reusability. There are two types of containers: a) local container - job specific; b) shared container - used in any job within a project. There are two types of shared container: 1. server shared containers, used in server jobs (and which can also be used in parallel jobs); 2. parallel shared containers, used in parallel jobs. You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you could use one to make a server plug-in stage available to a parallel job).

What does a config file in Parallel Extender consist of?

a) The number of processes or nodes. b) The actual disk storage locations.

How can you implement complex jobs in DataStage? Complex design means having more joins and more lookups; such a job design is called a complex job. We can easily implement any complex design in DataStage by following simple tips, also in terms of increasing performance. There is no limit on the number of stages in a job, but for better performance use at most about 20 stages in each job; if it exceeds 20 stages, go for another job. Use not more than 7 lookups for a Transformer; otherwise include one more Transformer.

What are the validations you perform after creating jobs in the Designer?

What are the different types of errors you faced during loading and how did you solve them? Check the parameters, check whether the input files exist, check whether the input tables exist, and also check usernames, data source names, passwords and the like (a pre-flight check of this kind is sketched below).
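A small pre-flight script of the kind described, runnable as a before-job subroutine; the file path and parameter names are placeholders:

    #!/bin/sh
    # Basic checks before letting the load start.
    SRC_FILE=/data/in/customers.dat             # placeholder path

    # Input file must exist and be non-empty
    if [ ! -s "$SRC_FILE" ]; then
        echo "Input file $SRC_FILE is missing or empty" >&2
        exit 1
    fi

    # Mandatory connection parameters must be set (names are illustrative)
    for v in "$DSN" "$DB_USER" "$DB_PASS"
    do
        if [ -z "$v" ]; then
            echo "A required connection parameter is not set" >&2
            exit 1
        fi
    done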

What is the User Variables activity? When is it used, how is it used and where is it used? Give a real example.

By using the User Variables activity we can create some variables in the job sequence; these variables are available to all the activities in that sequence. Usually this activity is placed at the start of the job sequence.

I want to process 3 files sequentially, one by one; how can I do that, so that while processing it fetches the files automatically? If the metadata for all the files is the same, then create a job having the file name as a parameter, then use the same job in a routine and call the job with a different file name each time, or you can create a sequencer to run the job (a shell sketch follows).
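A shell sketch of the routine/sequencer idea: run the same parameterised job once per file, one after the other. The job would normally need to be re-runnable (multi-instance or reset between runs); project, job, parameter name and paths are placeholders:

    #!/bin/sh
    # Process three files sequentially with one parameterised job.
    for f in /data/in/file1.dat /data/in/file2.dat /data/in/file3.dat
    do
        $DSHOME/bin/dsjob -run -jobstatus -param SourceFile="$f" MyProject LoadFile
        RC=$?
        # Stop if a run did not finish cleanly (typically 1 = OK, 2 = warnings)
        if [ "$RC" -ne 1 ] && [ "$RC" -ne 2 ]; then
            echo "Run for $f failed with status $RC" >&2
            exit 1
        fi
    done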

What happens if the output of a hash file is connected to a Transformer? What error does it throw?

If a hash file output is connected to a Transformer stage, the hash file is treated as a lookup file if there is a primary link to the same Transformer stage; if there is no primary link, then it is treated as the primary link itself. You can do SCDs in server jobs by using this lookup functionality. This does not return any error code.

What are the Iconv and Oconv functions?

Iconv() converts a string to the internal storage format; Oconv() converts an expression to an output format.

What are the Oconv() and Iconv() functions and where are they used?

Iconv is used to convert a date into the internal format, i.e. one that only DataStage understands. For example, a date coming in mm/dd/yyyy format is converted by DataStage into a number such as 740; you can then output that number in your own format using Oconv. Suppose you want to change mm/dd/yyyy to dd/mm/yyyy: you use Iconv and then Oconv, roughly Oconv(Iconv(input_date_string, iconv_format), oconv_format), where the conversion codes are listed in the help.

What is 'insert for update' in DataStage? I think 'insert for update' means the updated value is inserted as a new row to maintain history.

How can I convert server jobs into parallel jobs?

I have never tried doing this, however, I have some information which will help you in saving a lot of time. You can convert your server job into a server shared container. The server shared container can also be used in parallel jobs as shared container.

Can we use a shared container as a lookup in DataStage server jobs? I am using DataStage 7.5 on Unix. Can we use a shared container more than once in a job - is there any limit? In my job I used the shared container in 6 flows, but at any time only 2 flows are working; can you please share information on this?

DataStage from Staging to MDW is only running at 1 row per second! What do we do to remedy?

I am assuming that there are too many stages, which is causing the problem, and providing the solution accordingly. In general, if you have too many stages (especially Transformers and hash lookups), there is a lot of overhead and the performance degrades drastically. I would suggest you write a query instead of doing several lookups; it may seem embarrassing to have a tool and still write a query, but that is best at times. If there are too many lookups being done, ensure that you have appropriate indexes while querying. If you do not want to write the query and you use intermediate stages, ensure that you properly eliminate data between stages so that data volumes do not cause overhead; there might be a re-ordering of stages needed for good performance. Other things that could be looked into in general: 1) for massive transactions, set the hashing size and buffer size to appropriate values to do as much as possible in memory, with no I/O overhead to disk; 2) enable row buffering and set an appropriate size for the row buffer; 3) use appropriate objects between stages for performance.

What is the flow of loading data into fact & dimensional tables?

Here is the sequence of loading a data warehouse: 1. the source data is first loaded into the staging area, where data cleansing takes place; 2. the data from the staging area is then loaded into dimensions/lookups; 3. finally, the fact tables are loaded from the corresponding source tables in the staging area.

What is the difference between a sequential file and a data set? When do we use the Copy stage? The Sequential File stage stores a small amount of data, with any extension, in order to access the file, whereas a data set is used to store a huge amount of data and opens only with the .ds extension. The Copy stage copies a single input data set to a number of output data sets; each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop or change the order of columns.

What happens if RCP is disabled? Runtime column propagation (RCP): if RCP is enabled for a job, and specifically for those stages whose output connects to the shared container input, then metadata is propagated at run time, so there is no need to map it at design time. If RCP is disabled for the job, the OSH has to perform an import and export every time the job runs, and the processing time of the job also increases.

What are routines, where/how are they written, and have you written any routines before? Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them using the Routine dialog box. The following program components are classified as routines:

- Transform functions: functions that you can use when defining custom transforms. DataStage has a number of built-in transform functions, located in the Routines > Examples > Functions branch of the Repository; you can also define your own transform functions in the Routine dialog box.
- Before/after subroutines: when designing a job, you can specify a subroutine to run before or after the job, or before or after an active stage. DataStage has a number of built-in before/after subroutines, located in the Routines > Built-in > Before/After branch of the Repository; you can also define your own before/after subroutines using the Routine dialog box.
- Custom UniVerse functions: specialized BASIC functions that have been defined outside DataStage. Using the Routine dialog box, you can get DataStage to create a wrapper that enables you to call these functions from within DataStage; these functions are stored under the Routines branch in the Repository, and you specify the category when you create the routine. If NLS is enabled, ...

How can we call a routine in a DataStage job? Explain with steps.

Routines are used for implementing business logic. They are of two types: 1) before-stage subroutines and 2) after-stage subroutines. Steps: double-click on the Transformer stage, right-click on any one of the mapping fields and select the [DS Routines] option; within the edit window give the business logic and select either of the options (before/after subroutine).

How can we improve the performance of DataStage jobs? Performance and tuning of DS jobs: 1. establish baselines; 2. avoid using only one flow for tuning/performance testing; 3. work in increments; 4. evaluate data skew; 5. isolate and solve; 6. distribute file systems to eliminate bottlenecks; 7. do not involve the RDBMS in initial testing; 8. understand and evaluate the tuning knobs available.

Types of Parallel Processing?

Parallel processing is broadly classified into two types: a) SMP - symmetric multiprocessing; b) MPP - massively parallel processing.

What are the ORABULK and BCP stages?

ORABULK is used to load bulk data into a single table of a target Oracle database; BCP is used to load bulk data into a single table in Microsoft SQL Server or Sybase.

How can we ETL an Excel file into a data mart?

Open the ODBC Data Source Administrator found in Control Panel/Administrative Tools. Under the System DSN tab, add the Microsoft Excel driver; then you'll be able to access the XLS file from DataStage.

What is OCI and how is it used in the ETL tool?

OCI doesn't mean the ORABULK data; it actually uses the Oracle Call Interface to load the data. It is more or less the lowest level of Oracle being used for loading the data.

What are the different types of lookups in Datastage?


There are two types of lookups: the Lookup stage and the Lookup File Set. Lookup: references another stage or database to get data from it and transform it to the other database. Lookup File Set: allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link; the output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link. When creating lookup file sets, one file is created for each partition; the individual files are referenced by a single descriptor file, which by convention has the suffix .fs.

How can we create Containers?

There are two types of containers: 1. local containers and 2. shared containers. A local container is available to that particular job only, whereas shared containers can be used anywhere in the project.

Local container: Step 1: select the stages required. Step 2: Edit > Construct Container > Local.

Shared container: Step 1: select the stages required. Step 2: Edit > Construct Container > Shared. Shared containers are stored in the Shared Containers branch of the tree structure.

How do you populate source files?

There are many ways to populate them; one way is writing a SQL statement in Oracle.


What are the differences between DataStage 7.0 and 7.5 in server jobs?

There are a lot of differences: many new stages are available in DS 7.5, e.g. the CDC stage, the Stored Procedure stage, etc.

Briefly describe the various client components?

There are four client components. DataStage Designer: a design interface used to create DataStage applications (known as jobs); each job specifies the data sources, the transforms required, and the destination of the data, and jobs are compiled to create executables that are scheduled by the Director and run by the Server. DataStage Director: a user interface used to validate, schedule, run and monitor DataStage jobs. DataStage Manager: a user interface used to view and edit the contents of the Repository. DataStage Administrator: a user interface used to configure DataStage projects and users.

What are the types of views in the DataStage Director?

There are three types of views in the DataStage Director: a) Job view - dates of jobs compiled; b) Log view - status of the job's last run; c) Status view - warning messages, event messages and program-generated messages.

What are the environment variables in DataStage? Give some examples.

These are variables used at the project or job level. We can use them to configure the job, i.e. we can associate the configuration file (without this you cannot run your job), increase the sequential or data set read/write buffer, and so on. Example: $APT_CONFIG_FILE. There are many more environment variables like this; go to the job properties and click on "Add Environment Variable" to see most of them.

What are the steps involved in the development of a job in DataStage? The steps required are: select the source stage depending upon the sources (e.g. flat file, database, XML, etc.); select the required stages for the transformation logic, such as Transformer, Link Collector, Link Partitioner, Aggregator, Merge, etc.; and select the final target stage where you want to load the data, whether it is a data warehouse, data mart, ODS, staging area, etc.

What is the difference between 'validated OK' and 'compiled' in DataStage?

When we say "Validating a Job", we are talking about running the Job in the "check only" mode. The following checks are made: - Connections are made to the data sources or data warehouse. - SQL SELECT statements are prepared. - Files are opened. Intermediate files in Hashed File, UniVerse, or ODBC stages that use the local data source are created, if they do not already exist.

Why do you use SQL LOADER or OCI STAGE?

When the source data is enormous, or for bulk data, we can use OCI or SQL*Loader depending upon the source.

Where do we use the Link Partitioner in a DataStage job? Explain with an example.

We use the Link Partitioner in DataStage server jobs. The Link Partitioner stage is an active stage which takes one input and allows you to distribute partitioned rows to up to 64 output links.


What is the purpose of using keys, and what is the difference between surrogate keys and natural keys?

We use keys to provide relationships between the entities (tables); by using primary and foreign key relationships, we can maintain the integrity of the data. The natural key is the one coming from the OLTP system, while the surrogate key is the artificial key which we create in the target DW. We can use these surrogate keys instead of the natural key; in SCD2 scenarios surrogate keys play a major role.

How does Datastage handle the user security?

We have to create users in the Administrator and give the necessary privileges to the users.

How to parameterize a field in a sequential file? I am using Datastage as ETL Tool, Sequential file as source.

We cannot parameterize a particular field in a sequential file; instead we can parameterize the source file name of the sequential file.

Is it possible to move data from an Oracle warehouse to an SAP warehouse using the DataStage tool?

We can use the DataStage Extract Pack for SAP R/3 and the DataStage Load Pack for SAP BW to transfer the data from Oracle to the SAP warehouse; these plug-in packs are available with DataStage version 7.5.


How to implement type 2 slowly changing dimensions in DataStage? Explain with an example. We can handle SCDs in the following ways. Type 1: just use "Insert rows else update rows" or "Update rows else insert rows" as the update action of the target. Type 2: use the following steps: a) use one hash file to look up the target; b) take 3 instances of the target; c) give different conditions depending on the process; d) give different update actions in the target; e) use system variables like Sysdate and Null.

How to handle rejected rows in DataStage?

We can handle rejected rows in two ways, with the help of constraints in a Transformer: 1) by ticking the Reject cell in the constraints area of the Transformer's properties where we write our constraints, or 2) by using REJECTED in the expression editor of the constraint. Create a hash file as temporary storage for rejected rows, create a link and use it as one of the outputs of the Transformer, and apply either of the two steps above on that link; all the rows rejected by all the constraints will go to the hash file.

What are the difficulties faced in using DataStage? 1) When the number of lookups is large. 2) When clients want currency as an integer in conjunction with characters, like 2m or 3l. 3) Handling what happens when a job aborts for some reason while loading the data.

Does the Enterprise Edition only add parallel processing for better performance? Are any stages/transformations available in the Enterprise Edition only?

DataStage Standard Edition was previously called DataStage and DataStage Server Edition. DataStage Enterprise Edition was originally called Orchestrate, then renamed to Parallel Extender when purchased by Ascential. DataStage Enterprise: server jobs, sequence jobs, parallel jobs; the Enterprise Edition offers parallel processing features for scalable high-volume solutions. Designed originally for UNIX, it now supports Windows, Linux and Unix System Services on mainframes. DataStage Enterprise MVS: server jobs, sequence jobs, parallel jobs, MVS jobs; MVS jobs are designed using an alternative set of stages that are generated into COBOL/JCL code and are transferred to a mainframe to be compiled and run. Jobs are developed on a UNIX or Windows server and transferred to the mainframe to be compiled and run. The first two versions share the same Designer interface but have a different set of design stages depending on the type of job you are working on. Parallel jobs have parallel stages but also accept some server stages via a container; server jobs only accept server stages; MVS jobs only accept MVS stages. There are some stages that are common to all types (such as aggregation), but they tend to have different fields and options within the stage.



How do you clean the DataStage repository? Remove job log files periodically; the auto-purge settings in the Administrator can be used to keep the job logs from growing indefinitely.

If data is partitioned in your job on key 1 and you then aggregate on key 2, what issues could arise? Rows with the same value of key 2 can end up on different partitions, so the Aggregator may produce partial or duplicate groups unless the data is re-partitioned (and, where required, sorted) on key 2 first; the extra repartitioning also adds to the execution time.

Dimension modelling types along with their significance

Ans: Data modelling is broadly classified into two types: 1) E-R (entity-relationship) modelling and 2) dimensional modelling, which is carried out at a) the logical level and b) the physical level.

What is job control? How is it developed? Explain with steps.

Job control means controlling DataStage jobs from another DataStage job. For example, consider two jobs XXX and YYY: job YYY can be executed from job XXX by using DataStage macros in a routine. To execute one job from another, the routine follows these steps: 1. attach the job using the DSAttachJob function; 2. run the attached job using the DSRunJob function; 3. stop the job, if required, using the DSStopJob function. A minimal sketch follows below.
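A minimal job-control sketch in DataStage BASIC, assuming a controlled job named YYY with a parameter LoadDate (both names are illustrative):

* Job-control sketch: attach, parameterize, run and wait for another job
$INCLUDE DSINCLUDE JOBCONTROL.H             ;* needed when this code sits in a routine

hJob = DSAttachJob("YYY", DSJ.ERRFATAL)     ;* attach the controlled job
ErrCode = DSSetParam(hJob, "LoadDate", "2024-01-01")
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)     ;* start it
ErrCode = DSWaitForJob(hJob)                ;* block until it finishes
Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
If Status <> DSJS.RUNOK And Status <> DSJS.RUNWARN Then
   Call DSLogWarn("Job YYY did not finish cleanly", "JobControlRoutine")
End
* (DSStopJob(hJob) could be used instead to stop a job that is still running)
ErrCode = DSDetachJob(hJob)                 ;* release the handle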

Containers: usage and types? A container is a collection of stages grouped together for the purpose of reusability.

There are 2 types of Containers. a) Local Container: Job Specific b) Shared Container: Used in any job within a project.

There are two types of shared container: 1. Server shared container. Used in server jobs (can also be used in parallel jobs). 2. Parallel shared container. Used in parallel jobs. You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you could use one to make a server plug-in stage available to a parallel job).


What are constraints and derivations?

A constraint specifies the condition under which data flows through an output link; in other words, the constraint decides which output link a row is sent down. Constraints are essentially business rules or logic. For example, if we have to split a customers.txt file into separate customer address files based on customer country, we pass constraints: to get the US customer addresses we pass a constraint for the US customers file, and similarly for the Canadian and Australian customers.

Constraints are used to check a condition and filter the data. Example: Cust_Id <> 0 is set as a constraint, and only those records meeting it will be processed further. A derivation is a method of deriving the value of a field, for example when you need a SUM or AVG; the derivation specifies the expression that passes a value to the target column. In the simplest case the derivation is just the input column itself, which passes its value straight through to the target column. A short sketch of both follows below.
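A minimal sketch of how these look in a server Transformer, assuming an input link named In with COUNTRY, CUST_ID, FIRST_NAME and LAST_NAME columns (all names are illustrative):

* Constraint on the "US customers" output link: only rows satisfying it pass down the link
In.COUNTRY = "US" And In.CUST_ID <> 0

* Derivations on that link's output columns
CUST_ID   = In.CUST_ID                                        ;* straight pass-through
FULL_NAME = Trim(In.FIRST_NAME) : " " : Trim(In.LAST_NAME)    ;* concatenation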

What is the difference between constraint and derivations?

Constraints are applied to links, whereas derivations are applied to columns.

Explain the process of taking a backup in DataStage.

Any DataStage objects, including whole projects, that are stored in the Manager repository can be exported to a file. This exported file can then be imported back into DataStage.

Import and export can be used for many purposes, such as:

1) Backing up jobs and projects. 2) Maintaining different versions of jobs or projects. 3) Moving DataStage jobs from one project to another: just export the objects, move to the other project, and then re-import them into the new project. 4) Sharing jobs and projects between developers: the export files, when zipped, are small and can easily be emailed from one developer to another.

How can you implement complex jobs in DataStage? A complex design means a job with many joins and lookups. We can implement any complex design in DataStage by following a few simple tips that also help performance: there is no hard limit on the number of stages in a job, but for better performance use at most around 20 stages per job, and if a design exceeds that, split it into another job; likewise, use no more than 7 lookups per Transformer, otherwise introduce an additional Transformer.

What validations do you perform after creating jobs in the Designer?

Validation checks that a DataStage job will run successfully; it carries out the following without actually processing any data: 1) connections are made to the sources; 2) the files are opened; 3) the SQL statements necessary for fetching the data are prepared; 4) all connections from source to target are made ready for data processing; 5) parameters are checked, and it verifies that the input files exist, that the input tables exist, and that the user names, data source names and passwords are valid.

What are the different types of errors you faced during loading and how did you solve them? How do you fix the error "OCI has fetched truncated data" in DataStage? Can the Change Capture stage be used to detect the truncated data?


What is the User Variables activity, when is it used and how is it used? Where is it used? Give a real example.

Using the User Variables activity we can create variables in a job sequence; these variables are then available to all the activities in that sequence. This activity is usually placed at the start of the job sequence.

What is the meaning of the following? 1) "If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel." Ans: it refers to row partitioning and collecting. 2) "Tuning should occur on a job-by-job basis" and "use the power of the DBMS": tune each job individually, and push work into the database where it is more efficient.

If you have SMP machines you can use the IPC, Link Collector and Link Partitioner stages for performance tuning; if you have cluster or MPP machines you can use parallel jobs.

What is DS Manager used for? The Manager is a graphical tool that enables you to view and manage the contents of the DataStage Repository. Its main use, via export and import, is sharing jobs and projects from one project to another.

How do you handle date conversions in DataStage? We use a) the Iconv function for internal conversion and b) the Oconv function for external conversion. For example, to convert an mm/dd/yyyy string to yyyy-mm-dd: Oconv(Iconv(FieldName, "D/MDY[2,2,4]"), "D-YMD[4,2,2]"). A packaged version of this as a transform routine is sketched below.
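A minimal sketch of the same conversion wrapped as a server transform routine (the routine and argument names are illustrative; the result is returned in Ans, as DataStage BASIC routines expect):

* Transform routine body; argument: InDate, a string in mm/dd/yyyy format
InternalDate = Iconv(InDate, "D/MDY[2,2,4]")      ;* to internal date format
If Status() = 0 Then
   Ans = Oconv(InternalDate, "D-YMD[4,2,2]")      ;* out as yyyy-mm-dd
End Else
   Ans = ""                                       ;* input was not a valid date
End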


Importance of a surrogate key in data warehousing? The concept of a surrogate key comes into play when there are slowly changing dimensions in a table. In that situation we need a key by which we can identify the changes made to the dimension rows (slowly changing dimensions can be of three types: SCD1, SCD2 and SCD3). Surrogate keys are system-generated keys; mainly they are just a sequence of numbers, but they can also be alphanumeric values.

How can we implement a lookup in DataStage server jobs? We can use a hash file as a lookup in server jobs; the hash file needs at least one key column to be created.

How can you implement slowly changing dimensions in DataStage?

Yes, you can implement Type 1, Type 2 or Type 3. Let me try to explain Type 2 with a timestamp. Step 1: the timestamp is created via a shared container; it returns the system time and a key. To satisfy the lookup condition we create a key column using the Column Generator. Step 2: our source is a Data Set and the lookup table is an Oracle OCI stage; using the Change Capture stage we find the differences. The Change Capture stage returns a change_code value, and based on that value we decide whether the row is an insert, an edit or an update. If it is an insert we stamp it with the current timestamp, and the old row with the old timestamp is kept as history.

Summarize the difference between OLTP, ODS and a data warehouse. OLTP means online transaction processing; it is essentially a database handling real-time transactions, which inherently have some special requirements (Oracle, SQL Server and DB2 are typical OLTP databases). ODS stands for Operational Data Store; it is an integration point in the ETL process where data is loaded before it is loaded into the target. A data warehouse is an integrated, subject-oriented, time-variant and non-volatile collection of data used to support management decisions.

Why are OLTP database designs generally not a good idea for a data warehouse? OLTP systems do not store historical information about the organization; they hold the details of daily transactions, while a data warehouse is a large store of historical information obtained from different data marts, used for making intelligent decisions about the organization.

What is data cleaning? How is it done?

I can simply describe it as purifying the data. Data cleansing is the act of detecting and removing and/or correcting a database's dirty data, i.e. data that is incorrect, out of date, redundant, incomplete, or formatted incorrectly.

What is the level of granularity of a fact table? The level of granularity is the level of detail you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data for each transaction; granularity then means what detail you are willing to keep for each transactional fact, e.g. product sales per minute, or aggregated up to the minute or the day. Data aggregated for a year for a given product can also be drilled down to monthly, weekly and daily levels; the lowest level of detail is known as the grain, and going down to that detail is granularity.

Which columns go to the fact table and which columns go to the dimension table? The aggregation or calculated-value (measure) columns go to the fact table and the descriptive detail goes to the dimension tables. In addition, the foreign keys to the dimensions are stored in the fact table along with the business measures, such as sales amount in dollars or units (quantity sold); a date may also be a business measure in some cases. It also depends on the granularity at which the data is stored.

What is a cube in the data warehousing concept? Cubes are logical representations of multidimensional data: the edges of the cube contain dimension members and the body of the cube contains the data values.

What are SCD1, SCD2 and SCD3? With SCD Type 1, the attribute value is overwritten with the new value, obliterating the historical attribute values; for example, when the product roll-up changes for a given product, the roll-up attribute is simply updated with the current value. With SCD Type 2, a new record with the new attributes is added to the dimension table; historical fact table rows continue to reference the old dimension key with the old roll-up attribute, while new fact table rows reference the new surrogate key with the new roll-up, thereby perfectly partitioning history. With SCD Type 3, attributes are added to the dimension table to support two simultaneous roll-ups, perhaps the current product roll-up as well as "current version minus one", or the current version and the original.

What is real-time data warehousing?


Real-time data warehousing is a combination of two things: 1) real-time activity and 2) data warehousing. Real-time activity is activity that is happening right now, such as the sale of widgets; once the activity is complete, there is data about it. Data warehousing captures business activity data, and real-time data warehousing captures it as it occurs: as soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.

What is an ER diagram? The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a way to unify the network and relational database views.

What is a lookup table? A lookup table is a reference table that supplies values to a referencing table at run time; it saves joins and space in transformations. For example, a lookup table called States provides the actual state name (Texas) in place of TX in the output.

What is the main difference between a schema in an RDBMS and schemas in a data warehouse?

RDBMS schema: used for OLTP systems; traditional, older style of schema; normalized; difficult to understand and navigate; cannot easily handle extraction and complex analytical problems; modelled for transactions rather than analysis.

DWH schema: used for OLAP systems; newer generation of schema; de-normalized; easy to understand and navigate; extraction and complex analytical problems can be solved easily; a very good model for analysis.


A surrogate key is an artificial identifier for an entity; surrogate key values are generated by the system sequentially (like the Identity property in SQL Server or a Sequence in Oracle) and do not describe anything. A primary key is a natural identifier for an entity; primary key values are entered by the users and uniquely identify each row, with no repetition of data. Why a surrogate key rather than the primary key: if a column is made the primary key and later the data type or length of that column needs to change, then all the foreign keys that depend on that primary key have to change as well, making the database unstable. Surrogate keys make the database more stable because they insulate the primary and foreign key relationships from changes in data types and lengths. A sketch of how such keys are typically generated in a server job follows below.
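In DataStage server jobs a surrogate key is typically generated in a Transformer derivation; a minimal sketch, assuming the SDK key-management routine is installed (the sequence name CUST_DIM and the parameter MaxKey are illustrative):

* Derivation on the dimension's surrogate key column, using the SDK routine,
* which maintains a persistent counter per sequence name:
CUST_SK = KeyMgtGetNextValue("CUST_DIM")

* Alternative without the SDK routine: a stage variable svKey (initial value 0)
* incremented per row, offset by a job parameter holding the current maximum key:
svKey   = svKey + 1
CUST_SK = MaxKey + svKey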

What is Snow Flake Schema? Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a product dimension table in a star schema might be normalized into a products table, a product category table, and a product manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance

INTERVIEW QUESTIONS (GENERAL)

What is a warehouse schema, such as a star schema or snowflake schema, and what are their advantages/disadvantages under different conditions? How do you design an optimized data warehouse from both a data-load and a query-performance point of view? What exactly are parallel processing and partitioning, and how can they be employed to optimize the data warehouse design? What are the preferred indexes and constraints for a DWH? How do the volume of data (from medium to very high) and the frequency of querying affect the design considerations?


Why a data warehouse? What is the difference between OLTP and OLAP? What are the features of a DWH? Do you know any other ETL tools? What is the use of a staging area? Do you know the life cycle of a warehouse? Have you heard about star schemas? Tell me about yourself. How many dimensions and facts are there in your project? What is a dimension? What is the difference between a DWH and a data mart? 1. How can you explain a DWH to a layman? 2. What are MOLAP and ROLAP, and what is the difference between them? 3. What are the different schemas used in a DWH, and which one is most commonly used? 4. What is a snowflake schema? Explain. 5. What is a star schema? Explain. 6. How do you decide that you have to go for warehousing during the requirements study? 7. What are all the questions you put to your client when you are designing a DWH?

Oracle: how many types of indexes are there? In a warehouse, which indexes are used? What is the difference between TRUNCATE and DELETE on a table?


how do you Optimise the Query..Read Optimisation in Oracle..

Project : Project Description and All....

What is a Data Warehouse?. What is the difference between OLTP and OLAP?. How can you explain Data ware house to a lay man?.

Do you mean to say that if we increase the system resources like RAM, Hard Disk and Processor, we can as well make an OLTP system behave as a DWH?. As a DBA what are the differences do you think a DWH architecture should have or what are the parameters that you are concerned about when taking DWH into account?. What are indexes and what are the different Types?. Why should you do indexing first of all?. What sort of indexing is done in Fact and why?. What sort of indexing is done in Dimensions and why?. What sort of normalization will you have on dimensions and facts?. What are materialized views?. What is a Star schema?. What is a Snow Flake schema?. What is the difference between those two?. Which one would you choose and why?. What are dimensions?. What are facts?. What is your role in the projects that you have done?.


What is Informatica powercenter capabilities?. What are the different types of Transformers?. What is the difference between Source and Joiner transformers? Why is a source used and why is a joiner used?. What are active and passive transformers?. How many transformers have you used?. What is CDC?. What are SCDs and what are the different types?. Which of the types have you used in your project?. What is a date dimension How have you handled sessions?. How can you handle multiple sessions and batches?. On what platform was your Informatica server?. How many mappings were there in your projects?. How many transformers did your biggest mapping have?. Which was your source and which was your target platforms?. What is a cube?. How can you create a catalog and what are its types?. What is power play transformer?. What is Power play administrator?(Same as above question). What is Slice and Dice?. What is your idea of an Authenticator?. Can cubes exist independently?. Can cubes be sources to another application?. How many maximum number of rows can a date dimension have?.

How have you done reporting?. What are hotfiles? What are snapshots?. What is the difference?.

I-GATE INTERVIEW QUESTION

1] What is the difference between snowflake and star schemas? 2] How will you come to know that you have to do performance tuning? 3] Describe your project. 4] How many dimensions and facts are in your project? 5] Draw SCD Type 1 and SCD Type 2. 6] If the target is receiving timestamp data and one port in the target has a date data type, how will you load it? 7] What are the different types of lookups? 8] What condition will you give in the Update Strategy transformation for SCD Type 1? 9] What are the different types of variables in the Update Strategy transformation? 10] What is target-based commit and source-based commit? 11] Why do you think SCD Type 2 is critical? 12] What are the types of facts? 13] What is a factless fact? 14] If I can return one port through a connected lookup, why would you need an unconnected lookup? 15] If duplicate rows are coming from a flat file, how will you remove them using Informatica? 16] If duplicate rows are coming from a relational table, how will you remove them using Informatica? 17] If I do not set the group-by option in the Aggregator transformation, what will the result be?


18] What is multidimensional analysis? 19] If I give all the characteristics of a data warehouse to an OLTP system, will it be a data warehouse? 20] What are the characteristics of a data warehouse? 21] What is the break-up of your team? 22] How will you do performance tuning in a mapping? 23] Which is better for performance, a static or a dynamic cache? 24] What is target load order? 25] What transformations have you worked on? 26] What naming convention are you using? 27] How are you getting data from the client? 28] How will you convert rows into columns and columns into rows using Informatica? 29] How will you enable a test load? 30] Did you work with connected and unconnected lookups? Tell the difference. 31] Did you ever use the Normalizer?


1) What are the steps involved in developing a job in DataStage? Ans: the steps required are: select the data source stage depending on the sources (flat file, database, XML, etc.); select the stages required for the transformation logic, such as Transformer, Link Collector, Link Partitioner, Aggregator or Merge; and select the final target stage where you want to load the data, whether it is a data warehouse, data mart, ODS or staging area. 2) How does DataStage handle user security? Ans: we have to create users in the Administrator and give them the necessary privileges. 3) What is the difference between the DRS and ODBC stages?


Ans: The DRS stage should be faster than the ODBC stage because it uses native database connectivity; you will need to install and configure the required database clients on your DataStage server for it to work. The Dynamic Relational Stage was introduced for PeopleSoft so a job could run against any of the supported databases, and it supports ODBC connections too (see the plug-in documentation). In short, the ODBC stage uses the ODBC driver for a particular database, while DRS is a stage that tries to make switching from one database to another seamless by using the native connectivity for the chosen target. 4) How do you use rank and update strategy in DataStage? Ans: don't mix Informatica up with DataStage; DataStage has no such stages. 5) How can I convert server jobs into parallel jobs? Ans: I have never tried doing this, but one approach saves a lot of time: you can convert your server job into a server shared container, and the server shared container can then be used in parallel jobs as a shared container.

A direct conversion is not really possible. Using the IPC stage or the Link Partitioner/Link Collector stages can introduce a degree of parallelism into server jobs. 7) What is a data set, and what is a file set? Ans: File set: allows you to read data from or write data to a file set; the stage can have a single input link, a single output link and a single rejects link, and it executes only in parallel mode. The data files and the file that lists them are together called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns. Data sets are used to bring data into parallel jobs much as ODBC is used in server jobs. 8) Is it possible to run parallel jobs in server jobs?


Ans: No, it is not possible to run parallel jobs within server jobs, but server jobs can be executed within parallel jobs.

9) What are the ORABULK and BCP stages? Ans: ORABULK is used to bulk-load data into a single table of a target Oracle database; BCP is used to bulk-load data into a single table for Microsoft SQL Server and Sybase. 10) What is the User Variables activity, when is it used and how is it used, with a real example? Ans: using the User Variables activity we can create variables in a job sequence.

--------------------------------------------------------------------------------------------------------------------------Read the String functions in DS A: Functions like [] -> sub-string function and ':' -> concatenation operator Syntax: string [ [ start, ] length ] string [ delimiter, instance, repeats ] What are Sequencers? A: Sequencers are job control programs that execute other jobs with preset Job parameters. 14. What are other Performance tunings you have done in your last project to increase the performance of slowly running jobs? A: 1. Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files for optimum performance also for data recovery in case job aborts. 2. Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects. 3. Tuned the 'Project Tunables' in Administrator for better performance. 4. Used sorted data for Aggregator. 5. Sorted the data as much as possible in DB and reduced the use of DS-Sort for better performance of jobs 6. Removed the data not used from the source as early as possible in the job. 7. Worked with DB-admin to create appropriate Indexes on tables for better performance of DS queries 8. Converted some of the complex joins/business in DS to Stored Procedures on DS for faster execution of the jobs.

9. If an input file has an excessive number of rows and can be split-up then use standard logic to run jobs in parallel. 10. Before writing a routine or a transform, make sure that there is not the functionality required in one of the standard routines supplied in the sdk or ds utilities categories. Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros but if it is inline code then the overhead will be minimal. 11. Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even getting in before joins are made. 12. Tuning should occur on a job-by-job basis. 13. Use the power of DBMS. 14. Try not to use a sort stage when you can use an ORDER BY clause in the database. 15. Using a constraint to filter a record set is much slower than performing a SELECT WHERE. 16. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE. Do u know about METASTAGE? A: MetaStage is used to handle the Metadata which will be very useful for data lineage and data analysis later on. Meta Data defines the type of data we are handling. This Data Definitions are stored in repository and can be accessed with the use of MetaStage. Explain the differences between Oracle8i/9i? A: Oracle 8i does not support pseudo column sysdate but 9i supports Oracle 8i we can create 256 columns in a table but in 9i we can upto 1000 columns(fields) . How do you merge two files in DS? A: Either use Copy command as a Before-job subroutine if the metadata of the 2 files are same or create a job to concatenate the 2 files into one if the metadata is different. 28. What are Static Hash files and Dynamic Hash files? A: As the names itself suggest what they mean. In general we use Type-30 dynamic Hash files. The Data file has a default size of 2Gb and the overflow file is used if the data exceeds the 2GB size.
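Coming back to the substring [] and concatenation ':' operators described at the start of this block, a minimal DataStage BASIC sketch (all values are illustrative):

* Substring form: string[start, length]
Name   = "DataStage"
Code3  = Name[1,3]                ;* -> "Dat"
* Field-extraction form: string[delimiter, instance, repeats]
List   = "A|B|C"
Second = List["|", 2, 1]          ;* -> "B"
* Concatenation with ':'
Label  = "ETL" : "-" : "Tool"     ;* -> "ETL-Tool"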

Overview:DataStage Interview Questions and Answers,Solution and Explanation How did you handle reject data? Ans: Typically a Reject-link is defined and the rejected data is loaded back into data warehouse. So Reject link has to be defined every Output link you wish to collect rejected data. Rejected data is typically bad data like duplicates of Primary keys or null-rows where data is expected. If worked with DS6.0 and latest versions what are Link-Partitioner and Link-Collector used for? Ans: Link Partitioner - Used for partitioning the data. Link Collector - Used for collecting the partitioned data. What are Routines and where/how are they written and have you written any routines before?


Ans: Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit. The following are different types of routines: 1) Transform functions 2) Before-after job subroutines 3) Job Control routines What are OConv () and Iconv () functions and where are they used? Ans: IConv() - Converts a string to an internal storage format OConv() - Converts an expression to an output format. How did you connect to DB2 in your last project? Ans: Using DB2 ODBC drivers. Explain METASTAGE? Ans: MetaStage is used to handle the Metadata which will be very useful for data lineage and data analysis later on. Meta Data defines the type of data we are handling. This Data Definitions are stored in repository and can be accessed with the use of MetaStage. Do you know about INTEGRITY/QUALITY stage? Ans: Qulaity Stage can be integrated with DataStage, In Quality Stage we have many stages like investigate, match, survivorship like that so that we can do the Quality related works and we can integrate with datastage we need Quality stage plugin to achieve the task. Explain the differences between Oracle8i/9i? Ans: Oracle 8i does not support pseudo column sysdate but 9i supports Oracle 8i we can create 256 columns in a table but in 9i we can upto 1000 columns(fields) How do you merge two files in DS? Ans: Either use Copy command as a Before-job subroutine if the metadata of the 2 files are same or create a job to concatenate the 2 files into one if the metadata is different. What is DS Designer used for? Ans: You use the Designer to build jobs by creating a visual design that models the flow and transformation of data from the data source through to the target warehouse. The Designer graphical interface lets you select stage icons, drop them onto the Designer work area, and add links. What is DS Administrator used for? Ans: The Administrator enables you to set up DataStage users, control the purging of the Repository, and, if National Language Support (NLS) is enabled, install and manage maps and locales. What is DS Director used for? Ans: datastage director is used to run the jobs and validate the jobs. we can go to datastage director from datastage designer it self. What is DS Manager used for? Ans: The Manager is a graphical tool that enables you to view and manage the contents of the DataStage Repository


What are Static Hash files and Dynamic Hash files? Ans: As the names itself suggest what they mean. In general we use Type-30 dynamic Hash files. The Data file has a default size of 2Gb and the overflow file is used if the data exceeds the 2GB size. What is Hash file stage and what is it used for? Ans: Used for Look-ups. It is like a reference table. It is also used in-place of ODBC, OCI tables for better performance. How are the Dimension tables designed? Ans: Find where data for this dimension are located. Figure out how to extract this data. Determine how to maintain changes to this dimension. Change fact table and DW population routines. Overview:Free DataStage Interview Questions and Answers,Data Warehousing Questions and Answers, DS Interview Questions and Answers,Solutions for Performance Tuning What are conformed dimensions? Ans: A conformed dimension is a single, coherent view of the same piece of data throughout the organization. The same dimension is used in all subsequent star schemas defined. This enables reporting across the complete data warehouse in a simple format. Why fact table is in normal form? Ans: Basically the fact table consists of the Index keys of the dimension/ook up tables and the measures. so when ever we have the keys in a table .that itself implies that the table is in the normal form. What is a linked cube? Ans: A cube can be stored on a single analysis server and then defined as a linked cube on other Analysis servers. End users connected to any of these analysis servers can then access the cube. This arrangement avoids the more costly alternative of storing and maintaining copies of a cube on multiple analysis servers. linked cubes can be connected using TCP/IP or HTTP. To end users a linked cube looks like a regular cube. What is degenerate dimension table? Ans: The values of dimension which is stored in fact table is called degenerate dimensions. these dimensions doesn,t have its own dimensions. What is ODS? Ans: ODS stands for Online Data Storage. Operational Data Store

What is a general purpose scheduling tool? Ans: The basic purpose of the scheduling tool in a DW Application is to stream line the flow of data from Source To Target at specific time or based on some condition. What is the need of surrogate key;why primary key not used as surrogate key?


Ans: A surrogate key is an artificial identifier for an entity; its values are generated by the system sequentially (like the Identity property in SQL Server or a Sequence in Oracle) and do not describe anything. A primary key is a natural identifier for an entity; its values are entered by the users and uniquely identify each row, with no repetition of data. What is the Hash File stage and what is it used for? Ans: It is used for lookups; it acts like a reference table and is also used in place of ODBC/OCI tables for better performance. What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director? Ans: Use the crontab utility along with the dsexecute() function, with the proper parameters passed. What is OCI, and how is it used by the ETL tool? Ans: OCI is the Oracle Call Interface, the native Oracle connectivity used by the Oracle OCI and ORABULK stages; because it is native rather than ODBC-based, it retrieves and loads bulk data faster. What are the Oconv() and Iconv() functions and where are they used? Ans: Iconv() converts a string to an internal storage format; Oconv() converts an expression to an output format. What is a fact table? Ans: A fact table contains the measurements, metrics or facts of a business process. If your business process is "Sales", then a measurement of this business process, such as the monthly sales number, is captured in the fact table. The fact table also contains the foreign keys to the dimension tables. What are the steps in building the data model? Ans: While the ER model lists and defines the constructs required to build a data model, there is no standard process for doing so; some methodologies, such as IDEF1X, specify a bottom-up approach. What is dimensional modeling? Ans: Dimensional modeling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables, fact tables and dimension tables: the fact table contains the facts/measurements of the business and the dimension tables contain the context of the measurements, i.e. the dimensions on which the facts are calculated. What type of indexing mechanism do we need to use for a typical data warehouse? Ans: On the fact table it is best to use bitmap indexes; dimension tables can use bitmap and/or the other types of clustered/non-clustered, unique/non-unique indexes. What is normalization (First, Second, Third Normal Form)? Ans: Normalization can be defined as splitting a table into separate tables so as to avoid duplication of values. Is it correct/feasible to develop a data mart using an ODS? Ans: Yes, it is correct to develop a data mart using an ODS, because the ODS stores transactional data for only a few days (limited history), which is what a data mart requires.


What are other Performance tunings you have done in your last project to increase the performance of slowly running jobs? Ans: 1. Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files for optimum performance also for data recovery in case job aborts. 2. Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects. 3. Tuned the 'Project Tunables' in Administrator for better performance. 4. Used sorted data for Aggregator. 5. Sorted the data as much as possible in DB and reduced the use of DS-Sort for better performance of jobs 6. Removed the data not used from the source as early as possible in the job. 7. Worked with DB-admin to create appropriate Indexes on tables for better performance of DS queries 8. Converted some of the complex joins/business in DS to Stored Procedures on DS for faster execution of the jobs. 9. If an input file has an excessive number of rows and can be split-up then use standard logic to run jobs in parallel. 10. Before writing a routine or a transform, make sure that there is not the functionality required in one of the standard routines supplied in the sdk or ds utilities categories. Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros but if it is inline code then the overhead will be minimal. Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even getting in before joins are made. 12. Tuning should occur on a job-by-job basis. 13. Use the power of DBMS. 14. Try not to use a sort stage when you can use an ORDER BY clause in the database. 15. Using a constraint to filter a record set is much slower than performing a SELECT WHERE. 16. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE. Overview:DataStage Interview Questions with Answers,DataStage Questions Answers,Explanations ,Solution Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a Truncate statement to the DB or does it do some kind of Delete logic. Ans: There is no TRUNCATE on ODBC stages. It is Clear table blah blah and that is a delete from statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options. They are radically different in permissions (Truncate requires you to have alter table permissions where Delete doesn't). Tell me one situation from your last project, where you had faced problem and How did you solve it? Ans: The jobs in which data is read directly from OCI stages are running extremely slow. I had to stage the data before sending to the transformer to make the jobs run faster.


B. The job aborts in the middle of loading some 500,000 rows. Have an option either cleaning/deleting the loaded data and then run the fixed job or run the job again from the row the job has aborted. To make sure the load is proper we opted the former. Why do we have to load the dimensional tables first, then fact tables: Ans: As we load the dimensional tables the keys (primary) are generated and these keys (primary) are Foreign keys in Fact tables. How will you determine the sequence of jobs to load into data warehouse? Ans: First we execute the jobs that load the data into Dimension tables, then Fact tables, then load the Aggregator tables (if any). What are the command line functions that import and export the DS jobs? Ans: A. dsimport.exe- imports the DataStage components. B. dsexport.exe- exports the DataStage components. What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director? Ans: Use crontab utility along with dsexecute() function along with proper parameters passed. How would call an external Java function which are not supported by DataStage? Ans: Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. In this case we can even use the command line to invoke the Java function and write the return values from the Java program (if any) and use that files as a source in DataStage job. What will you in a situation where somebody wants to send you a file and use that file as an input or reference and then run job. Ans: A. Under Windows: Use the 'WaitForFileActivity' under the Sequencers and then run the job. May be you can schedule the sequencer around the time the file is expected to arrive. B. Under UNIX: Poll for the file. Once the file has start the job or sequencer depending on the file. Read the String functions in DS Ans: Functions like [] -> sub-string function and ':' -> concatenation operator Syntax: string [ [ start, ] length ] string [ delimiter, instance, repeats ] How did you connect with DB2 in your last project? Ans: Most of the times the data was sent to us in the form of flat files. The data is dumped and sent to us. In some cases were we need to connect to DB2 for look-ups as an instance then we used ODBC drivers to connect to DB2 (or) DB2-UDB depending the situation and availability. Certainly DB2-UDB is better in terms of performance as you know the native drivers are always better than ODBC drivers. 'iSeries Access ODBC Driver 9.00.02.02' - ODBC drivers to connect to AS400/DB2.

DB2 and DB2-UDB http://bytes.com/topic/db2/answers/675110-db2-versus-udb


There never was a separate UDB product. "Universal Database" was a property added to versions of DB2 that had certain extensibility options (e.g. LOBs, UDFs, distinct types); Informix calls it "Universal Server". DB2 UDB V5 for LUW was the first to earn that label, so many people shortened DB2 UDB for LUW to just UDB, implying that "DB2" meant DB2 for z/OS. Over the years DB2 for z/OS and DB2 for iSeries also got the UDB label. Since all major vendors today have these extensibility options, the label has largely lost its meaning, and DB2 9 (on either LUW or z/OS) no longer carries the UDB label.

What are Sequencers? Ans: Sequencers are job control programs that execute other jobs with preset job parameters. Differentiate Primary Key and Partition Key? Ans: A primary key is a combination of unique and not null; it can be a collection of key values, called a composite primary key. A partition key is just a part of the primary key. How did you handle an 'Aborted' sequencer? Ans: In almost all cases we have to delete the data it inserted from the DB manually, fix the job, and then run the job again.


Overview:Data Warehousing Questions and Answers.DataStage Interview Questions Answers,DataStage Questions with Answers What is the importance of Surrogate Key in Data warehousing? Ans : Surrogate Key is a Primary Key for a Dimension table. Most importance of using it is it is independent of underlying database. i.e Surrogate Key is not affected by the changes going on with a database What does a Config File in parallel extender consist of? Ans: Config file consists of the following. a) Number of Processes or Nodes. b) Actual Disk Storage Location. How many places you can call Routines? Ans:Four Places you can call (i) Transform of routine (A) Date Transformation (B) Upstring Transformation (ii) Transform of the Before & After Subroutines(iii) XML transformation(iv)Web base How did you handle an 'Aborted' sequencer? Ans: In almost all cases we have to delete the data inserted by this from DB manually and fix the job and then run the job again. Is it possible to calculate a hash total for an EBCDIC file and have the hash total stored as EBCDIC using Datastage ? Ans: Currently, the total is converted to ASCII, even though the individual records are stored as EBCDIC. Compare and Contrast ODBC and Plug-In stages? Ans: ODBC : a) ODBC - Poor Performance. b) ODBC - Can be used for Variety of Databases. c) ODBC - Can handle Stored Procedures. Plug-In: a) Good Performance. b) Database specific (Only one database). What is Functionality of Link Partitioner and Link Collector? Link Partitioners partition data and Collectors bring them back to a single stream. What is the types and usage of containers. Ans: Containers: Usage and Types? Containers is a collection of stages used for the purpose of Reusability. There are 2 types of Containers. a) Local Container: Job Specific b) Shared Container: Used in any job within a project. Explain Dimension modeling types along with their significance Ans: Data Modeling is Broadly classified into 2 types. a) E-R Diagrams (Entity - Relationships). b) Dimensional modeling. Did you Parameterize the job or hard-coded the values in the jobs? Ans: Always parameterized the job. Either the values are coming from Job Properties or from a Parameter Manager a third part tool. There is no way you will hardcode some parameters in your


jobs. How did you connect with DB2 in your last project? Ans: Most of the times the data was sent to us in the form of flat files. The data is dumped and sent to us. In some cases were we needed to connect to DB2 for look-ups as an instance then we used ODBC drivers. What are the often used Stages or stages you worked with in your last project? Ans: A) Transformer, ORAOCI8/9, ODBC, Link-Partitioner, Link-Collector, Hash, Aggregator, Sort. How many jobs have you created in your last project? Ans: 100+ jobs for every 6 months if you are in Development, if you are in testing 40 jobs for every 6 months although it need not be the same number for everybody Have you ever involved in updating the DS versions like DS 5.X, if so tell us some the steps you have taken in doing so? Ans: Yes. The following are some of the steps; I have taken in doing so: 1) Definitely take a back up of the whole project(s) by exporting the project as a .dsx file. 2) See that you are using the same parent

Overview:DataStage Interview Questions asked in IBM,Dell,HP,Cognizant,TCS,Infosys,Wipro,Patni,CapGemini,HCL,Tech Mahindra,Cisco,Delloite


What is a merge and how can it be done? Please explain with a simple example using two tables. Is it possible to run parallel jobs in server jobs? What are the enhancements made in DataStage 7.5 compared with 7.0? If I add a new environment variable in Windows, how can I access it in DataStage? What is OCI? (native connectivity drivers). Is it possible to move the data from an Oracle warehouse to an SAP warehouse using the DataStage tool? (DS plug-ins) How can we create containers? What is a data set, and what is a file set? How much would be the size of the database in DataStage? What is the difference between in-process and inter-process? Briefly describe the various client components. What are the ORABULK and BCP stages?


ORABULK is used to bulk-load data into a single table of a target Oracle database. BCP is used to bulk-load data into a single table for Microsoft SQL Server and Sybase.

custom stages
DataStage allows you to create your own stages with custom properties. There are three types of custom stages: Custom: an Orchestrate-based custom stage, where Orchestrate operators are used to define the functionality. Build: a C++-based custom stage, where the functionality is specified in C++ code that is then assembled into a Build-type stage. Wrapped: a UNIX-based custom stage, where the functionality is defined by specifying UNIX commands, which become the code of the stage. (Read more at http://www.geekinterview.com/question_details/23766#wh98hlhVbG660GYU.99)

Custom:
This allows knowledgeable Orchestrate users to specify an Orchestrate operator as a DataStage stage. This is then available to use in DataStage parallel jobs.

Build:
This allows you to design and build your own bespoke operator as a stage to be included in DataStage parallel jobs.

Wrapped:
This allows you to specify a UNIX command to be executed by a DataStage stage. You define a wrapper file that in turn defines arguments for the UNIX command and its inputs and outputs.

Wrapped stage
Enables you to run an existing sequential program in parallel

Build stage
Enables you to write a C expression that is automatically generated into a parallel custom stage.


Custom stage
Provides a complete C++ API for developing complex and extensible stages.


Part 1

How did you handle reject data? Ans: Typically a Reject-link is defined and the rejected data is loaded back into data warehouse. So Reject link has to be defined every Output link you wish to collect rejected data. Rejected data is typically bad data like duplicates of Primary keys or null-rows where data is expected. If worked with DS6.0 and latest versions what are Link-Partitioner and Link-Collector used for? Ans: Link Partitioner - Used for partitioning the data. Link Collector - Used for collecting the partitioned data. What are Routines and where/how are they written and have you written any routines before? Ans: Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit. The following are different types of routines: 1) Transform functions 2) Before-after job subroutines 3) Job Control routines

64

What are OConv() and IConv() functions and where are they used? Ans: IConv() converts a string to an internal storage format; OConv() converts an expression to an output format.

How did you connect to DB2 in your last project? Ans: Using DB2 ODBC drivers.

Explain MetaStage. Ans: MetaStage is used to handle the metadata, which is very useful for data lineage and data analysis later on. Metadata defines the type of data we are handling. These data definitions are stored in the repository and can be accessed with the use of MetaStage.

Do you know about the INTEGRITY/QUALITY stage? Ans: QualityStage can be integrated with DataStage. In QualityStage we have stages like Investigate, Match and Survivorship so that we can do the quality-related work; to integrate it with DataStage we need the QualityStage plug-in.

Explain the differences between Oracle 8i and 9i. Ans: Oracle 8i does not support the pseudo column sysdate but 9i does. In Oracle 8i we can create 256 columns in a table, but in 9i we can have up to 1000 columns (fields).

How do you merge two files in DS? Ans: Either use the Copy command as a Before-job subroutine if the metadata of the 2 files is the same, or create a job to concatenate the 2 files into one if the metadata is different.

What is DS Designer used for? Ans: You use the Designer to build jobs by creating a visual design that models the flow and transformation of data from the data source through to the target warehouse. The Designer graphical interface lets you select stage icons, drop them onto the Designer work area, and add links.

What is DS Administrator used for? Ans: The Administrator enables you to set up DataStage users, control the purging of the Repository, and, if National Language Support (NLS) is enabled, install and manage maps and locales.

What is DS Director used for? Ans: DataStage Director is used to run and validate the jobs. We can go to DataStage Director from DataStage Designer itself.

What is DS Manager used for? Ans: The Manager is a graphical tool that enables you to view and manage the contents of the DataStage Repository.

What are Static Hash files and Dynamic Hash files? Ans: As the names themselves suggest. In general we use Type-30 dynamic hash files. The data file has a default size of 2 GB and the overflow file is used if the data exceeds the 2 GB size.

What is the Hash file stage and what is it used for? Ans: It is used for look-ups. It is like a reference table. It is also used in place of ODBC/OCI tables for better performance.

How are the Dimension tables designed? Ans: Find where the data for the dimension is located. Figure out how to extract this data. Determine how to maintain changes to the dimension. Change the fact table and DW population routines.
Part 3

Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a TRUNCATE statement to the DB, or does it do some kind of DELETE logic? Ans: There is no TRUNCATE on ODBC stages. 'Clear the table' issues a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options. They are radically different in permissions (TRUNCATE requires you to have ALTER TABLE permission, whereas DELETE doesn't).

Tell me one situation from your last project where you faced a problem, and how did you solve it? Ans: A. The jobs in which data was read directly from OCI stages were running extremely slowly. I had to stage the data before sending it to the transformer to make the jobs run faster. B. A job aborted in the middle of loading some 500,000 rows. We had the option of either cleaning/deleting the loaded data and then running the fixed job, or running the job again from the row at which it aborted. To make sure the load was proper, we opted for the former.

Why do we have to load the dimension tables first, then the fact tables? Ans: As we load the dimension tables the (primary) keys are generated, and these keys are foreign keys in the fact tables.

How will you determine the sequence of jobs to load into the data warehouse? Ans: First we execute the jobs that load the data into the dimension tables, then the fact tables, then the aggregate tables (if any).

What are the command-line programs that import and export DS jobs? Ans: A. dsimport.exe imports DataStage components. B. dsexport.exe exports DataStage components.
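Returning to the Clear-table versus Truncate question above, here is a minimal SQL sketch of the difference. The table name EMP_STG is just a placeholder for illustration, and the exact privilege rules vary by database:

-- 'Clear the table and Insert rows' on an ODBC stage effectively issues a DELETE:
DELETE FROM EMP_STG;
-- row-by-row, fully logged, needs only DELETE permission, can be rolled back

-- The Truncate option on an OCI (Oracle) stage issues a TRUNCATE instead:
TRUNCATE TABLE EMP_STG;
-- deallocates storage, minimal logging, typically needs higher privileges
-- (e.g. ALTER TABLE permission or table ownership) and cannot be rolled back in Oracle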


What utility do you use to schedule jobs on a UNIX server other than using Ascential Director? Ans: Use the crontab utility along with the dsjob command, with the proper parameters passed.

How would you call an external Java function that is not supported by DataStage? Ans: Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. We can even use the command line to invoke the Java function, write the return values from the Java program (if any) to a file, and use that file as a source in a DataStage job.

What will you do in a situation where somebody wants to send you a file, and you need to use that file as an input or reference and then run a job? Ans: A. Under Windows: use the 'WaitForFileActivity' stage in a Sequencer and then run the job. You could also schedule the sequencer around the time the file is expected to arrive. B. Under UNIX: poll for the file; once the file has arrived, start the job or sequencer that depends on it.

Describe the string functions in DS. Ans: Functions like [] (the substring operator) and ':' (the concatenation operator). Syntax: string[ [start,] length ] and string[ delimiter, instance, repeats ].

How did you connect with DB2 in your last project? Ans: Most of the time the data was sent to us in the form of flat files; the data is dumped and sent to us. In the cases where we needed to connect to DB2 for look-ups, we used ODBC drivers to connect to DB2 (or DB2 UDB) depending on the situation and availability. Certainly DB2 UDB is better in terms of performance, as native drivers are always better than ODBC drivers. 'iSeries Access ODBC Driver 9.00.02.02' is the ODBC driver used to connect to AS/400-DB2.

What are Sequencers? Ans: Sequencers are job control programs that execute other jobs with preset job parameters.

Differentiate Primary Key and Partition Key. Ans: A primary key is a combination of unique and not null; it can be a collection of key values, called a composite primary key. A partition key is just a part of the primary key.

How did you handle an 'Aborted' sequencer? Ans: In almost all cases we have to delete the data it inserted from the DB manually, fix the job, and then run the job again.
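As an illustration of the primary key versus partition key point above, here is a hedged Oracle-style sketch; the SALES table and its columns are hypothetical and not from the source:

CREATE TABLE SALES (
  SALE_ID    NUMBER       NOT NULL,
  SALE_DATE  DATE         NOT NULL,
  AMOUNT     NUMBER(12,2),
  CONSTRAINT PK_SALES PRIMARY KEY (SALE_ID, SALE_DATE)   -- composite primary key
)
PARTITION BY RANGE (SALE_DATE)                            -- partition key is only part of the PK
(
  PARTITION P2023 VALUES LESS THAN (DATE '2024-01-01'),
  PARTITION P2024 VALUES LESS THAN (DATE '2025-01-01')
);

The point of the sketch is simply that rows are physically placed by SALE_DATE alone, while uniqueness is enforced on the full (SALE_ID, SALE_DATE) combination.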

http://www.directutor.com/content/etl-performance-improvement-tips


The ETL jobs that run every day to pull data from transactional OLTP database servers and load the analytical OLAP warehouse databases were taking more time than expected. Following are tips that will help you improve ETL performance. These are the daily-running ETLs along with their timings before applying the improvement tips:

ETL 1: 132 minutes (avg. for the latest 15 ETL runs)
ETL 2: 462 minutes (avg. for the latest 15 ETL runs)
ETL 3: 450-500 minutes (avg. for the latest 15 ETL runs)

1. View Definition Optimization: There were 3 main views used to get the data from the main source tables. The existing definitions of the views were:

1. SELECT ACT.* FROM S_EVT_ACT ACT (NOLOCK) JOIN S_SRV_REQ SR (NOLOCK) ON ACT.SRA_SR_ID = SR.ROW_ID
2. SELECT ACT.* FROM S_EVT_ACT ACT (NOLOCK) JOIN S_DOC_AGREE ITR (NOLOCK) ON ITR.ROW_ID = ACT.AGREEMENT_ID WHERE ITR.X_QUICK_TICKET_ROW_ID IS NOT NULL
3. SELECT ACT.* FROM S_EVT_ACT ACT (NOLOCK) JOIN S_PROD_DEFECT IR (NOLOCK) ON ACT.SRA_DEFECT_ID = IR.ROW_ID

Here S_EVT_ACT, S_SRV_REQ, S_DOC_AGREE and S_PROD_DEFECT are the main source transactional database tables, each holding huge volumes of data. While pulling from these views, the task was taking almost 1 hour per view (each view is used in 3 different ETLs). The reason it took that much time was the join condition with the other main tables. To optimize the views, the join was removed so that view 1 no longer queries S_SRV_REQ, view 2 no longer queries S_DOC_AGREE, and view 3 no longer queries S_PROD_DEFECT. The view definitions were changed as follows:

1. SELECT ACT.* FROM S_EVT_ACT ACT (NOLOCK) WHERE ACT.SRA_SR_ID IS NOT NULL
2. SELECT ACT.* FROM S_EVT_ACT ACT (NOLOCK) WHERE ACT.AGREEMENT_ID IS NOT NULL
3. SELECT ACT.* FROM S_EVT_ACT ACT (NOLOCK) WHERE ACT.SRA_DEFECT_ID IS NOT NULL

The rewritten views now filter on the foreign-key column within S_EVT_ACT itself rather than joining to the other main tables, which is equivalent as long as a non-null key implies a matching parent row.
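One way to sanity-check that a rewritten view returns the same rows as the original is an EXCEPT comparison. This is a hedged sketch: it assumes both definitions are still queryable and that ROW_ID identifies a row in S_EVT_ACT, which is suggested but not spelled out in the text above:

-- Rows the old definition returns that the new one does not (should be empty if the rewrite is equivalent)
SELECT ACT.ROW_ID
FROM   S_EVT_ACT ACT (NOLOCK)
JOIN   S_SRV_REQ SR (NOLOCK) ON ACT.SRA_SR_ID = SR.ROW_ID
EXCEPT
SELECT ACT.ROW_ID
FROM   S_EVT_ACT ACT (NOLOCK)
WHERE  ACT.SRA_SR_ID IS NOT NULL;
-- Run the reverse EXCEPT as well to catch rows picked up only by the new definition
-- (non-null keys that have no matching parent row).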

Results: After changing the view definitions as shown above, performance improved significantly. The data pull task now takes just 2-3 minutes against the earlier 1 hour, a saving of almost 50 minutes for each ETL.
2. SSIS Lookup Task (Null Column Handling): One lookup task in the SSIS package was taking almost 4 hours each day, i.e. roughly 4 hours just to process each day's incremental records, which was abnormal and was affecting reporting very badly. While analyzing the task, the following issues were found:

1. One column in the source-destination mapping was not mapped to the source, so there was no data in this column; the column is called COUNTRY in this example.
2. The lookup SQL was as follows:

SELECT ROW_WID, PERSONNELNUMBER, COUNTRY FROM WC_PERBIGAREA_D (NOLOCK)

If COUNTRY, or any other column used in the lookup SQL, contains NULL values, the lookup behaves badly on every incremental run: a comparison with NULL always evaluates to false, so the lookup keeps scanning all the records of the incremental pull against the NULL-country rows in the existing table, effectively multiplying the incremental rows by the NULL-valued rows. That is why it was taking almost the same 4 hours for each incremental run. Solution: the following changes were made to improve the lookup task's performance, basically by handling the NULL values (see the sketch after this list):
1. Update Column (NULL): Updated the COUNTRY column of the main WC_PERBIGAREA_D table from NULL to the relevant value, joining to the appropriate reference table to populate the correct country value.
2. Modified Lookup Mapping: Modified the lookup target mapping to include the COUNTRY (missing) column as well, so that from this point onwards the country values do not become NULL and do not affect performance.
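A hedged sketch of the backfill in step 1. The reference table PER_COUNTRY_REF and its join key are hypothetical placeholders, since the source does not name the table that was joined to; only WC_PERBIGAREA_D, PERSONNELNUMBER and COUNTRY come from the text above:

-- Backfill NULL country values from a reference table so the lookup no longer matches on NULL
UPDATE D
SET    D.COUNTRY = R.COUNTRY
FROM   WC_PERBIGAREA_D AS D
JOIN   PER_COUNTRY_REF AS R
       ON R.PERSONNELNUMBER = D.PERSONNELNUMBER   -- hypothetical join key
WHERE  D.COUNTRY IS NULL;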

Results: After making the above changes, the task now completes in 10-15 minutes, again a saving of almost 4 hours in ETL execution time.
3. Deadlock Prevention: It is always better to prevent a deadlock from occurring in the first place than to let it occur and then recover from it. One task in the SSIS package truncated the main destination table and then loaded a fresh full set of data on each ETL run.

On analysis we observed that, because truncation needs an exclusive lock on the table, if any user is querying that table at the time of truncation the ETL gets into a deadlocked/hung state until the blocking is cleared manually. Because of this, the ETL sometimes hung for 2-3 hours or even a whole day, which was really hurting daily ETL performance.

Solution:
1. Created a new temporary (staging) table with the same schema definition as the main destination table that was being truncated in each ETL run.
2. In each incremental run, truncated only the temporary table and loaded the data into it, so deadlocks/blocking are prevented because the main table is never truncated.
3. For moving the data from the temporary table to the main table, followed an UPDATE/INSERT strategy (a hedged SQL sketch follows below). This way the ETL no longer hangs, because the main table is never exclusively locked, user queries are not affected, the ETL runs faster since there is no blocking point, and the performance is consistent.
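A minimal SQL Server-style sketch of the staging pattern in steps 1-3. The names MAIN_DEST, MAIN_DEST_STG, BUSINESS_KEY and the columns are hypothetical placeholders, since the source does not name the destination table:

-- 1. One-time setup: staging table with the same schema as the destination
SELECT * INTO MAIN_DEST_STG FROM MAIN_DEST WHERE 1 = 0;

-- 2. Each ETL run: truncate and reload only the staging table (no lock taken on MAIN_DEST)
TRUNCATE TABLE MAIN_DEST_STG;
-- ... bulk load the full extract into MAIN_DEST_STG here ...

-- 3. Apply the staged data to the main table with an UPDATE followed by an INSERT
UPDATE M
SET    M.COL1 = S.COL1, M.COL2 = S.COL2
FROM   MAIN_DEST M
JOIN   MAIN_DEST_STG S ON S.BUSINESS_KEY = M.BUSINESS_KEY;

INSERT INTO MAIN_DEST (BUSINESS_KEY, COL1, COL2)
SELECT S.BUSINESS_KEY, S.COL1, S.COL2
FROM   MAIN_DEST_STG S
WHERE  NOT EXISTS (SELECT 1 FROM MAIN_DEST M WHERE M.BUSINESS_KEY = S.BUSINESS_KEY);

A single MERGE statement could do the same work in one pass; the two-statement form above simply mirrors the UPDATE/INSERT wording used in the text.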

Results:
1. ETL performance improved because there are no blocking points.
2. User queries are not affected.
3. ETL performance is consistent.

Following are the daily-running ETL timings after applying the improvement tips:

ETL 1: 68 minutes (avg. for 15 ETL runs) - Solution 1 applied
ETL 2: 151 minutes (avg. for 15 ETL runs) - Solutions 1 & 3 applied
ETL 3: 200 minutes (avg. for 15 ETL runs) - Solutions 1 & 2 applied

Part 4

What versions of DS have you worked with? Ans: DS 7.0.2/6.0/5.2


If you worked with DS 6.0 and later versions, what are Link-Partitioner and Link-Collector used for? Ans: Link Partitioner - used for partitioning the data. Link Collector - used for collecting the partitioned data.

How do you rename all of the jobs to support your new file-naming conventions? Ans: Create an Excel spreadsheet with the new and old names. Export the whole project as a .dsx. Write a Perl program which can do a simple rename of the strings by looking up the Excel file.

Explain the types of parallel processing. Ans: Parallel processing is broadly classified into 2 types: a) SMP - Symmetric Multiprocessing; b) MPP - Massively Parallel Processing.

Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a TRUNCATE statement to the DB, or does it do some kind of DELETE logic? Ans: There is no TRUNCATE on ODBC stages. 'Clear the table' issues a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options.

When should we use an ODS? Ans: DWHs are typically read-only and batch updated on a schedule; ODSs are maintained in more real time, trickle-fed constantly.

What is the default cache size? How do you change the cache size if needed? Ans: The default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab, and specifying the cache size there.

How do you handle date conversions in DataStage, for example converting an mm/dd/yyyy format to yyyy-dd-mm? Ans: We use a) the "Iconv" function for internal conversion and b) the "Oconv" function for external (output) conversion. The function to convert mm/dd/yyyy format to yyyy-dd-mm is Oconv(Iconv(Filedname,"D/M…

Differentiate Primary Key and Partition Key. Ans: A primary key is a combination of unique and not null; it can be a collection of key values, called a composite primary key. A partition key is just a part of the primary key.

Is it possible to calculate a hash total for an EBCDIC file and have the hash total stored as EBCDIC using DataStage? Ans: Currently the total is converted to ASCII, even though the individual records are stored as EBCDIC.
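For the date conversion question above, the full expression is commonly written as something like Oconv(Iconv(Fieldname, "D/MDY[2,2,4]"), "D-YDM[4,2,2]"); this is an assumption worth verifying against the Iconv/Oconv date conversion codes in your DataStage release. The Iconv code parses the mm/dd/yyyy input into the internal date format, and the Oconv code formats it back out as yyyy-dd-mm.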


How do you merge two files in DS? Ans: Either use the Copy command as a Before-job subroutine if the metadata of the 2 files is the same, or create a job to concatenate the 2 files into one if the metadata is different.

How did you connect to DB2 in your last project? Ans: Using DB2 ODBC drivers.

What is the default cache size? How do you change the cache size if needed? Ans: The default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab, and specifying the cache size there.

What are Sequencers? Ans: Sequencers are job control programs that execute other jobs with preset job parameters.

How do you execute a DataStage job from the command-line prompt? Ans: Using the "dsjob" command, as follows: dsjob -run -jobstatus projectname jobname

How do you rename all of the jobs to support your new file-naming conventions? Ans: Create an Excel spreadsheet with the new and old names. Export the whole project as a .dsx. Write a Perl program which can do a simple rename of the strings by looking up the Excel file. Then import the new .dsx file, probably into a new project for testing. Recompile all jobs. Be cautious that the names of the jobs may also need to be changed in your job control jobs or Sequencer jobs, so you have to make the necessary changes to those Sequencers.
Part 5

What is the importance of a Surrogate Key in data warehousing? Ans: A surrogate key is a primary key for a dimension table. Its main importance is that it is independent of the underlying database, i.e. the surrogate key is not affected by changes going on in the source database.

What does a Config File in Parallel Extender consist of? Ans: The config file consists of the following: a) the number of processes or nodes; b) the actual disk storage locations.

In how many places can you call Routines? Ans: Four places: (i) Transform functions (e.g. date transformation, upstring transformation), (ii) Before & After subroutines, (iii) XML transformation, (iv) Web-based transformation.

How did you handle an 'Aborted' sequencer? Ans: In almost all cases we have to delete the data it inserted from the DB manually, fix the job, and then run the job again.
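As an illustration of the surrogate key idea above, here is a hedged SQL Server-style sketch of a dimension table; the table and column names (DIM_CUSTOMER, CUSTOMER_WID, CUSTOMER_NUM) are hypothetical and not from the source:

CREATE TABLE DIM_CUSTOMER (
  CUSTOMER_WID  INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key: meaningless, warehouse-generated
  CUSTOMER_NUM  VARCHAR(20) NOT NULL,           -- natural/business key from the source system
  CUSTOMER_NAME VARCHAR(100),
  EFFECTIVE_DT  DATE,
  EXPIRY_DT     DATE
);
-- Fact tables reference CUSTOMER_WID, so changes or re-use of the source system's business key
-- do not ripple into the warehouse.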



