
Technical Paper

Optimize ETL for the Banking DDS


Maximize and optimize downstream ETL when the banking
detail data store (DDS) is deployed in a database

Optimize Downstream ETL in the Banking DDS

Background
The banking DDS is an industry data model that provides an integrated data backplane for SAS Banking Solutions. The banking
DDS has been in development since early 2001, and the first production release was in 2004. Originally, the banking DDS
supported only SAS Customer Intelligence Solutions. It has grown consistently over time to support more solutions such as SAS
Customer Analytics for Banking, SAS Credit Scoring for Banking, SAS Credit Risk for Banking, and SAS Risk Management for
Banking.
Initially, the banking DDS was supported only in SAS data sets. Based on customer demand, the banking DDS is now supported
on SAS data sets, SAS Scalable Performance Data Server (SPD Server), Oracle, Teradata, DB2, and Microsoft SQL Server.
In the overall data flow, the banking DDS is the hub or data repository for downstream SAS solutions. ETL processes flow from the
banking DDS to the solution data marts, and some solutions have ETL processes that write back the results to the banking DDS.
Figure 1 shows the overall data flow.

[Figure: data flows from the source systems (OLTP/ERP, and so on) through a staging area and the enterprise data warehouse (EDW) into the SAS Detail Data Store, and from there through ETL into the solution data marts and SAS solutions. Some SAS solutions write results back to the DDS. Master data, reference data, and transaction data are among the sources.]

Figure 1: Banking DDS Data Flow

The banking DDS uses two types of data flows: an acquisition ETL flow and a data delivery ETL flow.

Acquisition ETL Flows


The acquisition ETL flows (1, 2, and 3 in figure 1) acquire data from the source systems to populate the banking DDS tables.

Data Delivery ETL Flows


The data delivery ETL flows (4, 5, and 6 in figure 1) use the banking DDS as the source to populate the tables in the solution data
marts. Data delivery ETL is prebuilt and delivered to the customer with each SAS solution.
It is critical that the banking DDS and ETL processes be deployed for optimal performance. When the banking DDS data is not
stored in SAS data sets, optimizing performance requires special attention.

Overview
The intent of this project was to load data into the banking DDS tables in order to simulate a customer environment and learn how
to improve performance. ETL jobs deployed from SAS Risk Management for Banking were used in this project.
The banking DDS data model was deployed in SAS and three other database systems. Generated data was loaded to test the
different techniques for fine-tuning the data structure for the banking DDS and for improving the performance of selected ETL jobs.
Specifically, techniques that enabled the ETL jobs to run inside the databases were tested. The scope of the project focused on
what could be done by fine-tuning the ETL without rewriting the logic for the job.

Project Setup
The project was set up with three tiers for the testing environment. Each of the databases resided on its own server. The SAS Metadata Server was set up on a separate server, which also functioned as the SAS server. Separate client workstations were established for managing the banking DDS and SAS Risk Management for Banking ETL metadata and for executing ETL jobs.


Figure 2: Project Setup


The first step was to analyze the data available in the banking DDS to determine what ETL jobs could be run with the available
data. To determine this, it was important to know how SAS Risk Management for Banking ETL jobs were organized. Conceptually,
these ETL jobs extracted data from the banking DDS and populated SAS Risk Management for Banking (specifically, the staging
tables).

Figure 3: SAS Risk Management for Banking ETL Jobs


Some ETL jobs (specifically, staging ETL jobs) accessed data in the banking DDS and populated SAS Risk Management Banking
directly. Other ETL jobs (intermediate ETL jobs) populated SAS Risk Management for Banking intermediate tables first, and then
populated staging tables later.

Figure 4: SAS Risk Management for Banking ETL Jobs (Continued)


After analyzing the data, the best ETL jobs to run were the following:

RMB_INT_100_I_EQUITY_QUOTE

RMB_STG_230_QUOTE_EQUITY

RMB_INT_I_RISK_FACTOR

The first job takes stock quote data from the banking DDS and loads it into the intermediate tables.
The second job takes the stock quote data from the banking DDS and loads it into the final staging table format for SAS Risk
Management for Banking.
The third job creates an intermediate table for risk factor data. This table is used by other staging ETL jobs.
Attention was focused on the RMB_INT_100_I_EQUITY_QUOTE job. To provide adequate data and to obtain meaningful results,
the banking DDS EQUITY_QUOTE table was populated with 200 GB of data (2,029,962,547 rows of data).

Metadata Environments
To test ETL job performance with the banking DDS deployed in SAS and the three other database systems, the same banking
DDS tables and ETL jobs had to be deployed in separate metadata environments. Custom repositories were set up under the SAS
Metadata Server. Each environment had a custom repository.

Testing the Performance


The goal of this project was to measure various performance enhancements and to limit changes to the ETL jobs. By limiting
changes, performance enhancements could be applied quickly and easily, and the original data flow design could be maintained.
No additional data tables would need to be created, and no data tables would need to be modified. Tests could be run repeatedly,
and performance could be measured. In the end, there would be consistency in the results.


Out-of-the-Box ETL Jobs


The ETL jobs provided with SAS Risk Management for Banking populated three general types of tables: the underlying work tables
that were created in SAS Data Integration Studio, the intermediate tables, and the staging tables. For the first test, the banking
DDS tables and all of the downstream tables (work, intermediate, and staging) were deployed on SAS. Then, in three separate
tests, all of the tables were deployed on each of three different database systems. In the following examples, attention was focused
on one ETL job that created the intermediate table that provided the equity quote data for SAS Risk Management for Banking.

Figure 5: Workflow of RMB_INT_100_I_EQUITY_QUOTE Job


The RMB_INT_100_I_EQUITY_QUOTE job extracted all rows and a subset of columns and loaded the data into a temporary data set. This data was aggregated by the maximum quote date into a second temporary data set, and then joined with the source DDS data table to extract additional columns. As a result, a third and final temporary data set was created and loaded into an intermediate table. The logical workflow is shown in Figure 6. Optimizing this ETL job depended on the locations of the source and target data in the environment. Localizing the transfer of data from one table to another was the key to optimal performance, and the original data flow design was maintained.

Figure 6: Logical Workflow of RMB_INT_100_I_EQUITY_QUOTE Job


In this project, SAS code was extracted from the job and then modified. This choice localized the processing to the database, instead of SAS performing the processing outside of the database. To accomplish this, PROC SQL code was rewritten to use in-database processing or explicit pass-through. Explicit pass-through is invaluable when an implicit SQL query is disqualified. Disqualification occurs when the code violates implicit pass-through rules. The following figure shows the differences between implicit pass-through and explicit pass-through.

[Figure: side-by-side comparison of explicit pass-through and implicit pass-through code.]
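As a sketch of the difference (the library, table, and column names here are hypothetical), an implicit query is written against a SAS libref and translated by the SAS/ACCESS engine, whereas an explicit query is sent to the database verbatim:

```
/* Implicit pass-through: SAS translates the SQL and submits what
   it can to the DBMS. A SAS-only function in the query can
   disqualify it, forcing the rows to be processed in SAS. */
proc sql;
   create table work.latest_quotes as
   select equity_id, max(quote_dttm) as max_dttm
   from ddsint.equity_quote      /* ddsint = libref assigned to the database */
   group by equity_id;
quit;

/* Explicit pass-through: the query inside FROM CONNECTION TO is
   sent to the database as-is and executes entirely in-database. */
proc sql;
   connect using ddsint;         /* reuse the libref's connection */
   create table work.latest_quotes as
   select * from connection to ddsint
      (select equity_id, max(quote_dttm) as max_dttm
       from equity_quote
       group by equity_id);
   disconnect from ddsint;
quit;
```

With explicit pass-through, only the aggregated result set crosses the network into SAS, regardless of how the DBMS executes the inner query.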

Performance Scenario 1: Data Sets in SAS


For this scenario, the SAS environment was set up on a single server (the SAS Metadata Server) to record the time it took to move
the 200 GB source data set (the EQUITY_QUOTE table) from the DDS to the intermediate data set. The initial step was the most
time-consuming because it moved a subset of columns and all rows and created a 136.5 GB temporary data set. The work area for
this server was located on a different file system of the same server. The I/O cost of reading the data from one file system and
writing it to a second file system was high. Writing the data to tables on a second file system is a single-threaded operation, which
hampers performance.

Figure 7: Logical Overview of ETL with All SAS Data Sets

Performance Scenario 2: Source DDS Data Table in Database


For this scenario, only the source DDS data table was hosted on the database. The I/O cost of moving the data from the source DDS data table to the work table was high. Reading the data from a remote database server and writing it to the local work file system across the network increased the I/O cost. Operations after the data move took amounts of time similar to those in performance scenario 1.


Figure 8: Logical Overview of ETL with Source DDS Data Table Hosted on Database

Performance Scenario 3: All Tables in Database


For the final scenario, all tables (source, transitional, and target) were hosted in a database. Because each database has different specifications and node counts, it would be difficult to compare the databases' performance. Instead, the performance benefits of moving localized data and having parallelized read and write capabilities were analyzed. Each massively parallel processing (MPP) database differs in its coordination, aggregation capabilities, and data transfer technology. In general, for an MPP database environment, data resides on many nodes, is distributed by a hashing algorithm, and is managed by software. SQL is formatted by the software, and work is transferred to individual nodes. The final result data sets are combined and transferred to a client. For this ETL job running on an MPP DBMS, distributing the data using the same data set key allows the data and work to remain local to each node and avoids rehashing the data across the cluster. The goal is to do as much work as possible in the database (locally) before transferring the data out of the database.
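For example, when a table is loaded through a SAS/ACCESS libref, the DBCREATE_TABLE_OPTS= data set option can append DBMS-specific clauses, such as a distribution key, to the generated CREATE TABLE statement. The sketch below uses Teradata-style syntax; the libref, the option text, and the key choice are assumptions that must match your DBMS and data:

```
/* Hypothetical libref DDSINT assigned to the MPP database.        */
/* Distribute EQUITY_QUOTE on QUOTE_DTTM (a high-cardinality key)  */
/* so that joins and aggregations on that key stay node-local.     */
data ddsint.equity_quote (dbcreate_table_opts='PRIMARY INDEX (quote_dttm)');
   set work.equity_quote_stage;
run;
```

Other MPP databases use different clauses (for example, a DISTRIBUTE BY or DISTSTYLE clause), so the option text must be written for the target DBMS.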

Figure 9: Logical Overview of ETL with All Data Tables Hosted on Database

Reducing the I/O Bottleneck


In the default EQUITY_QUOTE job, a SAS Extract job was used to immediately copy a subset of columns from the EQUITY_QUOTE table into a SAS data set. The EQUITY_QUOTE table consisted of 2 billion rows of data, which averaged out to be approximately 200 GB of data. This extraction required creating approximately 2 billion rows in a SAS data set. The I/O cost of this job was severe, and the job's performance was heavily impeded. After the data was moved into SAS, the various tasks in the SAS Extract job took considerably less time. The following figure shows the workflow of the SAS Extract job and its different tasks.

Figure 10: Default EQUITY_QUOTE Job Workflow and Tasks


Task 1: Extract all rows, but only 6 of the 9 columns, from the source DDS data table. In the default EQUITY_QUOTE job, this is the task where data is transferred from the source DDS data table to a SAS data set. Task 1 creates a new data set in SAS by copying 2 billion rows from the source DDS data table.
Task 2: Filter the data with an aggregate function and a GROUP BY clause. The output is only about 565,000 rows of data, which averages to approximately 30 to 50 MB.
Task 3: Join the resulting work data set from task 2 with the EQUITY_QUOTE table to get two columns of data that are not part of the extraction in task 1.
Task 4: Load the data into the I_EQUITY_QUOTE table, which is a SAS data set.
To reduce the bottleneck of creating 2 billion rows (~200 GB) of data in a SAS data set, this creation of data needed to be delayed
in the workflow. The optimal time to copy the 2 billion rows of data was after the SQL joins had filtered the data. Instead of
extracting data from the source DDS data table and using it to create a SAS data set in task 1, tasks 1 through 3 were performed in
the database. Then, the data was copied to a SAS data set. The data was copied to SAS by sending the output of task 3 to the
SAS data set. The techniques described in Appendix A: SAS Data Integration Studio Techniques were used. The default job
workflow and tasks were modified as shown.

Figure 11: Modified EQUITY_QUOTE Job Workflow and Tasks


Specifically, each task was modified as described in the following paragraphs, and the I/O bottleneck was significantly reduced.
Task 1: Within the database, an SQL join creates a view of the source DDS data table instead of copying 2 billion rows of data from it. Because a view is created, the data that already resides in the source DDS data table is not duplicated.
Task 2: Within the database, filter the data with an aggregate function and a GROUP BY clause. Store the output in a table in the database. The output includes about 565,000 rows (approximately 30 to 50 MB).


Task 3: Join the resulting in-database table from task 2 with the EQUITY_QUOTE table to get two columns of data that are not part of the extraction in task 1. The output of this task is loaded into a SAS data set. The I/O cost is reduced because only 565,000 rows of data are created, not 2 billion. In summary, only 30 to 50 MB of data was moved instead of 200 GB.
Task 4: Load the data into the I_EQUITY_QUOTE table, which is a SAS data set.
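A minimal sketch of the modified flow (library, table, and column names are hypothetical) pushes the view, the aggregation, and the join into the database with explicit pass-through, and pulls back only the filtered rows:

```
proc sql;
   connect using ddsint;   /* libref assigned to the database */

   /* Tasks 1-3 execute in the database; only the ~565,000
      filtered rows are transferred back into SAS (task 4). */
   create table work.i_equity_quote_stage as
   select * from connection to ddsint
      (select q.equity_id,
              q.quote_dttm,
              q.quote_amt,
              q.currency_cd
       from equity_quote q
       inner join
            (select equity_id,
                    max(quote_dttm) as max_quote_dttm
             from equity_quote
             group by equity_id) m
       on q.equity_id = m.equity_id
          and q.quote_dttm = m.max_quote_dttm);

   disconnect from ddsint;
quit;
```

The inner derived table replaces the 136.5 GB temporary data set of the default job; the DBMS never materializes the full column subset outside its own storage.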

Conclusions
In our testing, minimal changes were made to the source DDS data table (EQUITY_QUOTE). The ETL jobs that were tested used
the highest volume of data that was available. By using the EQUITY_QUOTE table (the largest source DDS data table), the ETL
job had minimal filter and join conditions. This information governed how data was distributed in an MPP database. The sample
data had the highest cardinality data in the QUOTE_DTTM column. This made the QUOTE_DTTM column a good candidate for
the distribution key. Because data is different for each customer, there is no one-size-fits-all method for optimizing performance.
The strategy should be to minimize I/O by localizing the work, intermediate, and stage tables. If tables can be localized, there will
be a performance improvement.
The default ETL job created many data sets, and data was moved from one location to another. Moving this data caused the largest I/O bottleneck. The largest move within this bottleneck was the initial move of data from a 200 GB source DDS data table to a 136.5 GB SAS data set. By localizing the data and moving it from one table to another within the database, or by removing the data move altogether by creating a view in the database, job time was cut drastically.
The results are displayed in the following table. Each database environment was running on different hardware, so the individual gain in performance matters more than a comparison across databases. For this reason, the databases are not labeled. SAS values are provided as benchmarks.

              Total Average Time for    Total Average Time for    Performance Gain
              Non-Optimized ETL         Optimized ETL             Total (in hours)
              (in hours)                (in hours)

SAS           1.65                      N/A                       N/A
Database A    Over 8.00                 0.10                      Over 8.00
Database B    6.58                      0.13                      6.59
Database C    Over 8.00                 1.62                      Over 8.00

SAS solutions such as SAS Risk Management for Banking often include out-of-the-box ETL jobs. When either the source table or the target table of these ETL jobs is changed from a SAS data set to a database table, ETL performance is affected. In our testing, we dramatically improved ETL performance in SAS Risk Management for Banking by making minor changes to the ETL jobs provided with the solution. Although these changes were made to a specific set of ETL jobs, the underlying techniques can be used to change any ETL job. The bottom line: performance is optimized when the ETL code executes in the database.


Appendix A: SAS Data Integration Studio Techniques


This appendix describes the SAS Data Integration Studio techniques that were used to change the out-of-the-box jobs to execute
in a database.

Check Database Processing


The Check Database Processing button (highlighted in Figure A.1) enables a user to check whether the code that SAS Data Integration Studio generates can run in the database.

Figure A.1: The Check Database Processing Button


After the code has been checked, nodes that will execute in the database are marked with a database-specific icon in the upper
right corner of the node. An example is shown in Figure A.2.

Figure A.2: Nodes That Will Execute in the Database


After the code has been checked, the user can check the SAS log to determine what code ran in the database. By default, the log contains limited information for making this determination. Specifying the SASTRACE option in the pre-code of a job adds information to the log that can help determine whether code ran in the database. The command to specify the SASTRACE option is:
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
Figure A.3 shows a SAS log after specifying the SASTRACE option. The left side contains a simple CREATE TABLE statement
that selects from a single table. (This code was created by an SQL join node with a single input.) The right side contains a section
of the log that is generated by the SASTRACE option. On the right side, the user can see from the SRC line that the table
W58REMXP is coming from the DDSINT library.


Figure A.3: SQL Join with Single Input (left) and Output Generated by the SASTRACE Option (right)

Replace SAS Extract with SQL Join


In SAS Data Integration Studio, the SAS Extract job is not supported to execute completely in the database. If the user clicks the Check Database Processing button, the message "Unable to completely process on [database]" appears. A workaround is to replace the SAS Extract job with an SQL join that has a single input. A single-input SQL join performs the same logical function as the SAS Extract job and gains the ability to generate SAS pass-through code. By default, an SQL Join node has two input ports. To emulate the SAS Extract job, one input port must be removed. Figure A.4 shows how to remove the extra input port.

Figure A.4: Removing an Extra Input Port


Generating Pass-Through Code


Once the SAS Extract job has been replaced with an SQL Join node, the SQL Join node can be configured to generate SAS pass-through code so that it can execute in the database. To set the pass-through property, double-click the SQL Join node. The SQL Join Properties window is displayed. Set the Pass Through property to Yes. Figures A.5 and A.6 show the SQL Join Properties window and the generated pass-through code.

Figure A.5: The SQL Join Properties Window

Figure A.6: Generated Pass-Through Code

Replacing Functions Specific to SAS


Another reason that code might not execute in the database is that functions specific to SAS have not been converted to database equivalents. For example, the DATEPART function is specific to SAS, and not every SAS/ACCESS engine converts it to the equivalent database expression. Such a function will not execute in the database until an equivalent database expression is used instead.
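As an illustration (the libref, table, column, and DBMS syntax here are assumptions), an explicit pass-through query can replace the SAS-only DATEPART function with the database's own date conversion so that the query stays in the database:

```
/* SAS-only form: DATEPART() may not be translated by the
   SAS/ACCESS engine, which can disqualify implicit pass-through
   and pull every row into SAS for processing. */
proc sql;
   create table work.quote_days as
   select datepart(quote_dttm) as quote_dt format=date9.
   from ddsint.equity_quote;
quit;

/* Explicit pass-through form: use the database's CAST instead,
   so the conversion runs in-database (exact syntax varies by DBMS). */
proc sql;
   connect using ddsint;
   create table work.quote_days as
   select * from connection to ddsint
      (select cast(quote_dttm as date) as quote_dt
       from equity_quote);
   disconnect from ddsint;
quit;
```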

Redirecting SAS Work Tables to the Database


Creating data in SAS Work tables can cause another I/O bottleneck. By default, the output node of an SQL Join is a SAS Work table. Even if an SQL join is executed in the database, there will be extra I/O to transfer the data from the database to the SAS Work table, and vice versa. To avoid this bottleneck, redirect the output of the SQL Join node to a library created in the database. Right-click the output node of an SQL Join, and click Properties.


Figure A.7: Select Properties


After the SQL Join Properties window appears, click the Physical Storage tab, and change Location from Job's default library for temporary tables to Redirect to a registered library. Choose the same database library that you are using for the source DDS data table.
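The registered library that the output is redirected to is an ordinary SAS/ACCESS library. A sketch of such a libref follows; the engine, connection options, and schema are placeholders for your environment:

```
/* Hypothetical Oracle libref for the DDS schema. Substitute the   */
/* engine (teradata, db2, sqlsvr, ...) and connection options for  */
/* your DBMS. Registering this library in metadata allows the SQL  */
/* Join output node to be redirected to it, so transitional tables */
/* stay in the database instead of landing in SAS Work.            */
libname ddsint oracle
   path=ddsprod
   user=dds_etl password=XXXXXXXX
   schema=ddsint;
```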

Figure A.8: Redirect the Output