
Pushdown Optimization: A faster approach to data warehousing.

Ankit Solanki
www.informaticatutorial.co, Mumbai, India
ankitsolanki@informaticatutorial.co | ankit.solanki90@gmail.com

Abstract — Data within organizations is growing at an exponential rate, and C-suite officers are asking their business analysts for more data in less time. The conventional methods of business intelligence are being tweaked from every possible angle to meet these expectations. Pushdown optimization is a way of balancing load among servers in order to achieve optimal performance. With pushdown optimization, the data transformation logic is pushed to the database engine rather than being handled in the data integration server. This saves time, reduces dependency on the network because unnecessary I/O is avoided, and makes optimum use of the powerhouse of a data warehousing architecture, the database engine.

Keywords — Pushdown Optimization

I. INTRODUCTION

Over the next decade and beyond, the two variables that will gain the most attention in the enterprise data integration equation are more relevant data and reduced time. With data volumes increasing at an exponential rate, the question becomes: which data integration strategy can effectively manage hundreds of terabytes of data with enough flexibility and adaptability to cope with future growth? The need for data integration arose because of the diverse platforms used to build applications. Different applications used different relational database systems (Oracle, DB2, etc.) or other storage techniques such as flat files. Because the formats in which data was stored differed, a method was needed to convert all of this data into a common, standard format that could later be used for diverse purposes such as reporting or mining for particular trends. It all started with hand-coded programs that extract data from diverse source systems, apply the necessary business/transformation logic and then populate a target area, be it a staging area, a data warehouse or another application interface. Hand coding demanded a lot of effort and was therefore replaced, in many instances, by data integration software that performs the access, extraction, discovery, integration, transformation and loading of data using an engine or data integration server, along with visual tools to monitor, implement and execute the desired process. In traditional data warehousing approaches, when data is extracted from diverse sources it is transformed into the format of the target table and the necessary business logic is applied. To do this, a third-party appliance such as

Informatica PowerCenter from Informatica Corp. (the most widely used) or Talend Open Studio (an open-source data integration tool) is used. In this approach the data from the sources is loaded into the data integration server, where it is transformed and then loaded into the target. The performance of pulling data into an integration service or engine, transforming it there and then loading it into the target depends on two important factors: the network and the processing power of the transformation engine. When the source and target are on two different databases this approach is reasonable; since the data has to be moved from one database to another anyway, it is legitimate to transform it in a separate, specialized engine. But when both the source and target are on the same database, this is not the best way to process the data. Bulk export and import are not the strengths of relational databases, so with the above approach most of the time is lost unloading the data and loading it back after transformation, which makes it inefficient. This is where pushdown optimization should be used: the transformation logic is pushed onto the database and the transformation is carried out by the database engine rather than a third-party data integration tool. This improves performance because a lot of time is saved, as the data is never exported from or imported back into the database.

II. HISTORY OF DATA LOADING

Data loading strategies have evolved over time as the requirements have changed. Today the emphasis is on getting more data in less time. Historically, there have been four approaches to data integration.

A. Hand Coding

Since the early days of data processing, IT has attempted to solve integration problems through the development of hand-coded programs. These efforts still proliferate in many mainframe environments, data migration projects, and other scenarios where manual labour is applied to extract, transform, and move data for the purposes of integration. The high risks, escalating costs, and lack of compliance associated with hand-coded efforts are well documented, especially in today's environment of heightened regulatory oversight and the need for data transparency. Early on, automation solutions emerged to replace hand coding as a more cost-effective alternative.

B. Code Generators

The first attempts at increasing IT efficiency led to code generation frameworks that used visual tools to map out processes and data flows but then generated and compiled code as the resulting run-time solution. Code generators were a step up from hand coding for developers, but the approach did not gain widespread adoption: as solution requirements and IT architecture complexity grew, the issues around code maintenance, lack of visibility through metadata, and inaccuracies in the generation process led to higher rather than lower costs.

C. RDBMS-Centric SQL Code Generators

An offspring of the early code generators emerged from the database vendors themselves. Using the database as an engine and SQL as a language, RDBMS vendors delivered offerings centered on their own flavor of database programming. Unfortunately, these products exposed the inability of the SQL language and database-specific extensions (e.g., PL/SQL, stored procedures) to handle cross-platform data issues; XML data; the full range of functions such as data quality, profiling, and conditional aggregation; and the rest of the business logic needed for enterprise data integration. What these products did prove was that, for certain scenarios, the horsepower of the relational database can be used effectively for data integration.

D. Metadata-Driven Engines

This is a data integration approach that uses a data server, or engine, powered by open, interpreted metadata as the workhorse for transformation processing. It addressed complexity and met the needs for performance, and it provided the added benefit of reuse and openness thanks to its metadata-centricity. Figure 1 [1] shows this engine-based data integration approach.

Fig. 1 Metadata-driven approach to data integration.

III. THE COMBINED ENGINE- AND RDBMS-BASED APPROACH FOR DATA INTEGRATION

The metadata-driven engine approach is the one largely followed by organizations across the globe, but the thirst for a better way to save time and improve performance has always encouraged innovative methods. In some scenarios it was realized that it would be better to perform the transformation of the data inside the relational database itself. Questions arose such as: why are we pulling the data to an integration server when all we plan to do is load it into a stage table? Doesn't this require a lot of I/O? Can't we simply implement this at the database level rather than loading the data into an integration server, and save time? This is how a new concept called pushdown optimization evolved. When both the source and the target are on the same relational database system, this method comes in handy: the transformation logic from the metadata engine can be pushed onto the relational database. Relational databases are better known for their processing power than for their data importing and exporting capabilities. So in the combined approach the data is not pulled into the metadata engine; instead, the transformation logic is pushed onto the relational database. This reduces I/O and boosts performance.

IV. PUSHDOWN OPTIMIZATION

Pushdown optimization (PDO) is a feature provided by many data integration tools that allows transformation logic to be pushed into the source and/or target databases. It offers the dual benefit of the constructs of an integration tool combined with the speed and power of the database engine. Consider a scenario where the source and target tables are in the same relational database. Some rows from the source table(s) need to be transformed and loaded into the target table, which could also be a stage table used later for loading other tables or databases. The traditional way to do this would be ETL (Extract, Transform and Load).

Fig. 2 Traditional ETL practice

Data would be extracted from the source table and loaded into the working area of the metadata engine or integration service. The transformation or business logic developed in the integration service would transform the data, and after transformation the data would be loaded into the target table. The points to be highlighted in this approach are:

1) Unnecessary I/O. Both the source and target tables are in the same relational database. Extracting the data to an external server, then

transforming it there and, after transformation, loading it back into the same relational database requires a lot of I/O. If the table has millions of rows, pulling these rows costs a lot of time, network bandwidth, etc.

2) Inefficient utilization of resources. The processing power of a relational database is typically greater than that of a third-party data integration tool. By implementing the business logic in the integration service, the powerhouse of the system, the database engine, is underutilized.

Rather than using the traditional ETL approach, we could use pushdown optimization, in which the business logic from the integration service or data integration tool is pushed onto the relational database. A minimal sketch contrasting the two approaches is given below.
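To make the contrast concrete, the following Python sketch (illustrative only; the table names, column names and the 10% uplift rule are hypothetical assumptions, not taken from any specific tool or schema) shows a traditional ETL flow that pulls rows out to the integration layer and pushes them back, versus a pushdown-style flow where the same logic is expressed as a single SQL statement executed entirely inside the database.

import sqlite3

# Illustrative sketch only: src_orders / stg_orders and their columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER, amount REAL, status TEXT);
    CREATE TABLE stg_orders (order_id INTEGER, amount_usd REAL);
    INSERT INTO src_orders VALUES (1, 100.0, 'OK'), (2, 55.5, 'CANCELLED'), (3, 70.0, 'OK');
""")

def traditional_etl(conn):
    # Extract the rows to the "integration server" (here just a Python list),
    # transform them in application code, then load them back: two round trips.
    rows = conn.execute("SELECT order_id, amount, status FROM src_orders").fetchall()
    transformed = [(oid, round(amt * 1.1, 2)) for (oid, amt, st) in rows if st == 'OK']
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?)", transformed)

def pushdown_style(conn):
    # Push the same filter/derivation logic down as a single SQL statement;
    # the rows never leave the database engine.
    conn.execute("""
        INSERT INTO stg_orders (order_id, amount_usd)
        SELECT order_id, ROUND(amount * 1.1, 2)
        FROM src_orders
        WHERE status = 'OK'
    """)

pushdown_style(conn)  # same result as traditional_etl, without the round trips
print(conn.execute("SELECT * FROM stg_orders").fetchall())  # [(1, 110.0), (3, 77.0)]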

Fig. 3 Pushing the transformation logic from the integration service onto the database.

The ETL logic is pushed to the database, so there is no need to import the data into the integration service, data integration tool or ETL tool. In addition, the processing is done by the powerhouse, i.e. the database engine. After transformation the data is loaded into the same database, so again no additional data movement is required.

V. TYPES OF PUSHDOWN OPTIMIZATION

There are several ways in which pushdown optimization can be implemented.

1) Two-pass pushdown processing. Pushdown processing is based on a two-pass scan of the mapping metadata. In the first pass, the ETL tool scans the mapping objects starting from the source definition object and moving towards the target definition object. When the scan encounters an object containing data transformation logic that cannot be represented in SQL, the scan stops, and all transformations upstream of this object are grouped together into equivalent SQL for execution inside the source system. In the second pass, the ETL tool scans in the opposite direction (i.e., from the target definitions towards the source definitions). When the scan encounters an object whose transformation logic cannot be represented in SQL, the scan stops, and all transformation objects downstream of this object are grouped together into equivalent SQL for execution inside the target system. The ETL tool executes any remaining data transformation logic [1]. A sketch of this scan appears below.
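The following is a minimal sketch of such a two-pass scan. The mapping representation (an ordered list of transformation names), the set of SQL-translatable transformation types and the function name are hypothetical assumptions made for illustration; a real tool works on far richer metadata.

# Illustrative sketch of a two-pass pushdown scan over a linear mapping.
# Which transformation types count as "SQL-translatable" is an assumption for the example.
SQL_TRANSLATABLE = {"source", "filter", "joiner", "aggregator", "expression", "sorter", "target"}

def two_pass_scan(mapping):
    """Return (pushed_to_source, executed_in_etl_tool, pushed_to_target)."""
    # Pass 1: scan from the source towards the target, stop at the first
    # transformation that cannot be represented in SQL.
    i = 0
    while i < len(mapping) and mapping[i] in SQL_TRANSLATABLE:
        i += 1
    # Pass 2: scan from the target back towards the source.
    j = len(mapping)
    while j > i and mapping[j - 1] in SQL_TRANSLATABLE:
        j -= 1
    source_group = mapping[:i]   # grouped into SQL run inside the source system
    etl_group    = mapping[i:j]  # remaining logic executed by the ETL tool itself
    target_group = mapping[j:]   # grouped into SQL run inside the target system
    return source_group, etl_group, target_group

# Example: an update-strategy step cannot be expressed in SQL, so it stays in the tool.
print(two_pass_scan(["source", "filter", "aggregator", "update_strategy", "expression", "target"]))
# (['source', 'filter', 'aggregator'], ['update_strategy'], ['expression', 'target'])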

2) Partial pushdown processing. Partial pushdown processing occurs when either the source and target systems are in different database instances (for example, Oracle and DB2), or only some of the data transformation logic can be represented in SQL. In such cases, some processing may be pushed into the source database, some processing occurs inside the integration service or ETL tool, and some processing may be pushed to the target database.

Fig. 4 Partial pushdown

In the figure above [1], all transformations up to and including the Aggregator transformation are pushed into the source database. The Update Strategy transformation is executed within the ETL tool, since the database cannot perform it without loading the entire table from the other database into memory, and the Expression transformation is executed inside the target database, since it involves only minor trims or concatenation logic that can be handled with simple SQL [1].

3) Full pushdown processing. Full pushdown processing occurs when the source and target relational database management systems are the same instance and the data transformation logic can be completely represented in SQL. In this case, the integration service or ETL tool pushes the entire mapping down into the database system. A sketch of the kind of SQL this produces follows.
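As a rough illustration only (the table names, column names and transformation logic below are hypothetical and not taken from any particular tool), a fully pushed-down mapping that filters, joins, aggregates and sorts data between tables in the same database could be generated as a single INSERT ... SELECT statement executed entirely by the database engine:

import sqlite3

# Hypothetical example of the single statement a full pushdown might generate.
PUSHED_DOWN_SQL = """
INSERT INTO tgt_customer_sales (customer_id, customer_name, total_amount)
SELECT c.customer_id,
       UPPER(c.customer_name),          -- expression logic
       SUM(o.amount)                    -- aggregation logic
FROM   src_customers c
JOIN   src_orders    o ON o.customer_id = c.customer_id   -- join logic
WHERE  o.status = 'OK'                  -- filter logic
GROUP BY c.customer_id, c.customer_name
ORDER BY c.customer_id                  -- sort logic
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_customers (customer_id INTEGER, customer_name TEXT);
    CREATE TABLE src_orders (customer_id INTEGER, amount REAL, status TEXT);
    CREATE TABLE tgt_customer_sales (customer_id INTEGER, customer_name TEXT, total_amount REAL);
    INSERT INTO src_customers VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO src_orders VALUES (1, 10.0, 'OK'), (1, 5.0, 'OK'), (2, 7.0, 'CANCELLED');
""")
conn.execute(PUSHED_DOWN_SQL)
print(conn.execute("SELECT * FROM tgt_customer_sales").fetchall())  # [(1, 'ALICE', 15.0)]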

Fig. 5 Full pushdown

In the figure above [1], the sources and targets are the same instance, and the data transformation logic, instead of being carried out in the integration service, can be pushed to the database. The work of filtering, joining, and sorting the data is performed by the database, freeing integration service resources for other tasks [1].

Benefits of pushdown optimization: Implementing pushdown optimization in a data warehousing environment has proved very beneficial for organizations with large data warehouses. The benefits are:

1) Increased performance. Performance improves substantially as I/O decreases, since the whole process, i.e. extraction, transformation and loading, happens on the database engine rather than on a third-party appliance or tool.

2) Optimized resource utilization. By using the database engine to perform the necessary transformations, the powerhouse of the system is used to its optimum level.

3) Less time. Since all operations are performed on one server, i.e. the database engine, the number of I/O operations is reduced and a lot of time is saved because data no longer needs to be exported and imported.

VI. CONCLUSIONS

Today's challenges of saving costs while driving revenue are pushing organizations to examine their current data integration infrastructure needs and to choose solutions that provide flexibility and maximum leverage of existing assets. Pushdown optimization gives IT organizations the flexibility to optimize performance in response to changing runtime demands, peak processing needs, or other dynamic aspects of the production environment, helping them achieve cost-effective scalability and performance. Delivering a combined engine-centric and RDBMS-centric approach to data integration in a single, unified platform ensures strong performance across the broad spectrum of data integration projects and helps IT save costs through the intelligent use of existing computing resources.

ACKNOWLEDGMENT

I wish to acknowledge Gagan Anand and Rajshekar Upadhaya, who have been my Informatica gurus. I would also like to thank Rhushikesh Dahiwale for constantly hearing out my ideas and offering suggestions.

REFERENCES
[1] http://www.informatica.com/Images/06074_6675_pushdownoptimization.pdf
[2] Push down optimization in a distributed, multi-database system [Online]. Available: http://www.google.co.in/patents?hl=en&lr=&vid=USPAT5590321&id=S7YoAAAAEBAJ&oi=fnd&dq=pushdown+optimization+&printsec=abstract#v=onepage&q=pushdown%20optimization&f=false
