
A Comparative Study of ETL Tools

Sana Yousuf
Department of Computer Science
Military College of Signals, National University of Sciences & Technology
Islamabad, Pakistan
sn_ysf@yahoo.com

Sanam Shahla Rizvi
Department of Computer Science
Military College of Signals, National University of Sciences & Technology
Islamabad, Pakistan
ssrizvi@mcs.edu.pk

Abstract: In many organizations valuable data is wasted because it is scattered across different formats and various resources. Data warehouses (DWs) are complex systems holding consolidated data, with the objective of assisting knowledge workers in the decision making process. The key components of DWs are the Extraction-Transformation-Loading (ETL) processes. Since incorrect or misleading data may produce wrong decisions, the selection of appropriate ETL tools for a DW is necessary to improve data quality. The selection of an ETL tool is a complex and important issue in data warehousing because it determines the quality of a data warehouse. This paper first briefly highlights the ETL process, then discusses some of the available ETL tools along with general criteria used as measuring parameters for selecting appropriate ETL tools. At the end, an analysis of the tools based on the generalized criteria is presented to give an insight into which tool is better for which circumstance.

Keywords: Data warehouses, ETL tools, complex systems, enterprise systems

I. INTRODUCTION
A data warehouse is a large data repository that consolidates various types of data transformed into a single suitable format. Depending on specific business needs it can be architected differently. In general, however, data stored in operational databases is transferred to a data warehouse pre-processing platform, also known as the staging area, then after processing into the data warehouse, and lastly is transformed into sets of conformed data marts.

A. ETL Process and Concepts
Extract, Transform and Load (ETL) is an important component of the data warehousing architecture. The process includes extraction of data from various data sources, transformation of the extracted data according to business requirements, and loading of that data into the data warehouse.
Any programming language can be used to build an ETL process; however, building it from bits and pieces is quite complex. Various ETL tools are available in the market, allowing an enterprise to select one based on its requirements and needs. With the passage of time these tools have matured and now provide much more than just extraction, transformation and loading of data. The improvements include capabilities such as data profiling, data quality control, monitoring and cleansing, real-time and on-demand data integration in a service oriented architecture, and metadata management [12]. Moreover, ETL tools are now customizable according to the functional requirements of an enterprise data warehouse.

a) Extraction
Being the first step in the ETL process, its focus is on extracting data from different source systems. These are called source systems because they can be internal, external, structured or unstructured, i.e. of any type. Thus source systems could be mainframe applications, flat files, ERP applications, relational databases, non-relational databases, CRM tools or even message queues. These sources may have different data formats, i.e. different internal representations, which makes extraction a difficult process. So an extraction tool should be able to:
- Understand all the different data storage formats
- Communicate with various relational databases
- Read and understand the different file formats used in an organization
- Extract only relevant data before bringing it into the DW

b) Transformation
The transformation phase ensures data consistency and performs data cleansing before the data is loaded into the data warehouse. In order to transform the data properly, a number of rules and business calculations are applied to the extracted data so that different data formats are mapped into a single format. Transformation can be integrated with the extraction or the loading phase, depending upon when it is performed.

c) Loading
After the extracted data has been transformed and cleansed, it is loaded into the fact and dimension tables of the data warehouse to be used for various analytical purposes. Loading is done regularly so that data does not pile up. It can be required in one of two situations:
- Load the new data that is currently contained in the operational database
- Load the updates corresponding to the changes that occurred in the operational database
Reference [3] states that incremental loading is the preferred approach to data warehouse refreshment because it generally reduces the amount of data that has to be extracted, transformed, and loaded by the ETL system. ETL jobs for incremental loading require access to source data that has been changed since the previous loading cycle. For this purpose, so-called Change Data Capture (CDC) mechanisms at the sources can be exploited, if available. Additionally, ETL jobs for incremental loading potentially require access to the overall data content of the operational sources.
The paper gives an insight into the background of ETL tools in the following section. Section III presents a brief overview of the various ETL tools. Section IV focuses on setting the criteria to rank the available tools. Section V presents a comparative analysis of the various tools. The paper ends with a conclusion of the overall study in Section VI.

II. BACKGROUND OF ETL TOOLS
An ETL tool must provide a certain set of basic ETL processing facilities, as explained in Section I, to rank as a proper ETL tool. Since 2003 Passionned, a consultancy and research firm, has been closely monitoring the market for both ETL and data integration tools [4]. Earlier surveys were based on the main market driving entities, also known as visionaries. Many organizations used to assume that they had automatically made the right choice if they purchased a tool from one of the market leaders. However, the trend changed over time and organizations started building ETL tools themselves according to their own requirements.
Since the late nineties, all the major business intelligence (BI) vendors have purchased or developed their own ETL tools. BI tools had more reliable ETL processes and a well designed method of maintaining the data warehouse. BI provided a better solution, but it consumed 70-80% of the costs involved in a successful BI system.
Passionned, in its ETL Tools survey of 2009, described the importance of evaluating and promoting ETL tools, because many organizations still built their data warehouses by hand, i.e. by writing complex PL/SQL or SQL and stored procedures. The surveyors' main point was that developer productivity would increase by a factor of 3-5 if a proper ETL tool was used. Thus, if proper guidance were available to enterprises, choosing the right product would become easier and less risky for the organization itself. As explained by reference [5], construction of data warehouses through ETL tools resulted in a better, more stable and more reliable data warehouse that allowed more aspects to be checked and monitored in relation to each other. Companies also present, on their own official websites, a comparison of their product with those of market competitors; Adeptia [10], Microsoft SSIS and Informatica [3] are such examples.

III. SOME FAMOUS ETL TOOLS
Some famous ETL tools available in the market are as follows:

A. Pentaho Data Integration
Pentaho [12] is a commercial open-source Business Intelligence suite that includes a data integration product named Kettle. It uses a metadata-driven approach, is fast and has an easy to use GUI. Started in 2001, it has grown and today has a strong community of 13,500 registered users. It also supports multi-format data and allows data movement between many different databases and files.

B. Talend Open Studio
Talend Open Studio (TOS) [10] is another open-source tool that supports data integration. Started in 2006, it has a smaller community of followers but still holds a notable market share, as two of its supporters are finance companies. Rather than being metadata driven, it uses a code driven approach and has a GUI for user interaction. Its code generation feature produces executable Java and Perl code that can be run later on a server.

C. Informatica Power Center
Informatica Power Center (IPC) [3] is not open-source software but is a commercially recommended data integration suite and the market share leader in data integration tools. Founded in 1993, it has made its place in the market with consistency and leadership; today it has 2600 registered users, of which 100 are included in the list of stock exchange companies. The main focus of IPC is on data integration, with numerous capabilities, e.g. enterprise size architecture, data cleansing, data profiling, web servicing and interoperability with current and legacy systems.

D. Inaplex Inaport
Inaplex [12] provides mid-market solutions focusing on customer data integration for customer relationship management. Besides customer relationship management, it also lays emphasis on providing simple solutions for data integration and accountancy handling.

E. Oracle Warehouse Builder
The Oracle Warehouse Builder (OWB) [13] is a comprehensive tool for ETL, relational and dimensional modeling, data quality, data auditing, and full lifecycle management of data and metadata [13]. It offers high performance, security and scalability by using the Oracle DB as the metadata repository and transformation engine.

F. IBM Information Server
IBM Information Server (IS Datastage) [10] is a product by IBM and is well known for its services. The capabilities of the tool include data consolidation, synchronization, and distribution across disparate databases; automatic data profiling and analysis in terms of content and structure; data quality enhancement; and transformation and delivery to and from complex sources, i.e. the capability to get data from any source format and deliver it to any target, within or outside the enterprise, at the right time.
It also allows integration and information access for diverse data and content regardless of the placement of data.
With its data replication services, customer information management can be done quickly.

G. Microsoft SQL Server Integration Services
Microsoft SQL Server Integration Services (MS SSIS) [14] allows run time data transfer and management. Designed for enterprise wide application support, it provides a platform for performing ETL functions and for creating and controlling data packages. It allows script applications to be built with .NET platform support, offers increased scalability through thread pooling, and includes a more advanced import and export wizard. It also allows packages to be customized to suit specific organizational needs, supports digital signing for security and supports service oriented architecture.

IV. ETL TOOL FEATURES
Given the available span of functionality and the large number of ETL tool vendors, it is quite difficult to rank all the tools, as every tool also has some special features. Some generic behaviour has been identified by [5], on the basis of which the following comparison and graphs are made.
The following general aspects can be kept in mind when evaluating an ETL tool.

A. Architecture
When evaluating a tool with respect to architecture, aspects such as support for parallel processing, symmetric multiprocessing, massively parallel processing, clustering, load balancing and feasibility for grid computing should be considered. Support for multi-user management of ETL processes running on multiple machines and support for a common meta-model, i.e. allowing the exchange of metadata with products of the same brand and of other brands, should be considered too.

B. Functionality
Two main aspects relating to the functionality of an ETL tool are important, i.e. the metadata support and the overall functionality provided by the tool.
The main functionality concerns whether the tool is data cleansing oriented, data transformation oriented, or performs both equally. This gives a clear picture of which tool to select depending on the nature of the data that will be put into the tool. Support for direct connection to the data source for input is also an important aspect of functionality.
Support for metadata is a key aspect too. An ETL tool is also responsible for using metadata to map source data to its destination. Thus choosing a tool that conforms to the organization's metadata strategy is very important.

C. Usability
Usability is one of the important factors of any tool. The tool should be easy to use, easy to understand and fast to get used to. In this regard, the tool should have a well balanced interface and must support the typical task sequence of ETL usage.

D. Reusability
Reusability requires that the components of a data warehouse architecture constructed using the ETL tool be reusable and able to handle parameters. The tool should be capable of dividing the process into small building blocks, and should allow users to define their own functions and use these functions in the process flow.

E. Connectivity
The main aspects to consider include the native connections the tool supports, the packages it can read metadata from, the types of message queuing products the tool can connect to, the capability to graphically join tables, support for the changed data capture principle, transformation matching and address cleansing ability, as well as options for data profiling, uniqueness and distribution, etc.

F. Interoperability
Last but not least, the tool should be capable of running on a number of platforms and also on the different versions of a product.

V. ANALYSIS OF ETL TOOLS
With all the aspects discussed in Section IV in mind, an analysis of the services provided by the tools is presented hereafter. In choosing any tool, its respective aspects should be considered. The following graph based analysis provides support for the decision making. For this analysis various websites, vendors' white papers, web-blogs, comparisons and previous surveys were consulted, and the analysis was conducted on the basic set of features discussed in Section IV.
Each of the ETL tools discussed in Section III is graded with points according to the level of services supported, and the vendors are depicted in the graphs by acronyms instead of full names.

A. Architectural Aspects
Figure 1 depicts the current services supported by the tools, based on support for enterprise architecture, clustering, data separation into groups, Web based application interface support and cloud computing deployment support.
IPC and OWB are strong in architectural support, with SSIS coming up right behind.

B. ETL Functionality
Points have been given depending upon the completeness of each tool in terms of functionality. Support for data cleansing, transformation, integration services and a common metadata model are the main aspects considered. The graph is drawn by adding up the points granted to each tool depending upon the support it provides, i.e. one point for each aspect, and then adding up those points which fall into one category. The same was done for both trends, i.e. basic functionalities in 2007 and improvements till 2010.
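The grading rule described under ETL Functionality (one point for each supported aspect, summed per category, once for the 2007 baseline and once for the improvements till 2010) can be illustrated with a minimal Python sketch. The aspect names are taken from the text; the tool names, the feature sets and the helper functions `score` and `baseline_and_improvement` are hypothetical and do not reproduce the paper's data or its exact procedure.

```python
# Aspects named in the text for the functionality category.
FUNCTIONALITY_ASPECTS = (
    "data cleansing",
    "transformation",
    "integration services",
    "common metadata model",
)


def score(supported, aspects=FUNCTIONALITY_ASPECTS):
    """Return one point for every aspect in `aspects` that the tool supports."""
    return sum(1 for aspect in aspects if aspect in supported)


def baseline_and_improvement(supported_2007, supported_2010):
    """Split a tool's 2010 score into its 2007 baseline and the gain since then."""
    baseline = score(supported_2007)
    improvement = score(supported_2010) - baseline
    return baseline, improvement


# Hypothetical tools and feature sets, invented purely for illustration.
tools = {
    "Tool A": (
        {"transformation"},
        {"transformation", "data cleansing"},
    ),
    "Tool B": (
        {"transformation", "integration services"},
        {"transformation", "integration services",
         "common metadata model", "data cleansing"},
    ),
}

results = {name: baseline_and_improvement(*sets) for name, sets in tools.items()}

for name, (baseline, improvement) in results.items():
    print(f"{name}: 2007 baseline = {baseline}, improvement till 2010 = {improvement}")
```

Keeping the baseline and the improvement as separate numbers mirrors the two series shown in the functionality graph; other categories could reuse `score` with their own aspect lists.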
Figure 1. Architectural Support (chart "Architectural Aspects"; y-axis: Points; vendors: IBM IS, IPC, Talend OS, OWB, MS SSIS, BO SAP, SAS DIS, Others; legend: Web-based UI, Clustering and Job Distribution, Enables SOA, Deploy in Cloud Option)

Figure 2. Functionality (chart "ETL Functionality Provided"; y-axis: Points; x-axis: Vendors; legend: 2007, 2010 improvement)

Figure 3. Usability (chart "Ease Of Use"; y-axis: Points; x-axis: Vendors; legend: Original 2007, Improvement 2010)

Figure 4. Reusability (chart "Reusability"; y-axis: Points; vendors as in Figure 1; legend: Reusable service Repository, Split Data Streams, Data Partitioning, Automatic Documentation)

C. Usability
Figure 3 covers all the points granted to a tool on the basis of an easy to use, well designed and balanced interface. What you see is what you get (WYSIWYG) behaviour and task compatibility are further bases for grading. Points accumulate according to the presence of the services necessary for ease of use and understanding. The ease of training new users to become used to the interface is also part of the criterion.

D. Reusability
Figure 4 depicts a comparison and point grading on the basis of the reusability factors supported: the capability of data stream splitting, automatic documentation, and support for defining user defined functions and using them in the process flow.

E. Connectivity
Connectivity, as the name indicates, is calculated by aggregating the points granted to a tool on the following aspects: the total number of sources which can be read without any additional middleware, the enterprise applications supported by the tool, the platforms it can run on and, last but not least, the support for messaging (i.e. real time data handling).

F. Interoperability
The support of various platforms is shown in detail in Figure 6. Here all Windows versions are treated as a single platform, as are all Linux versions, while UNIX variants are counted separately.
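The platform-counting convention stated under Interoperability (Windows versions counted once, Linux versions counted once, UNIX variants counted individually) can be sketched in Python as follows. The function names, the sample platform strings and the assumption of one point per platform family are hypothetical and only illustrate the convention; they are not the paper's actual scoring procedure.

```python
def platform_family(name):
    """Collapse Windows and Linux releases into one family each.

    UNIX variants (e.g. HP-UX, Sun Solaris) are returned unchanged so that
    each of them is counted separately.
    """
    lowered = name.lower()
    if lowered.startswith("windows"):
        return "Windows"
    if "linux" in lowered:
        return "Linux"
    return name


def interoperability_points(platforms):
    """Assumed rule: one point per distinct platform family."""
    return len({platform_family(p) for p in platforms})


# Hypothetical platform list for an unnamed tool.
supported = [
    "Windows XP",
    "Windows Server 2003",
    "Red Hat Linux",
    "SUSE Linux",
    "HP-UX",
    "Sun Solaris",
]

points = interoperability_points(supported)
print(points)
```

With this list the two Windows releases and the two Linux distributions each collapse into a single family, while HP-UX and Sun Solaris remain distinct.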
Figure 5. Connectivity (chart "Connectivity"; y-axis: Points; x-axis: Vendors (IBM IS, IPC, Talend OS, OWB, MS SSIS, BO SAP, SAS DIS, Others); legend: Platforms, Data Sources, Packages, Messages)

Figure 6. Interoperability (chart "Interoperability"; vendors as in Figure 5; legend: Windows, Linux, Sun Solaris, HP-UX, IBM AIX, IBM iSeries OS400, IBM zSeries MVS, HP Tru64, Open VMS)

From all the analysis conducted it is still hard to generalize which tool is the best. Although Informatica proves to be better in many features, MS SSIS and OWB have improved well over time and now keep pace with the top contenders. Overall, when considering pure ETL tools, IPC can still be ranked as the market leader, with IBM IS coming second alongside Talend OS. However, when it comes to DB-integrated tools, OWB and SSIS follow IPC directly. Thus one should be careful in selecting a tool, as it may not be the best for the organization just because of the vendor's name. The capabilities of the tool should be reviewed before selection.

VI. CONCLUSION
Important data in most organizations is underutilized just because it exists in different formats and in various resources. Data warehouses (DWs) are complex systems holding consolidated data, with the main objective of assisting knowledge workers in the decision making process. The key components of DWs are the Extraction-Transformation-Loading (ETL) processes. The goal of this paper is to elaborate on the ETL process and its importance to data warehouses, and to provide a comparison based on some generalized criteria to find the suitability of a tool for a certain category of consumers. The paper provides a brief overview of the ETL tools available in the market, specifies some key points for generalizing the capabilities provided by a tool, and uses graph based analysis on a grade point scale to grade the selected tools. Together this provides a comparison of the available tools in terms of the features they offer, helping an organization choose the tool that will best suit its needs.

REFERENCES
[1] T.Y. Wah, H. Peng, and C.S. Hok, Building Data Warehouse, Proc. 24th South East Asia Regional Computer Conference, November 18-19, 2007, Bangkok, Thailand.
[2] Tho, M. Njuyen, Tjoa, A. Min, Zero-Latency Data Warehousing for Heterogeneous Data Sources and Continuous Data Streams, Institute of Software Technology and Interactive Systems, Favoritenstr. 9-11/188, 2003.
[3] T. Jörg, S. Dessloch, Near Real-Time Data Warehousing Using State-of-the-Art ETL Tools, University of Kaiserslautern, 67653 Kaiserslautern, Germany, 2009.
[4] Passionned, The BI Tool survey report, 2008.
[5] Passionned, ETL Tools survey report, 2009.
[6] J. Levin, ETL Tools Comparison, March 2008.
[7] Dr. R. Chillar, B. Kochar, Extraction Transformation Loading: A Road to Data warehouse, 2nd National Conference Mathematical Techniques: Emerging Paradigms for Electronics and IT Industries.
[8] Guide to Data Warehousing and Business Intelligence, available at http://data-warehouses.net/architecture/etlprocess.html.
[9] Pervasive Systems, Extraordinarily Flexible ETL Platform, http://www.pervasiveintegration.com/scenarios/Pages/etl_tools_data_aggregation.aspx.
[10] Adeptia Incorporation, ETL Vendors Comparison, available at http://www.adeptia.com/products/etl_vendor_comparison.html.
[11] Guide to Data Warehousing and Business Intelligence, Architectural Overview, available at http://data-warehouses.net/architecture/overview.html.
[12] ETL tools Survey, available at http://www.etltool.com/what-is-etl.htm.
[13] Oracle Warehouse Builder 11g, A technical overview, at http://www.oracle.com/technology/products/warehouse/index.html.
[14] ETL data warehouse concepts, available at http://etl-information.blogspot.com/2007_07_01_archive.htm