
Javlin Whitepaper Series

Designing Data Applications for Large Enterprises
The power of the data integration layer

Author
Branislav Repcek
Javlin

Abstract
This paper presents a design methodology suitable for applications which
process large amounts of data in an enterprise environment. For a reliable
enterprise approach, it is very important that applications be robust, scalable,
and extensible. To ensure these requirements are met, we propose an improved
architecture based on the multi-tier application design commonly used for
large-scale applications.

The key to this approach is an additional application layer which separates
business logic from the presentation and storage layers. By using standardized
tools and frameworks it is possible to develop large applications very quickly
and efficiently. Modern ETL frameworks allow us to design the business logic in
a visual language which is easier to work with than logic hard-coded in the
application’s backend in more traditional languages (e.g. PL/SQL, Java, or PHP).
The resulting application is more flexible and allows us to react quickly to the
typical data management issues caused by fast-changing requirements on
business rules in a growing and unstructured environment.

The validity of this design approach is demonstrated on a real-world application
which has been designed and implemented according to the presented
architecture.




Data Management Challenges


Today, every company and enterprise is compelled to manage huge amounts of
data that impact their business on a daily basis – data that comes from different
systems and is stored in a multitude of formats. Over time, not only do the data
formats and volumes grow, but the complexity of the business logic increases
dramatically. This is where the major challenge in data management begins.

Many data processing applications start as a simple spreadsheet with embedded
macros or formulas. This might be adequate for small companies and their
simple business logic. However, as the data volumes increase, this approach
quickly becomes hard to manage and expensive due to significant overhead and
manual work. Further, sometimes the business logic or understanding itself is
dependent on the tool, not the other way around.

Surprisingly, such practices are quite common, even in medium or large
companies. This is often true for companies which integrate many different
customers with just slightly different data formats (e.g. different banks or
insurance companies who store very similar data). Using unsuitable tools for
data management increases the administrative effort required to keep data
up-to-date and correct.

Usually after the data management becomes unbearable, companies seek
solutions with better integration, higher automation, and easier maintenance
and extensibility. A very common approach is to build a custom application
(often web-based) using a database as data storage. Even though some
companies experience a measure of success with this, the resulting applications
are often hard to maintain and extend, since they are unique and represent a
non-standard solution. Further, the business logic is often hard-coded, which
means increased life cycle cost: the processes will surely change over time and
the applications will require re-writes. This demands an in-house development
team (or often, the one person who knows the code) to protect the data
architecture. What we propose, then, is a more flexible and elegant solution to
the issue of managing data architectures for the future – a specific way of
designing such applications with extensibility and standardization in mind.

This is accomplished by adding an additional layer to the architecture of the
application – an independent data integration layer.

The Data Integration Layer


The data integration layer separates data and business logic from the actual
application, which serves as the user interface to display the data. The basic
architecture of such an application consists of three distinct layers:

 Thin client – either a stand-alone or web-based application.
 Data integration (DI) layer.
 Data storage – any platform, often a database, file system, or even an
enterprise service bus.

In this context, the DI layer is the only layer which contains the implementation
of the business logic. Its main purpose is to host all ETL components of the
system and provide a clear separation between these ETL parts and the rest of
the application.
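To make this separation concrete, here is a minimal sketch in Java of how the
three layers might relate. The interface and class names are hypothetical – they
stand in for whatever thin client, ETL engine, and storage platform are actually
used – and a real ETL product provides this structure out of the box.

import java.util.List;
import java.util.Map;

// Contract the thin client sees: no business logic, no storage details.
interface DataIntegrationLayer {
    // Run a named transformation with user-supplied parameters.
    List<Map<String, Object>> runTransformation(String name, Map<String, String> params);
}

// Storage abstraction hidden behind the DI layer ("any platform").
interface DataStore {
    List<Map<String, Object>> read(String dataSet);
    void write(String dataSet, List<Map<String, Object>> records);
}

// The DI layer is the only place where business rules are implemented.
class EtlLayer implements DataIntegrationLayer {
    private final DataStore store;

    EtlLayer(DataStore store) {
        this.store = store;
    }

    @Override
    public List<Map<String, Object>> runTransformation(String name, Map<String, String> params) {
        List<Map<String, Object>> records = store.read(params.get("input"));
        // ... validation and business rules would be applied here ...
        store.write(params.get("output"), records);
        return records;
    }
}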

This separation means that all the data has to pass through the data integration
layer during processing, so every data operation can take advantage of what the
DI layer offers. Most notably, this includes database independence, the ability to
access multiple different data sources, rapid development, and data exports in
different formats, to name a few.

DI Layer Implementation
Since the data integration layer becomes the core of the application, it is
important to carefully design how it interacts with the rest of the application
and which software package is used. Data management often involves
integration of multiple systems together to validate the data, and can contain
business logic which directly operates on the processed data. This makes it a
perfect place for an off-the-shelf ETL (Extract-Transform-Load) package to
handle all the heavy data processing.

There are several key questions which need to be answered before the DI layer
software is selected and implementation can start:

 What is the complexity of the business logic behind the application?
 Is the application interactive, or does it process data on a schedule?
 Which communication protocols are used by data sources and data
sinks?
 What level of fault tolerance and scalability is required?

The complexity of the business logic has a significant impact on the software
package which is used as the DI layer core. Many data management applications
can be very simple – they basically convert one data format to another and may
perform some light processing. However, it is often required to validate the data
against another system (especially for manually entered data) or to run various
processes depending on the data. These requirements place additional
constraints on the ETL package, which must support all the intended business
logic features so that a minimum amount of code lives outside of the ETL core.

Figure 1: Basic application architecture with the data integration layer – a thin
client on top of the DI layer, which in turn sits on top of the data storage (any
platform).

Interactive applications usually require a more complex design, as users expect
the application to behave well even if an error occurs. Applications running in a
batch scheduled mode can simply send an email to the operator or to a ticketing
system, but such behavior is probably not acceptable to users who expect to see
direct feedback in their client. Interactivity also requires a different business
logic design.

The communication with other components in the system might place
additional constraints on the ETL tool. However, with modern tools this
becomes less of an issue, as they typically support many different protocols. It is
often quite easy to access databases, FTP or HTTP servers, or to use more
complex communication protocols like MQ.

Fault tolerance and scalability requirements might not directly impact the
overall design, but they certainly impact the cost of the whole solution. Most
ETL packages offer higher-tier versions with support for clustering, load
balancing and so on, but these versions often cost significantly more than the
simple ones. It is usually easy to migrate to a higher version when required, so it
might be possible to start small and expand the application as the need arises.

Advantages
Using the DI layer in an application offers multiple advantages, with business
logic separation being the most prominent.

Business Logic Separation

In many applications, business operations can become very complex due to
many different data formats (e.g. customer database, orders, payments, etc.)
and multiple validation and data cleansing steps.

Separating the layer which runs business logic is a must: it offers a clear
advantage in better maintainability and extensibility. Depending on the tool
chain used, deployments of updated or new business rules can be very simple,
even across multiple servers. This is where the efficiency of the DI layer saves
time and money in the development life cycle of enterprise business operations.




Rapid Development

State-of-the-art data integration platforms offer a user-friendly client
application with a visual interface for authoring data transformations. This
greatly assists the rapid development needs of the enterprise.

Since the main job of data integration software is to transform data into usable,
workflow-supporting form, the expressive power of the built-in languages or
(visual) components tends to be well suited for designing business rules. It is
often much easier to prepare transformations in such an application than to
write them in programming languages like SQL, C#, or Python.

Such separation from low-level coding often allows business analysts to work
directly with the business logic, allowing for rapid prototyping of future rules
and enhancements with very quick turnaround. Seeing the transformation in
action is often just a click of a button.

Data Access

Data integration platforms such as CloverETL offer several connectors or
components which allow reading and writing data in different formats. These
often include connectors to different database engines (e.g. Oracle, MySQL,
MSSQL), file formatters (simple CSV, MS Excel, XML), and different file transfer
protocols (HTTP, FTP, or the local file system).

In addition, it is often the case that a single company uses dozens of different
file formats or data structures within a single application. The ability to write
the core transformations independently of the data input or output method is
crucial, as it saves considerable development resources and allows for a cleaner
design and easier integration within an existing pipeline. The sketch below
illustrates the idea.
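As a hedged illustration of this independence, the following Java sketch
separates the record source from the core transformation; a CSV file and a
database table would simply be two implementations of the same interface. The
names here are made up for the example – a real DI platform supplies such
connectors as ready-made components.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

interface RecordSource {
    Stream<String[]> records() throws IOException;
}

// CSV-backed source; a JDBC-backed one would implement the same interface.
class CsvSource implements RecordSource {
    private final Path file;

    CsvSource(Path file) {
        this.file = file;
    }

    @Override
    public Stream<String[]> records() throws IOException {
        return Files.lines(file).map(line -> line.split(","));
    }
}

class CoreTransformation {
    // The business rule stays the same no matter where records come from.
    List<String[]> normalize(RecordSource source) throws IOException {
        try (Stream<String[]> in = source.records()) {
            return in.map(r -> { r[0] = r[0].trim().toUpperCase(); return r; })
                     .collect(Collectors.toList());
        }
    }
}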

Finally, hiding platform-specific details from the business logic itself allows for
easier future-proofing of an application – migrations, data format changes, or
platform updates are often just a configuration change.




Data Quality

Data quality and consistency is one of the most important concerns for any
business-critical data. With the increasing complexity of the business rules and
of the systems which process the data, any errors in the input can have a
significant impact on revenue and operating costs.

Modern data integration platforms offer a wide range of data validation and
cleansing tools. The offerings range from trivial validation (date or number
format), through referential integrity validation, to complex tools offering, for
example, address validation or web service integration.
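The simplest of these checks are easy to picture in code. The Java sketch below
shows the two classes of validation mentioned above – format checks and
referential integrity – with hypothetical field names; in practice a DI platform
provides such validators as configurable components.

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Set;

class RecordValidator {
    private static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    private final Set<String> knownCustomerIds; // loaded from the master table

    RecordValidator(Set<String> knownCustomerIds) {
        this.knownCustomerIds = knownCustomerIds;
    }

    // Trivial format validation: is the value a well-formed date?
    boolean isValidDate(String value) {
        try {
            LocalDate.parse(value, DATE);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Referential integrity: does the record point at a known customer?
    boolean hasValidCustomer(String customerId) {
        return knownCustomerIds.contains(customerId);
    }
}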

Enterprise Features

Data integration systems often offer a full set of enterprise features such as
automation, workflow management, logging, or monitoring.

For high availability and high performance, many tools integrate with fail-over
systems and are able to run in clusters with load-balancing. Some of the tools
offer even more advanced features like distributed processing or auto-scaling.

Through different connectors or components it is possible to connect to an ESB
(Enterprise Service Bus) and integrate the whole application into a bigger
enterprise infrastructure.

Considerations
As with other solutions, this architecture presents some considerations to
weigh before deciding whether it is suitable for a given enterprise.

The first consideration is the cost of the data integration tools. Historically,
larger enterprise solutions have cost hundreds of thousands of dollars in license
and maintenance fees. With the advent of lower-cost yet robust solutions, this
has been mitigated, but life cycle cost is clearly a consideration.




The price needs to be specifically considered when thinking about the future of
the application. Most applications will slowly grow along with their data
volumes. Data integration tools often offer several tiers, with the lower tiers
suitable for smaller projects and the higher tiers providing more features for
large-scale projects. This allows every customer to select what fits best and to
upgrade when necessary. A higher initial cost will usually pay off quickly, as the
running costs remain quite stable even as data volumes increase.

A second consideration is the requirement to support new tools which may not
be deployed elsewhere in the company. This means new support personnel
must be trained and new guidelines for tool deployment will have to be
designed, which may translate to a higher initial project investment. A return on
investment (ROI) calculation should therefore be part of the decision. One key
point is that these one-time costs in the initial phase of the project may be offset
by reduced development time for new business logic, improved data quality,
and better stability. This is especially true for applications which require very
complex business logic that can change quickly with market needs.

Real-life Example
We will demonstrate the usability of the proposed architecture on a data
management application built for a financial services customer.

The application was designed to replace an MS Excel based workflow – different
work groups in multiple companies communicated by using worksheets with a
pre-defined structure. Due to the number of customers and financial
transactions, this was proving to be very cumbersome and generated quite a
large number of errors (simple typos or copy-paste errors between different
documents).

Moreover, some of the documents had quite complicated layouts (more than
200 columns) and thousands of rows. The tasks performed by the employees
handling these documents ranged from simple data pairing (customers and
transactions) to more complicated financial calculations and estimates.




Requirements and Architecture

Several key requirements were set by our customer. The application required a
user interface, since manual updates of the master data were needed. All
changes made by the employees had to be audited to make sure that any errors
could be quickly traced back and corrected.

Apart from allowing the users to edit the data, the resulting application needed
to apply various business rules – for example, create the calendar for insurance
payments, update various invoices, and so on. The resulting business logic was
quite complex, and we therefore needed a tool which would be able to support
such operations.

Figure 2: Overall application architecture (components: users, directory, web
GUI, CloverETL Server, IBM DB2). Most components run on AS/400, with users
accessing only the web server directly and never accessing the database. Some
of the outputs are sent to an NFS share on a different server, where users can
see the files directly via a shared Windows directory. Red arrows indicate the
flow of the business data; blue arrows indicate the flow of data related to the
application itself (e.g. log-in verification, inter-component communication, etc.).

An additional request was that the application had to support loading and
exporting data in MS Excel format. This was required since not all companies
had been able to switch to a different transport format. Therefore, the business
logic would use the database internally, while import and export would be
allowed in MS Excel formats as well.

The above requirements are a perfect fit for CloverETL Server, which was
selected as the core of the DI layer of this application. Since CloverETL runs on
all reasonable implementations of Java, it is able to support IBM AS/400, which
was the platform preferred by our customer. The resulting high-level
architecture of the whole application is presented in Figure 2.




Platform Independence

During development it proved beneficial to have the ability to quickly deploy
and test new versions of the application. Since we had a few free virtual
machines running Windows and Linux guest systems, we decided to use some
of them for development.

The flexibility of CloverETL Server allowed us to quickly deploy onto any
platform without any substantial configuration. This allowed us to start
development even before the customer was able to provide a dedicated AS/400
test environment for this project.

When the server was ready, migration of the graphs was as simple as copying
files over and adjusting the configuration for the new environment.

Business Logic

Keeping with the philosophy of a separate DI layer, the whole business logic is
implemented as a set of CloverETL transformations. The requirements set by
our customer resulted in a very complex business logic with more than 40
transformation graphs and quite a bit of code in custom transformations.

Most of the graphs work together in pairs – one input and one output graph
(see Figure 3). Two or more such pairs are usually grouped together into bigger
groups, which usually contain a graph pair for historical data processing and a
pair for change processing.

Figure 3: Example input (top) and output (bottom) transformation graphs
which together define part of the business logic for one data type. Green boxes
are reader components (e.g. file or database input); blue boxes are writer
components, while dark yellow boxes are various transformations (e.g. sort,
join, custom transformations). Data flow is represented by the lines connecting
the components and, by convention, runs from left to right.




Since everything is implemented as a graph, it was trivial to change various
parts of the business logic while the application was in development.

The only configuration required for the business logic of the application is the
mapping between graphs and data types. This makes it easy to deploy new data
types, since only a configuration update in the database and in the CloverETL
Server is required. A sketch of such a mapping follows.
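A minimal sketch of what such a mapping could look like, assuming a simple
properties file keyed by data type. The property names and graph paths are
hypothetical; in the real application the mapping lives in the database and in the
CloverETL Server configuration.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

class GraphRegistry {
    private final Properties mapping = new Properties();

    GraphRegistry(Path configFile) throws IOException {
        try (InputStream in = Files.newInputStream(configFile)) {
            // e.g.  invoices.input=graphs/invoices_in.grf
            //       invoices.output=graphs/invoices_out.grf
            mapping.load(in);
        }
    }

    // Resolve the graph for a data type; adding a type is a config change only.
    String graphFor(String dataType, String direction) {
        String graph = mapping.getProperty(dataType + "." + direction);
        if (graph == null) {
            throw new IllegalArgumentException("No graph configured for " + dataType);
        }
        return graph;
    }
}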

The input and output components even allowed us to quickly switch input and
output data formats during development. This was especially useful since the
complete database layout for data input had not been finalized before
development started, but was prepared by our customer over the course of the
development. Therefore, we tested the first versions of the graphs using MS
Excel files as input and output. When the database was ready, we simply
switched the input/output from MS Excel to database components and tweaked
the graphs for better performance (since it is possible to run complex queries in
the database, while the same is not possible with MS Excel worksheets).

Communication between Components

As can be seen from Figure 2, there are multiple interfaces between different
components in the application.

Since CloverETL Server is a Java application, it is natural to use JDBC for
communication between the database and CloverETL. This choice allows for
relatively painless migration to a different database provider – a simple driver
change in the configuration is usually enough.
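A minimal sketch of configuration-driven JDBC access follows; the connection
properties are hypothetical, and the point is that migrating to another database
changes the URL and driver in the configuration, not the code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

class MasterDataDao {
    private final Properties config; // e.g. loaded from an external file

    MasterDataDao(Properties config) {
        this.config = config;
    }

    int countRows(String table) throws SQLException {
        // jdbc.url might be jdbc:db2://host:446/MASTER or jdbc:mysql://...
        String url = config.getProperty("jdbc.url");
        String user = config.getProperty("jdbc.user");
        String password = config.getProperty("jdbc.password");

        // Table name comes from trusted configuration, not from user input.
        try (Connection conn = DriverManager.getConnection(url, user, password);
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM " + table)) {
            rs.next();
            return rs.getInt(1);
        }
    }
}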

Communication between users and the web server is done through the HTTP
protocol. The web front-end itself is written in PHP, with AJAX parts for some
additional functionality.

The output which is sent from CloverETL Server to the directory with XLS files
uses simple disk-based access – the target directory is simply mounted on the
AS/400 server.




The most interesting interface is the one between the web server and CloverETL
Server. There are a few requirements such an interface must meet before it can
be used:

 It needs to be easily accessible from web applications.
 It needs to be asynchronous, so that the user can send a request for
data processing without having to actively wait for it to finish.
 It needs to allow sending parameters to the graphs to control the
data transformations.
 It should provide a way of sending back information about the
status of the transformation (whether it succeeded or failed).
 It should provide a way of sending back the output of the
transformation.

To fulfill the above requirements, CloverETL Server implements the Launch
Service, which provides an easy-to-integrate interface designed specifically for
use in web-based applications. The communication protocol is based on
standard HTTP, which is the norm for communication between web
applications. Each request, including its parameters, can be encoded as a URL
which is decoded by the CloverETL Server. The response is sent back via HTTP
as well, and can contain any data – simple CSV, XML, or even an XLS document
if required. Since the data is sent via HTTP, the calling web application can
decide what to do with the response – it can be processed further, or a file save
dialog can be shown to the user.
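From the calling application’s point of view, invoking such a service is just an
HTTP request with URL-encoded parameters. The following Java sketch shows
the request/response shape in its simplest, synchronous form; the base URL,
service name, and parameter names are illustrative, not the actual CloverETL
Server URL format (our front-end did the equivalent from PHP).

import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class LaunchServiceClient {
    private final HttpClient http = HttpClient.newHttpClient();

    // Encode the request parameters into the URL and fetch the response body.
    String runGraph(String baseUrl, String service, String dataType)
            throws IOException, InterruptedException {
        String url = baseUrl + "/" + service
                + "?dataType=" + URLEncoder.encode(dataType, StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Transformation failed: HTTP " + response.statusCode());
        }
        return response.body(); // e.g. CSV or XML produced by the graph
    }
}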

Web Interface

The basic design of the web interface for our application was quite
straightforward. Since its main use is maintenance of the master data, the main
part of the user interface is a table which represents the data in the database.

Of course, to make the life of the end user easier, the application uses AJAX and
extensive scripting to be as comfortable as possible.

Since our application required interactive data processing (for importing and
exporting the data), it had to be able to talk to CloverETL Server as well as to
the database which stored the configuration, the user accounts, and the data it
was displaying.

To connect to CloverETL Server we used the Launch Services as described
above. All the graphs which run in interactive mode need only a few
parameters, which were easy to pass in the URL.

To connect to the database we used the functions provided directly by PHP.
Since the application did not need to execute any complex queries, this did not
prove to be much of a problem. The only problematic part was caused by
incompatibilities between the SQL syntax of DB2 on AS/400 and the Linux
version. This was resolved by using a library which allowed us to switch the
target database via a simple configuration change.

Performance

For any application which allows users to work directly with the data and run
transformations in interactive mode, it is very important that these
transformations run as fast as possible, to minimize the time users have to wait
for an operation to finish. Since the data sets we worked with were relatively
large, it was important to keep the graphs as fast as possible. This was
accomplished by designing the graphs so that they do not perform any
unnecessary operations – e.g. only sort when necessary, keep the number of
queries to the database to a minimum to avoid latency issues, and so on.
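As a hedged example of keeping database queries to a minimum, the Java sketch
below resolves a whole batch of keys in one round-trip instead of issuing one
query per record; the table and column names are hypothetical.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CustomerLookup {
    // One IN (...) query for the whole batch instead of N single-row queries.
    Map<String, String> namesFor(Connection conn, List<String> customerIds) throws SQLException {
        Map<String, String> result = new HashMap<>();
        if (customerIds.isEmpty()) {
            return result;
        }

        StringBuilder sql = new StringBuilder("SELECT id, name FROM customers WHERE id IN (");
        for (int i = 0; i < customerIds.size(); i++) {
            sql.append(i == 0 ? "?" : ",?");
        }
        sql.append(")");

        try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
            for (int i = 0; i < customerIds.size(); i++) {
                ps.setString(i + 1, customerIds.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.put(rs.getString("id"), rs.getString("name"));
                }
            }
        }
        return result;
    }
}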

In the end, data transfers between the database and CloverETL, as well as to the
browsers, turned out to be the bottleneck of the application. For users, the
performance of the client application is the most visible. Since the data layouts
are quite complex (up to 250 columns), the resulting page with the data can be
several megabytes even when only a hundred or so lines are displayed. This
could be mitigated a bit by switching to a faster browser, and we expect
improvements in browser technology to make this less of a problem as our
clients migrate to newer versions.




Conclusion
We have presented an improved multi-tier application architecture with a data
integration layer. Such an architecture is suitable for applications which process
large amounts of data in different formats.

We proved the practicality of this architecture by implementing and deploying
a data management application for our customer. We showed that the
architecture allows for rapid application development and that, with the right
core software choices, the whole application can easily be ported to a different
platform.

The separation of business logic from the presentation layer and from the data
storage provides a flexibility which would be hard to achieve if the logic were
coded directly in the front-end or back-end, as is usual for many similar
applications. Even if an ETL tool is not used to implement the business logic,
such separation is beneficial, since it allows for faster development, testing, and
deployment of the application, as more teams can work in parallel.




Javlin
Javlin is a premier provider of data integration software
and solutions. Its leading software platform, CloverETL,
provides users the ability to manage data solutions such
as integration, migration, cleansing, audit,
synchronization, consolidation, Master Data
Management and Data Warehousing. In addition to
development of data integration products, Javlin offers
software solutions, custom software development and
data integration consulting services.

www.javlininc.com

www.javlin.eu

CloverETL
CloverETL software is platform independent and scalable
with a smooth upgrade path. It is also easily embeddable
thanks to its small footprint. The CloverETL OEM
foundation program also provides a way to embed ETL in
applications.

www.cloveretl.com

