Whitepaper Series
Author: Branislav Repcek, Javlin
Designing Data Applications for Large Enterprises
Abstract
This paper presents a design methodology suitable for applications which
process large amounts of data in an enterprise environment. For a reliable
enterprise approach, it is very important that applications be robust, scalable, and
extensible. To ensure these requirements are met, we propose an improved
architecture based on a multi-tier application design commonly used for large
scale applications.
Particular attention is given to the issue of managing data architectures for the
future – a specific way of designing such applications with extensibility and
standardization in mind.
In this context, the DI layer is the only layer which contains the implementation
of the business logic. Its main purpose is to host all ETL components of the
system and provide a clear separation between these ETL parts and the rest of the
application.
This separation means that all the data has to pass through the data integration
layer during processing. This offers many advantages, such as database
independence, the ability to access multiple different data sources, rapid
development, and data exports in different formats.
DI Layer Implementation
Since the data integration layer becomes the core of the application, it is
important to carefully design how it interacts with the rest of the application.
There are several key questions which need to be answered before the DI layer
software is selected and implementation can start.
The complexity of the business logic has a significant impact on the software
package which is used as the DI layer core. Many data management applications can
be very simple – they basically convert one data format to another and may
perform some light processing.
However, it is often required to validate the data against other systems
(especially for manually entered data) or to run various processes depending on
the data. These requirements place additional constraints on the ETL package,
which must support all the intended business logic features so that there is a
minimal amount of code outside of the ETL core.

Figure 1: Basic application architecture with data integration layer (thin
client, DI layer, any platform).
Interactive applications place further demands: users expect to see direct
feedback in their client. Interactivity also requires a different business logic
design.
Fault tolerance and scalability requirements might not directly impact the
overall design, but they certainly impact the cost of the whole solution. Most
ETL packages offer higher-tier versions with support for clustering, load
balancing and so on, but these versions often cost significantly more than the
basic versions. It is often very easy to migrate to a higher version if required, and
therefore, it might be possible to start small and expand the application as the
need arises.
Advantages
Using the DI layer in an application offers multiple advantages with business
logic separation being the most prominent.
Rapid Development
Since the main job of data integration software is to transform data using
workflow-supporting operations, the expressive power of built-in languages or
(visual) components tends to be well suited to designing business rules. It is
often much easier to prepare transformations in such an application than it is to
write them directly in languages like SQL, C#, or Python.
Such separation from low-level coding often allows business analysts to work
directly with the business logic, therefore allowing for rapid prototyping of
future rules and enhancements with a very quick result turnaround. Seeing the
transformation in action is often just a click of a button.
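As a rough illustration of why such tools feel faster than hand-written code, business rules can be kept as declarative data and applied by a small generic engine; the field names and the exchange rate below are made up for the example.

```python
# Each rule is (output field, function of the input record); an analyst can add
# or change a rule without touching the engine below.
RULES = [
    ("full_name", lambda r: f"{r['first']} {r['last']}"),
    ("amount_usd", lambda r: round(r["amount_eur"] * 1.1, 2)),  # illustrative rate
]

def apply_rules(record, rules=RULES):
    """Generic engine: apply every rule to a record, keeping the original fields."""
    out = dict(record)
    for field, fn in rules:
        out[field] = fn(record)
    return out
```

Running a changed rule set against sample data gives the quick turnaround described above: the transformation is re-applied with one call.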
Data Access
It is often the case that a single company uses dozens of different file formats
or data structures, even within a single application. The ability to write the
core transformations independently of the data input or output method is
crucial: it saves considerable development resources and allows for a cleaner
design and easier integration within an existing pipeline.
Finally, hiding platform-specific details from the business logic itself allows for
easier future-proofing of an application – migration, data format changes or
platform updates are all often just a configuration change.
Data Quality
Modern data integration platforms offer a wide range of data validation and
cleansing tools. The offerings range from trivial validation (date or number
formats), through referential integrity validation, to complex tools offering, for
example, address validation or web service integration.
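A minimal sketch of such checks, assuming illustrative field names and master data, might look like this:

```python
from datetime import datetime

def validate(record, known_customer_ids):
    """Return a list of data-quality errors for one record."""
    errors = []
    # Trivial format validation: the date must be ISO formatted (YYYY-MM-DD).
    try:
        datetime.strptime(record["date"], "%Y-%m-%d")
    except (KeyError, ValueError):
        errors.append("invalid or missing date")
    # Trivial format validation: the amount must parse as a number.
    try:
        float(record["amount"])
    except (KeyError, ValueError):
        errors.append("invalid or missing amount")
    # Referential integrity: the customer must exist in the master data.
    if record.get("customer_id") not in known_customer_ids:
        errors.append("unknown customer_id")
    return errors
```

Real platforms package such checks as reusable components; the point here is only that format checks and referential checks are separate, composable rules.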
Enterprise Features
Data integration systems often offer a full set of enterprise features such as
automation, workflow, logging, or monitoring.
For high availability and high performance, many tools integrate with fail-over
systems and are able to run in clusters with load-balancing. Some of the tools
offer even more advanced features like distributed processing or auto-scaling.
Considerations
As with other solutions, this architecture raises some considerations when
deciding whether it is suitable for a given enterprise environment.
The first consideration is the cost of the data integration tools. Historically,
larger enterprise solutions cost hundreds of thousands of dollars, both in license
and maintenance fees. With the advent of lower cost and robust solutions, this
has been mitigated, but life-cycle cost is clearly a consideration.
The price needs to be specifically considered when thinking about the future of
the application. Most applications will slowly grow with their data volumes.
Data Integration tools often offer several tiers with lower tiers suitable for
smaller projects and higher tiers with more features for large-scale projects.
This allows every customer to select what is best and upgrade when necessary.
A higher initial cost will usually pay off quickly, as running costs remain quite
stable even as data volumes increase.
Real-life Example
We will demonstrate the usability of the proposed architecture on a data
management application built for a financial services customer.
Some of the documents had quite complicated layouts (more than
200 columns) and thousands of rows. The tasks performed by the
employees handling these documents ranged from simple data pairing
(customer and transactions) to more complicated financial calculations and
estimates.
An additional request was that the application had to support loading and
exporting the data in MS Excel format. This was required since not all
companies had been able to switch to a different transport format. The business
logic therefore uses the database internally, while import and export are also
supported in MS Excel formats.
The above requirements are a perfect fit for CloverETL Server, which was
selected as the core of the DI layer of this application. Since CloverETL runs on
all reasonable implementations of Java, it is able to support IBM AS/400 which
was the platform preferred by our customer. The resulting high-level
architecture diagram of the whole application is presented in Figure 2.
Platform Independence
During the development it proved beneficial to have the ability to quickly deploy
and test new versions of the application. Since we had a few free virtual machines
running Windows and Linux guest systems, we decided to use some of
them for the development.
When the server was ready, migration of the graphs was as simple as copying
files over and adjusting the configuration for the new environment.
Business Logic
processing.
The only configuration required for the business logic of the application is the
mapping between graphs and data types. This makes it easy to deploy new
data types since only a configuration update in the database and in the
CloverETL Server is required.
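The mapping itself can be as simple as a small lookup table; the graph file names below are hypothetical stand-ins for the real configuration:

```python
# Hypothetical mapping between data types and transformation graphs; deploying
# a new data type only requires a new entry here plus the graph itself.
GRAPH_BY_DATA_TYPE = {
    "customers": "graphs/import_customers.grf",
    "transactions": "graphs/import_transactions.grf",
}

def graph_for(data_type):
    """Resolve the graph that handles a given data type."""
    try:
        return GRAPH_BY_DATA_TYPE[data_type]
    except KeyError:
        raise ValueError(f"no graph configured for data type {data_type!r}") from None
```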
The input and output components even allowed us to quickly switch input and
output data formats during the development. This was especially useful since
the complete database layout for data input had not been finalized before the
development started, but had been prepared by our customer over the course of
the development. Therefore, we tested the first versions of the graphs by using MS
Excel files as input and output. When the database was ready, we simply switched
the input/output from MS Excel to database components and tweaked the graphs for
better performance (since it is possible to run complex queries in the database,
while the same is not possible in MS Excel worksheets).
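That switch can be sketched as component selection driven purely by configuration; a CSV reader stands in for the MS Excel component here, and all names are illustrative:

```python
import csv
import io
import sqlite3

def make_source(config):
    """Pick the input component from configuration; downstream logic gets the
    same record dictionaries regardless of the component chosen."""
    if config["kind"] == "csv":
        # Stand-in for the spreadsheet reader used early in development.
        return list(csv.DictReader(io.StringIO(config["data"])))
    if config["kind"] == "db":
        rows = config["conn"].execute("SELECT name FROM people ORDER BY name")
        return [{"name": name} for (name,) in rows]
    raise ValueError(f"unknown input kind {config['kind']!r}")

# One source per environment; only the configuration differs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany("INSERT INTO people VALUES (?)", [("Alan",), ("Ada",)])
csv_source = make_source({"kind": "csv", "data": "name\nAda\nAlan\n"})
db_source = make_source({"kind": "db", "conn": conn})
```

The business logic receives the same record dictionaries either way, so changing the input format is a configuration change rather than a code change.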
As can be seen from Figure 2, there are multiple interfaces between different
components in the application.
Communication between users and the web server is done over HTTP. The web
front-end itself is written in PHP with AJAX parts for some additional
functionality.
The output sent from CloverETL Server to the directory with XLS files uses
simple disk-based access – the target directory is simply mounted on the
AS/400 server.
The most interesting interface is between the web server and CloverETL Server.
There are a few requirements such an interface must meet before it can be used.
Web Interface
The basic design of the web interface for our application was quite
straightforward. Since its main use is maintenance of the master data, the main
part of the user interface is a table which represents the data in the database.
Of course, to make the end user's life easier, the application used AJAX and
extensive scripting to make it as comfortable as possible.
Since our application required interactive data processing (for importing and
exporting of the data), it had to be able to talk to CloverETL Server as well as to
the database which stored the configuration, user accounts and the data it was
displaying.
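For illustration, the front-end's call to launch a transformation reduces to building a parameterized HTTP request; the endpoint path and parameter names below are assumptions, not the actual CloverETL Server API:

```python
from urllib.parse import urlencode

def build_graph_run_url(server, graph, params):
    """Compose the URL a web front-end would request to launch a graph run
    (hypothetical endpoint; a real server exposes its own documented API)."""
    query = urlencode({"graph": graph, **params})
    return f"{server}/api/graph_run?{query}"
```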
Performance
For any application which allows users to work directly with the data and run
transformations in interactive mode, it is very important that these
transformations run as fast as possible to minimize the time users have to wait
for an operation to finish. Since the data sets we worked with were relatively
large, it was important to keep the graphs as fast as possible. This was
accomplished by designing the graphs so that they do not perform any
unnecessary operations – e.g. they only sort when necessary and keep queries to
the database to a minimum to avoid latency issues.
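One such optimization, replacing per-row lookups with a single batched query, can be sketched like this (table and column names are illustrative):

```python
import sqlite3

def pair_transactions(transactions, conn):
    """Resolve customer names with one batched query instead of one query per
    transaction row, keeping database round-trips (and latency) to a minimum."""
    ids = sorted({t["customer_id"] for t in transactions})
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, name FROM customers WHERE id IN ({placeholders})", ids)
    names = dict(rows)
    return [dict(t, customer_name=names.get(t["customer_id"]))
            for t in transactions]

# Illustrative master data and transactions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("C1", "Ada"), ("C2", "Alan")])
paired = pair_transactions([{"customer_id": "C1", "amount": 5},
                            {"customer_id": "C2", "amount": 7}], conn)
```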
In the end, data transfers between the database and CloverETL, as well as to the
browsers, turned out to be a bottleneck for the application. For users, the
performance of the client application is the most visible. Since the data layouts
are quite complex (up to 250 columns), the resulting page with the data can be
several megabytes even when only a hundred or so lines are displayed. This can
be mitigated somewhat by switching to a faster browser, and we expect it to
become less of a problem as browser technology improves and our clients
migrate to newer versions.
Conclusion
We have presented an improved multi-tier application architecture with a data
integration layer. Such an architecture is suitable for applications which process
large amounts of data in different formats.
The separation of business logic from the presentation layer and from the data
storage provides a flexibility which would be hard to achieve if the logic were
coded directly in the front-end or back-end as is usual for many similar
applications. Even if an ETL tool is not used to implement the business logic,
such separation is beneficial, since it allows for faster development, testing and
deployment of the application as more teams can work in parallel.
Javlin
Javlin is a premier provider of data integration software
and solutions. Its leading software platform, CloverETL,
provides users the ability to manage data solutions such
as integration, migration, cleansing, audit,
synchronization, consolidation, Master Data
Management and Data Warehousing. In addition to
development of data integration products, Javlin offers
software solutions, custom software development and
data integration consulting services.
www.javlininc.com
www.javlin.eu
CloverETL
CloverETL software is platform independent and scalable
with a smooth upgrade path. It is also easily embeddable
thanks to its small footprint. The CloverETL OEM
foundation program also provides a way to embed ETL in
applications.
www.cloveretl.com