Application level: Allows simultaneous access and timely insights for all
your users across all your data, irrespective of the processing engine. This
is possible because Hadoop 2.x allows you to store data first and query it in
the moment, or later, in a flexible fashion.
Infrastructure level: Allows you to acquire all data in its original format and
store it in one place, cost-effectively and for an unlimited time, subject only
to constraints imposed by your compliance policies. This is possible because
Hadoop 2.x delivers 20x cheaper data storage than alternatives.
This step change in effectiveness and efficiency allows you to extract maximum
business value from the rapid growth in data volume, variety and velocity.
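The "store data first, query it later" pattern described above is commonly called schema-on-read. A minimal sketch in Python of the idea (the event fields and values below are invented for illustration, not drawn from this paper): raw events are landed exactly as they arrive, and a schema is projected onto them only when an analyst asks a question.

```python
import io
import json

# Land raw events as-is, with no schema declared up front.
raw_landing = io.StringIO()
for event in (
    '{"user": "alice", "action": "login", "ts": 1}',
    '{"user": "bob", "action": "purchase", "amount": 9.99, "ts": 2}',
):
    raw_landing.write(event + "\n")

# Later, at analysis time, project only the fields this question needs
# (schema-on-read); records lacking a field are simply not matched.
raw_landing.seek(0)
purchases = [
    json.loads(line)
    for line in raw_landing
    if json.loads(line).get("action") == "purchase"
]
total = sum(rec.get("amount", 0) for rec in purchases)
print(total)  # 9.99
```

The point of the sketch is that the "login" and "purchase" records carry different fields, yet both can be stored together; the schema exists only in the analysis code, not in the landing store.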
Business Drivers for a Data Lake
We posit that deploying Hadoop 2.x as an enterprise-wide shared service is the
best way of turning data into profit. We call this shared service a data lake.
The value created by Hadoop 2.x grows exponentially as data from more
applications lands in the data lake. More and more of that data will be retained
for decades. For many enterprises, data is becoming arguably as important as
capital and talent in the quest for profit. Therefore, it is important to future-proof
your investments in big data. Even your first Hadoop 2.x project should consider
the data lake as the target architecture.
In this white paper, you will learn about the technology that makes the data lake
a reality in your environment. We will introduce the modular target architecture
for the data lake, detail its functional requirements and highlight how the
enterprise-grade capabilities of Hadoop 2.x deliver on these expectations. As
more use cases join the data lake, more of the enterprise-grade functionality
that the second generation of Hadoop brings comes into focus.
1. Apache, Apache Hadoop and all Apache projects are trademarks of the Apache Software Foundation.
Many enterprises are making Hadoop 2.x available throughout the organization
in a very thoughtful manner to avoid the fragmentation of the legacy world.
Instead of deploying one Hadoop 2.x cluster per project, they are moving to
Hadoop 2.x as an enterprise-wide shared service for all projects, for all analytic
workloads and for all business units. Clearly it does not make sense for an
enterprise to have one cluster per team, or one cluster for each of Storm, HBase,
MapReduce and Lucene. In other words, these customers are avoiding data ponds
and moving toward a unified data lake.
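In Hadoop 2.x, this shared-service model is typically enforced through YARN, whose CapacityScheduler carves one physical cluster into guaranteed queues per tenant. A sketch of a capacity-scheduler.xml fragment (the queue names and percentage splits are illustrative assumptions, not taken from this paper):

```xml
<configuration>
  <!-- Two illustrative tenant queues sharing one physical cluster. -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>marketing,finance</value>
  </property>
  <!-- Guaranteed share of cluster capacity for each queue. -->
  <property>
    <name>yarn.scheduler.capacity.root.marketing.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.finance.capacity</name>
    <value>40</value>
  </property>
</configuration>
```

Each business unit submits its workloads against its own queue with a guaranteed share, while the scheduler can lend idle capacity to whichever queue needs it, which is what makes one cluster for all projects practical.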
BENEFITS
Hadoop 2.x is the foundation for the big data and analytics revolution that is
unfolding because of its superior economics and transformational capabilities
compared to legacy solutions. Enterprises moving away from fragmented data
storage and analytics silos stand to realize numerous benefits:
Deeper insights. Larger questions can be asked because all of the data is
available for analysis, including all time periods, all data formats and from all
sources. And semantic schemas are only defined at the time of analysis. The
combinatorial effect of analyzing all data produces a much deeper level of insight
than what is available in the stove-piped legacy world. Having a single
360-degree view of the customer relationship across time, across interaction
channels, products and organizational boundaries is a perfect example.
Actionable insights. Hadoop 2.x enables closed-loop analytics by bringing the
time to insight as close to real time as possible.
The data lake, as a shared service across the organization, also brings benefits
that are very similar to a private infrastructure cloud:
Agile provisioning and de-provisioning of both capacity and users. With
the data lake, your business users can be up and running in minutes. Simply tap
into the pool of capacity available, instead of setting up a dedicated Hadoop
cluster for every project. Moreover, should a project no longer be required, the
capex is not stranded since the physical assets can be repurposed. This means
no minimum investment is required to get started. Likewise, on-ramping a new
user is a matter of minutes both operationally and skills-wise. The single
infrastructure pool works seamlessly with a range of BI and analytics tools. For a
new user, this means they may already have different analytic lenses configured
for them the moment they log in for the first time. And they can use Microsoft
Excel as a familiar tool, simply selecting Hadoop as the data source.
Faster learning curve and reduced operational complexity. Hadoop is a new
topic to many IT professionals. Many organizations have found it beneficial to
create a Hadoop center of excellence in the organization, forming either a central
pool of expertise or a coordinated set of teams that can easily learn from each
other. A data lake provides the natural setting for this to happen. Having
dedicated technical and organizational Hadoop silos for every project would