
Seize the Data With SAP

An ever-growing assortment of business intelligence and data warehousing offerings
from SAP, plus the newest swarm of industry buzzwords, can confound organizations
exploring ways to house data and mine it for insight. BY ETHAN JEWETT

EDITOR'S NOTE

CONFUSED BY BIG DATA HYPE? BUZZWORDS DON'T HELP

CUT THROUGH THE SAP-HADOOP FOG

BENEATH THE SURFACE OF DATA TRANSFORMATION

EDITOR'S NOTE


A Discipline, Not a Technology

SAP has so many approaches to business
intelligence and data warehousing that one can
be forgiven for not grasping them all. With
the roster evolving annually, some overarching
framework of understanding is needed. That's
what consultant Ethan Jewett, an SAP Mentor who specializes in BI and data management
issues, offers in this three-part guide.
Elsewhere on SearchSAP, Jewett has written that the practical methods involve either
the data warehouse modeling and management
application, Business Warehouse, or a mix of
SAP and third-party tools. But data warehousing is really a discipline for integrating and
managing data over time.
First, he puts the biggest buzzwords in data
management in perspective and shows why
plain-English words like honesty, integrity and
transparency are better signposts. Honesty, for
example, means telling the truth about the
accuracy of data so users know how reliable it
is for prediction and other analysis.


Next, Jewett addresses Hadoop, today's most
talked-about technology for big data and the
target of many BI projects. He lays out how well
SAP products such as HANA and Data Services
integrate with Hadoop's various parts.
The guide closes by examining the analytics and visualization software through which
BI users interact with data. It's impossible to
understand these user-interface tools without
knowing something about the data transformation that products including BW use to first
filter and organize the data.
Jewett says separating the layers invites
trouble by masking data-integrity issues, and
he argues for a seamless, integrated approach
that gives users the power to fix problems on
the spot.
David Essex
Executive Editor, SearchSAP

ANALYTICS

Confused By Big Data Hype? Buzzwords Don't Help


As anyone who's spent any time in IT
knows, buzzwords are big. And nowhere are
they bigger than in data management. The hype
swirling around big data, for example, driven
by declarations like "Data scientists running
real-time, in-memory predictive analytics
on big data will surely be a game changer for
your business!", is commonplace in today's
market.
In reality, the products or services behind
these data buzzwords may disappoint. Gleaning
important insights from your data is a difficult,
labor-intensive and often tedious process.
This isn't to say that all vendorspeak is
meaningless, but it can be tough to tell the
difference between genuine technical terms
and cant. The former is a marker of expertise,
while the latter is indicative of sloppy thinking. I don't consider myself an expert in statistics or data science, but I've learned enough
about the concepts and techniques behind the
buzzwords to know that each comes with its
own tradeoffs and pitfalls. If we're not clear on
what these techniques are when embarking on
data-driven projects, we run the risk of project
failure.

FALSE PREDICTIONS

For example, what we often refer to as predictive analytics are algorithms that find potential correlations and trends in data. Predictive
algorithms aren't actually predictive; at best,
they tell you what will probably happen if the
future is like the past. At worst, they are highly
susceptible to false positives, in which correlations that don't actually exist are mistakenly
identified.
False correlations appear because of the
random distribution of the data or through
an error in the analysis method. They don't
indicate a real-world phenomenon. Tools that
make predictive algorithms accessible and
easier to run can exacerbate the false positive
problem because the more analyses are run,
the more likely random error will result in the
appearance of a correlation. Untrained operators, and even trained operators, have a tendency to forget the analyses that didn't
find a significant correlation and remember only the
analyses that got a result. But finding correlations is inevitable when running enough analyses. True analytics software should be smart
enough to recognize this.
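To make the false positive problem concrete, here is a minimal Python sketch. It is an illustration only, not something taken from any SAP product: the function names, the sample size of 30 and the cutoff of 0.36 (roughly the correlation needed for p < 0.05 at that sample size) are all assumptions chosen for the example. It correlates pairs of unrelated random series and counts how often a "significant" result turns up anyway.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_hits(num_analyses, sample_size=30, threshold=0.36, seed=0):
    """Correlate unrelated random series and count results above the cutoff.

    Every hit is a false positive by construction, because the two series
    are generated independently. The threshold is an illustrative assumption.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_analyses):
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [rng.gauss(0, 1) for _ in range(sample_size)]
        if abs(pearson_r(xs, ys)) >= threshold:
            hits += 1
    return hits


def family_wise_error_rate(num_analyses, alpha=0.05):
    """Chance of at least one false positive across independent analyses."""
    return 1 - (1 - alpha) ** num_analyses


if __name__ == "__main__":
    for n in (1, 20, 100):
        print(n, "analyses -> chance of a false hit:",
              round(family_wise_error_rate(n), 3))
    print("spurious hits in 1,000 runs:", count_spurious_hits(1000))
```

The point of the sketch is the shape of the curve rather than the exact numbers: with a 5 percent error rate per analysis, a tool that makes it easy to run dozens of analyses makes at least one spurious correlation close to certain.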
Worse, error is rarely random. There is
always a process by which the data was gathered and consolidated. At every step in that
process there is the opportunity to introduce
errors in the data. These errors will tend to
introduce false correlations. For example, you
might do an analysis on profitability data, but
that data is missing sales figures for several
products from the western U.S. because of a
bug that was introduced to the system earlier
in the year. Your predictive analytics software
will show that the region's profit contribution
has been going downhill and will probably continue down that path. In reality, your analysis
is missing sales data but including cost and
overhead data for these products. Your software might make it look like you should cut
overhead to compensate, when it should really
remind you to check your data, or suggest that
the software deployment that introduced the bug
seems to be correlated with the change in performance for the region and the disappearance of
revenue for several products. Software vendors
haven't included this kind of analysis functionality in their software because it's very hard to
engineer and it doesn't address the buzzwords
that are driving software industry sales.
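The kind of check described here does not have to be sophisticated to be useful. The following Python sketch assumes a hypothetical record layout (the field names region, product, month, revenue and cost are invented for illustration) and flags rows that carry cost but no revenue before any profit figure is trusted.

```python
def find_cost_without_revenue(records):
    """Return (region, product, month) keys that report cost but no revenue."""
    suspicious = []
    for rec in records:
        has_cost = rec.get("cost") is not None
        has_revenue = rec.get("revenue") is not None
        if has_cost and not has_revenue:
            suspicious.append((rec["region"], rec["product"], rec["month"]))
    return suspicious


def naive_profit(records, region):
    """Profit as a naive tool computes it: missing revenue silently becomes 0."""
    return sum(
        (rec.get("revenue") or 0) - (rec.get("cost") or 0)
        for rec in records
        if rec["region"] == region
    )


# Hypothetical sample data: product B in the West loses its revenue figure
# in February, while its cost is still being reported.
records = [
    {"region": "West", "product": "A", "month": "2014-01", "revenue": 100, "cost": 60},
    {"region": "West", "product": "B", "month": "2014-01", "revenue": 80, "cost": 50},
    {"region": "West", "product": "B", "month": "2014-02", "revenue": None, "cost": 50},
    {"region": "East", "product": "A", "month": "2014-02", "revenue": 90, "cost": 55},
]

if __name__ == "__main__":
    print("naive West profit:", naive_profit(records, "West"))
    print("rows to check first:", find_cost_without_revenue(records))
```

A naive profit calculation treats the missing revenue as zero and reports a weaker West region. The completeness check points at the one row that explains the drop, which is the reminder to check your data that the analysis should have offered.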

UNDERSTANDING THE VALUE OF DATA

On that note, I'd like to introduce a few terms
of my own that could help unlock the potential
of data:

Honesty: Always showing the data as it is to
the best of our ability. For example, showing error bars on our charts and making sure
that we don't imply a level of accuracy that
doesn't exist, both in the data and in visualizations based on our data. The predictive
software mentioned above might give the
impression that the data is reliable when
it's not. This software would fail an honesty
test.


Integrity: Making sure that our data directly
reflects reality. This means expending effort
to avoid a situation in which the measurement, collection and preparation methods
we use introduce their own trends into our
data. The missing stock-keeping-unit example
above shows a lack of integrity in our data
preparation.


Transparency: Ensuring the honesty and
integrity of our data. Ideally, when a person is
looking at any data, the details of every step
of the process, from measurement to collection and aggregation to visualization, should
be available so that the viewer can assess the
quality of the data. For example, the analytics
software mentioned above that shows a profit
margin trend line for the western U.S. should
also show information about the source of
the data, which might lead an operator to
notice that the beginning of the downward
trend correlated with the introduction of a
new software deployment.
This kind of transparency requires maintaining meaningful data lineage information
and making that information directly available in the analytic context.
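One way to picture honesty and transparency together is a figure that never travels without its uncertainty and its history. The sketch below is a hypothetical data structure, not an SAP feature; the class names, fields and sample dates are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LineageStep:
    stage: str         # e.g. "collection", "aggregation", "deployment"
    description: str
    performed_on: str  # ISO 8601 date, so plain string comparison orders it


@dataclass
class Metric:
    name: str
    value: float
    margin_of_error: float  # honesty: the uncertainty travels with the number
    lineage: List[LineageStep] = field(default_factory=list)

    def describe(self) -> str:
        """Render the figure with its error margin and every recorded step."""
        lines = [f"{self.name}: {self.value} +/- {self.margin_of_error}"]
        for step in self.lineage:
            lines.append(
                f"  [{step.performed_on}] {step.stage}: {step.description}"
            )
        return "\n".join(lines)

    def steps_since(self, iso_date: str) -> List[LineageStep]:
        """Lineage steps on or after a date, e.g. the start of a trend."""
        return [s for s in self.lineage if s.performed_on >= iso_date]


# Hypothetical example values, invented for this sketch.
margin = Metric(
    name="Western U.S. profit margin",
    value=0.12,
    margin_of_error=0.03,
    lineage=[
        LineageStep("collection", "Sales and cost rows exported", "2014-01-05"),
        LineageStep("deployment", "New release of the export job", "2014-03-10"),
        LineageStep("aggregation", "Monthly rollup by region", "2014-04-01"),
    ],
)

if __name__ == "__main__":
    print(margin.describe())
```

With the lineage attached, an operator who sees a downward trend starting in March can ask for the steps recorded since then and find the deployment sitting right next to the number.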
The bar I'm setting here is high, perhaps, but
here's the takeaway: The persistent and most
common problem in the data management
business isn't handling size, providing speed or
automatically predicting the future. The problem is getting quality data in front of experts
in an honest, transparent format that provides
good interactions with the data and helps them
draw their own conclusions with confidence.
Buzzwords like big data, real time, in-memory
and predictive analytics don't provide business
value on their own, but in the service of honesty, integrity and transparency they can make
a major contribution to the value of our business data.

INTEGRATION

Cut Through the SAP-Hadoop Fog


Hadoop is hot. But what is Hadoop? It's
an umbrella project under The Apache Software
Foundation that includes several core tools
for handling data processing on large computing clusters. There is also a large ecosystem of
related tools around the core Hadoop project,
and there are multiple Hadoop distributions
from companies like Cloudera, Hortonworks,
IBM, Intel and MapR. Each distribution offers
some combination of the core tools, ecosystem
tools and, often, proprietary replacements for
other pieces of the Hadoop pie that the distribution packager considers better in some way.
There is no one tool or set of tools called
Hadoop, so it is wise to react cautiously when
vendors claim to offer "Hadoop integration."
The vendor may integrate with a single tool
in the Hadoop core or ecosystem, with
several, or with none at all. SAP's integration
with Hadoop suffers from this confusion as
much as any vendor's, so I thought it would be
worthwhile to dig into exactly how SAP's software integrates with the various Hadoop tools.
First, let's define Hadoop. It includes a few
core tools. Those are:

Hadoop Distributed File System (HDFS), a
distributed file system that can run on a large
cluster of computers to store huge amounts
of data. Other Hadoop tools tend to be set up
to use data stored on HDFS.

YARN (Yet Another Resource Negotiator),
the core cluster resource management
framework. Most Hadoop ecosystem tools
run on a YARN cluster.

MapReduce, a system for doing parallel
processing of large data sets; it's based on a Google
research paper from 2004. This was the
original Hadoop, but few vendors that offer
Hadoop integration use MapReduce directly.

DIFFERENCES IN INTEGRATION

Hadoop also has a massive ecosystem of tools
built around or on top of these core tools. Some
ecosystem projects are also hosted at Apache.
Others live elsewhere. The following are a few
key projects hosted at the ASF:

Hive: Billed as the Hadoop data warehouse,
Hive is a distributed database with a data
definition and query language, called HQL,
that is similar to standard SQL. Hive tables
can be managed by Hive, or they can be defined as external tables on top of files on
HDFS, HBase and many other data sources.
In this way, Hive is often a gateway to data
stored in Hadoop ecosystem tools.

Pig: A language and execution platform for
creating data analysis programs.

HBase: A massively parallel, short-request
database, originally modeled on Google's Bigtable research paper.

Other projects include Spark (in-memory
cluster computing and streaming framework),
Shark (Hive on Spark), Mahout (analytics
algorithms library), ZooKeeper (a centralized
service for maintaining information on configuration and other factors) and Cassandra
(similar to HBase).
So how do SAP's products integrate with
Hadoop tools? At the moment, SAP offers what
it calls Hadoop integration in SAP HANA,
Sybase IQ, SAP Data Services and SAP BusinessObjects Business Intelligence (BI). Each of
these integrates with Hadoop tools differently.
SAP HANA and Sybase IQ both support
forwarding queries and other operations to
a remote Apache Hive system as if the Hive
tables were local tables. In Sybase IQ, this
setup is called a remote database, and in
HANA the setup is through the Smart Data
Access mechanism. IQ also supports a type
of user-defined function to process data
on the database server called a MapReduce
API. Despite SAP lumping this API under its
Hadoop integration marketing, it has nothing
to do with Hadoop.
SAP BusinessObjects BI supports access to
Apache Hive schemas through the universe
concept, much as you might connect to any
other database. This type of connection theoretically allows access to data in many different storage systems through Hive's external
table concept, including HBase, Cassandra
and MongoDB.


THE HADOOP INTEGRATION PROMISE

So far we've seen that SAP's Hadoop integration is usually just Hive integration. Integrating
with Hive via HQL is great and is what most
vendors mean when they claim Hadoop integration. But it's different from the image of deep
integration across the varied Hadoop ecosystem
tools that these vendors want to project.
SAP Data Services starts to deliver on the
Hadoop integration promise a bit more. In
addition to the ability to load data to and from
Hive, Data Services can create and read HDFS
files directly and do some transformation
push-down operations using Pig scripts. This
means that data can be joined and filtered
directly in the Hadoop cluster rather than
needing to move to the Data Services server
to be processed. Data Services also is able to
offload its text data processing onto a Hadoop
cluster as MapReduce jobs. So here, SAP is
justified in implying deeper integration across
multiple Hadoop tools.
Lastly, a word of warning: The Hadoop ecosystem moves fast and enterprise software
often lags Hadoop. According to SAP's product
availability matrix, support for Hive, Pig and
HDFS is limited to fairly old versions that
don't support the latest improvements in performance, high availability and cluster capacity.
Check vendor claims of support for your versions of specific Hadoop tools carefully because
Hadoop versioning is confusing and enterprise software vendor representatives may not
understand it fully.

OUTLOOK

Beneath the Surface of Data Transformation


Data transformation and preparation,
data visualization and business intelligence
software are undergoing a sea change, even if it
sometimes seems nothing much has changed
in the last 15 years.
The transformation overtaking the industry
appears to be in its early days but is driven by
persistent problems with IT agility, data quality and the lack of transparency in the systems
that manage and display data.
We are clearly moving in the direction of
faster and more visual interaction with data,
but we are only scratching the surface with
regard to understanding and interacting with
it.

THE OLD STANDBYS

In current standard software products, data
transformation operations like combining,
filtering and fixing data are strictly separate
from data visualization and analysis functions.
Transforming or changing data is a task usually reserved for technical people and accomplished with process-oriented tools like SAP's Data
Services and Business Warehouse (BW) and
standard computer programming languages
like Java or Python.
The output of transformation tools (usually
fairly static database tables) is the input for
separate data analysis and visualization. Most
tools, like SAP's Crystal Reports, allow users
to run prepared queries to illustrate a single
aggregated slice of the database. More advanced
data analysis tools allow the user to navigate
with some flexibility within the bounds of the
pre-existing data set. Usually these more flexible tools appear as analytics tools (SAP's Analysis for Office or Design Studio dashboards),
though there is no reason these types of flexible but constrained analyses might not be useful in business process contexts.


Some existing tools, usually billed as self-service BI or data exploration, incorporate
basic data preparation capabilities, usually using
a process- or programming-based view of the
data preparation stage. Tableau Software and
QlikView were two pioneers of this approach,
providing fairly advanced data visualization
capabilities on a platform where the user was
responsible for all data loading and preparation
tasks. SAP's Lumira follows in these footsteps,
giving users a way to load new data, connect to
existing data sets or join some combination of
data sets, and then visualize the data.
But the strict separation of the visualization
or analysis process from data transformation is
a nagging weakness of all these tools. When do
people realize there's a problem with data that
needs to be resolved? When they are visualizing it or running analytics functions on it. So
why not allow a user to fix the problem then
and there?

ON THE HORIZON

A different approach to data transformation,
more closely aligned with the actual structure of
the data, is emerging as a popular alternative. It
trades the process-oriented approach to data
transformation for one that follows
the internal structure of the data being
processed. That approach is to display even
very large data sets as spreadsheets and provide
the user with data transformation options that
are mapped onto the spreadsheet paradigm.
This is not a new approach, but the cohort of
tools developed around 2010 to 2012 (OpenRefine, Data Wrangler, IBM's BigSheets) was
the first of this type to gain widespread
adoption.
The idea is that the spreadsheet or table
is a pretty direct visual representation of the
raw structure of many standard data formats.
Showing a database table in a tabular format
makes its structure and a small amount of the
data in the table explicit. Given the proper
tools, that structure and data can be manipulated in a way that is immediately visible in the
spreadsheet view, and which can be mapped
back onto the original data set.
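The idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how OpenRefine or any other named tool is implemented: the class TableSession and its method names are hypothetical. Edits are recorded as an ordered log while the user looks at a small preview, and the same log is then replayed against the full data set.

```python
class TableSession:
    """Record edits made against a small preview, then replay them in full."""

    def __init__(self, rows, preview_size=3):
        self.rows = list(rows)          # original rows are never modified
        self.preview_size = preview_size
        self.log = []                   # ordered (description, step) pairs

    def transform_column(self, column, func, description):
        """Queue an edit that rewrites one column in every row."""
        def step(rows):
            return [{**row, column: func(row[column])} for row in rows]
        self.log.append((description, step))

    def keep_rows(self, predicate, description):
        """Queue a filter that keeps only rows matching the predicate."""
        def step(rows):
            return [row for row in rows if predicate(row)]
        self.log.append((description, step))

    def _replay(self, rows):
        for _, step in self.log:
            rows = step(rows)
        return rows

    def preview(self):
        """What the user sees: the log applied to the first few rows."""
        return self._replay(self.rows[: self.preview_size])

    def apply_to_full_data(self):
        """The same log mapped back onto the whole data set."""
        return self._replay(self.rows)

    def history(self):
        return [description for description, _ in self.log]


# Hypothetical messy input, invented for this sketch.
rows = [
    {"region": " west", "sales": "100"},
    {"region": "East ", "sales": "250"},
    {"region": "WEST", "sales": ""},
    {"region": "east", "sales": "75"},
]

session = TableSession(rows, preview_size=2)
session.transform_column("region", lambda v: v.strip().title(),
                         "Normalize region names")
session.keep_rows(lambda r: r["sales"] != "", "Drop rows with no sales figure")
session.transform_column("sales", int, "Convert sales to integers")

if __name__ == "__main__":
    print(session.preview())
    print(session.apply_to_full_data())
    print(session.history())
```

Keeping the edits as a replayable log rather than changing cells in place is the design choice that lets a tool show results immediately on a sample while still applying them to a data set too large to display.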
It appears that spreadsheet-driven data
transformation has legs, getting good uptake
in the form of OpenRefine. And it's receiving significant attention in upcoming products
like Trifacta and Spark Cloud, the latter of
which uses related concepts of tabular representations of data. This approach begins to
address the severe lack of analytics and visualization tools integrated into the data transformation process, giving the people processing
data the tools to assess and understand data as
they change it. But deep analytics and specialized visualization tools remain separate.

THE FUTURE

The current trend is to make data transformation a more visual experience, making the
results of data transformations on the data set
itself more explicit and immediate. But the job
of extracting meaning from data is still left to
more specialized interfaces, usually operating on aggregated slices of the full data set and
often featuring visual abstractions like charts
and graphs.
But there's a tension implicit in this arrangement: As stated already, understanding data
and extracting meaning from it is an integral
part of the process of transforming data. One
can't really know how to transform a data set
without understanding it, and it's usually in
the process of extracting meaning from data
that we find problems with the data that need
to be fixed, or realize that the data is incomplete for our purposes and has to be augmented
with another data set. In other words, the
process of visualization is exactly the point
at which we want to be able to change the
underlying data, but our tools prohibit us from
doing this.
I expect that over the next five to 10 years,
we will begin to see this tension addressed in
earnest, with more products allowing editing or
augmentation of data through the visualization
interface.
It's currently an active area of research,
including in the Palladio research project,
on which (full disclosure) I work as lead
developer.
In some sense, the products based on the
spreadsheet paradigm are one of the first mass-market implementations of this approach.
Most likely, these products and others like
them will continue to improve their visualization capabilities while maintaining the ability
to change data through these visualizations.
If visualization-focused vendors are paying
attention, they will also start to incorporate
data manipulation capabilities into their visualization tools. It will be interesting to see who
will manage to address this gap most quickly
and comprehensively.

ABOUT THE AUTHOR

ETHAN JEWETT is an independent consultant and SAP
Mentor who focuses on business intelligence, data management and performance management. Follow him on
Twitter: @esjewett.
Seize the Data With SAP
is a SearchSAP.com e-publication.

Scot Petersen | Editorial Director
Jason Sparapani | Managing Editor, E-Publications
Joe Hebert | Associate Managing Editor, E-Publications
David Essex | Executive Editor
Linda Koury | Director of Online Design
Neva Maniscalco | Graphic Designer
Doug Olender | Publisher | dolender@techtarget.com
Annie Matthews | Director of Sales | amatthews@techtarget.com

TechTarget
275 Grove Street, Newton, MA 02466
www.techtarget.com
© 2014 TechTarget Inc. No part of this publication may be transmitted or reproduced in any form or by any means without written permission from the
publisher. TechTarget reprints are available through The YGS Group.


About TechTarget: TechTarget publishes media for information technology
professionals. More than 100 focused websites enable quick access to a deep
store of news, advice and analysis about the technologies, products and processes crucial to your job. Our live and virtual events give you direct access to
independent expert commentary and advice. At IT Knowledge Exchange, our
social community, you can get advice and share solutions with peers and experts.
COVER PHOTOGRAPH: DIGITAL VISION/THINKSTOCK