
PREVIEW VERSION

This guide is based on the May 2013 release of HDInsight.


Later releases may differ from the version described in this guide.

Copyright
Preface
What This Guide Is, and What It Is Not
Who This Guide Is For
Why This Guide Is Pertinent Now
A Roadmap for This Guide
Using the Examples for this Guide
Who's Who
Chapter 1: What is Big Data?
What Is Big Data?
What Is Microsoft HDInsight?
Where Do I Go Next?
Summary
More Information
Chapter 2: Getting Started with HDInsight
An Overview of the Big Data Process
Determining Analytical Goals and Source Data
Planning and Configuring the Infrastructure
Summary
More Information
Chapter 3: Collecting and Loading Data
Challenges
Common Data Load Patterns for Big Data Solutions
Uploading Data to HDInsight
Summary
More Information
Chapter 4: Performing Queries and Transformations
Challenges
Processing the Data
Evaluating the Results
Tuning the Solution
Summary
More Information
Chapter 5: Consuming and Visualizing the Results
Challenges
Options for Consuming Data from HDInsight
Consuming HDInsight Data in Microsoft Excel
Consuming HDInsight Data in SQL Server Reporting Services
Consuming HDInsight Data in SQL Server Analysis Services
Consuming HDInsight Data in a SQL Server Database
Custom and Third Party Analysis and Visualization Tools
Summary
More Information
Chapter 6: Debugging, Testing, and Automating HDInsight Solutions
Challenges
Debugging and Testing HDInsight Solutions
Executing HDInsight Jobs Remotely
Creating and Using Automated Workflows
Building End-to-End Solutions on HDInsight
Summary
More Information
Chapter 7: Managing and Monitoring HDInsight
Challenges
Administering HDInsight Clusters and Jobs
Configuring HDInsight Clusters and Jobs
Retaining Data and Metadata for a Cluster
Monitoring HDInsight Clusters and Jobs
Summary
More Information
Scenario 1: Iterative Data Exploration
Finding Insights in Data
Introduction to Blue Yonder Airlines
Analytical Goals and Data Sources
Iterative Data Processing
Consuming the Results
Summary
Scenario 2: Extract, Transform, Load
Introduction to Northwind Traders
ETL Process Goals and Data Sources
The ETL Workflow
Encapsulating the ETL Tasks in an Oozie Workflow
Automating the ETL Workflow with the Microsoft .NET SDK for Hadoop
Summary
Scenario 3: HDInsight as a Data Warehouse
Introduction to A. Datum
Analytical Goals and Data Sources
Creating the Data Warehouse
Loading Data into the Data Warehouse
Optimizing Query Performance
Analyzing Data from the Data Warehouse
Summary
Scenario 4: Enterprise BI Integration
Introduction to Adventure Works
The Existing Enterprise BI Solution
The Analytical Goals
The HDInsight Solution
Report-Level Integration
Corporate Data Model Integration
Data Warehouse Integration
Summary
Appendix A: Data Ingestion Patterns using the Twitter API
About the Twitter API
Using Interactive Data Ingestion
Automating Data Ingestion with SSIS
Capturing Real-Time Data from Twitter with StreamInsight
Appendix B: An Overview of Enterprise Business Intelligence
Business Data Sources
ETL
The Data Warehouse
Corporate Data Models


Copyright
This document is provided as-is. Information and views expressed in this document, including URL and other
Internet website references, may change without notice. You bear the risk of using it. Some examples depicted
herein are provided for illustration only and are fictitious. No real association or connection is intended or
should be inferred.
© 2013 Microsoft. All rights reserved.
Microsoft, Microsoft Dynamics, Active Directory, MSDN, SharePoint, SQL Server, Visual C#, Visual C++, Visual
Basic, Visual Studio, Windows, Windows Azure, Windows Live, Windows PowerShell, Windows Server, and
Windows Vista are trademarks of the Microsoft group of companies.
All other trademarks are the property of their respective owners.
Preface
Do you know what visitors to your website really think about your carefully crafted content? Or, if you run a
business, can you tell what your customers actually think about your products or services? Did you realize that
your latest promotional campaign had the biggest effect on people aged between 40 and 50 living in Wisconsin
(and, more importantly, why)?
Being able to get answers to these kinds of questions is increasingly vital in today's competitive environment,
but the source data that can provide these answers is often hidden away; and when you can find it, it's very
difficult to analyze successfully. It might be distributed across many different databases or files, be in a format
that is hard to process, or may even have been discarded because it didn't seem useful at the time.
To resolve these issues, data analysts and business managers are fast adopting techniques that were commonly
at the core of data processing in the past, but have been sidelined in the rush to modern relational database
systems and structured data storage. The new buzzword is Big Data, and the associated solutions encompass a
range of technologies and techniques that allow you to extract real, useful, and previously hidden information
from the often very large quantities of data that previously may have been left dormant and, ultimately, thrown
away because storage was too costly.
In the days before Structured Query Language (SQL) and relational databases, data was typically stored in flat
files, often in simple text format, with fixed-width columns. Application code would open and read one or more
files sequentially, or jump to specific locations based on the known line width of a row, to read the text and
parse it into columns to extract the values. Results would be written back by creating a new copy of the file, or a
separate results file.
Modern relational databases put an end to all this, giving us greater power and additional capabilities for
extracting information simply by writing queries in a standard format such as SQL. The database system hides all
the complexity of the underlying storage mechanism and the logic for assembling the query that extracts and
formats the information. However, as the volume of data that we collect continues to increase, and the native
structure of this information is less clearly defined, we are moving beyond the capabilities of even enterprise-
level relational database systems.
Big Data solutions provide techniques for storing vast quantities of structured, semi-structured, and
unstructured data that is distributed across many data stores, and querying this data where it's stored instead of
moving it all across the network to be processed in one location, as is typically the case with relational
databases. The huge volumes of data, often multiple Terabytes or Petabytes, mean that distributed queries are
far more efficient than streaming all the source data across a network.
Big Data solutions also deal with the issues of data formats by allowing you to store the data in its native form,
and then apply a schema to it later, when you need to query it. This means that you don't inadvertently lose any
information by forcing the data into a format that may later prove to be too restrictive. It also means that you
can simply store the data now, even though you don't know how, when, or even whether it will be useful, safe
in the knowledge that, should the need arise in the future, you can extract any useful information it contains.
Microsoft offers a Big Data solution called HDInsight, which is currently available as an online service in Windows
Azure. There is also a version of the core Hortonworks Data Platform (on which HDInsight is based) that you can
install on Windows Server as an on-premises mechanism. This guide primarily focuses on HDInsight for Windows
Azure, but explores more general Big Data techniques as well. For example, it provides guidance on collecting
and storing the data, and designing and implementing real-time and batch queries using the Hive and Pig query
languages, as well as using custom map/reduce components.
Big Data solutions can help you to discover information that you didn't know existed, complement your existing
knowledge about your business and your customers, and boost competitiveness. By using the cloud as the data
store and Windows Azure HDInsight as the query mechanism, you benefit from very affordable storage costs (at
the time of writing, 1 TB of Windows Azure storage costs only $95 per month), and the flexibility and elasticity of
the pay-as-you-go model where you only pay for the resources you use.

What This Guide Is, and What It Is Not
Big Data is not a new concept. Distributed processing has been the mainstay of supercomputers and high
performance data storage clusters for a long time. What's new is standardization around a set of open source
technologies that make distributed processing systems much easier to build, combined with the growing need to
manage and get information from the ever increasing volume of data that modern society generates. However,
as with most new technologies, Big Data is surrounded by a great deal of hype that often gives rise to unrealistic
expectations, and ultimate disappointment.
Like all of the releases from Microsoft patterns & practices, this guide avoids the hype by concentrating on the
"whys" and the "hows." In terms of the "whys," the guide explains the concepts of Big Data, gives a focused
view on what you can expect to achieve with a Big Data solution, and explores the capabilities of Big Data in
detail so that you can decide for yourself if it is an appropriate technology for your own scenarios. The guide
does this by explaining what a Big Data solution can do, how it does it, and the types of problems it is designed
to solve.
In terms of the "hows," the guide continues with a detailed examination of the typical architectural models for
Big Data solutions, and the ways that these models integrate with the wider data management and business
information environment, so that you can quickly understand how you might apply a Big Data solution in your
own environment.
The guide then focuses on the technicalities of implementing a Big Data solution using Windows Azure
HDInsight. It does this by helping you understand the fundamentals of the technology, when it is most useful in
different scenarios, and the advantages and considerations that are relevant. This technical good-practice
guidance is enhanced by a series of examples of real-world scenarios that focus on different areas where Big
Data solutions are particularly relevant and useful. For each scenario you will see the aims, the design concept,
the tools, and the implementation; and you can download the example code to run and experiment with
yourself.
This guide is not a technical reference manual for Big Data. It doesn't attempt to cover every nuance of
implementing a Big Data solution, or writing complicated code, or pushing the boundaries of what the
technology is designed to do. Neither does it cover all of the myriad details of the underlying mechanism; there
is a multitude of books, websites, and blogs that explain all these things. For example, you won't find in this
guide a list of all of the configuration settings, the way that memory is allocated, the syntax of every type of
query, and the many patterns for writing map and reduce components.
What you will see is guidance that will help you understand and get started using HDInsight to build realistic
solutions capable of answering real world questions.

This guide is based on the May 2013 release of HDInsight on Windows Azure. Later releases
of HDInsight may differ from the version described in this guide.

Who This Guide Is For
This guide is aimed at anyone who is already collecting, or is thinking of collecting, large volumes of data, and
needs to analyze it to gain a better business or commercial insight into the information hidden inside. This
includes business managers, data analysts, database administrators, business intelligence (BI) system architects,
and developers who need to create queries against massive data repositories.

Why This Guide Is Pertinent Now
Businesses and organizations are increasingly collecting huge volumes of data that may be useful now or in the
future, and they need to know how to store and query it to extract the hidden information it contains. This
might be web server log files, click-through data, financial information, medical data, user feedback, or a range
of social sentiment data such as tweets or comments to blog posts.
Big Data techniques and technologies such as HDInsight provide a way to efficiently store this data,
analyze it to extract useful information, and export data to tools and applications that can visualize the
information in a range of ways. It is, realistically, the only way to handle the volume and the inherent
unstructured nature of this data.
No matter what type of service you provide, what industry or market sector you are in, or even if you only run a
blog or website forum, you are highly likely to benefit from collecting and analyzing data that is easily available,
often collected automatically (such as server log files), or can be obtained from other sources and combined
with your own data to help you better understand your customers, your users, and your business; and to help
you plan for the future.

A Roadmap for This Guide
This guide contains a range of information to help you understand where and how you might apply Big Data
solutions. This is the map of our guide:


Chapter 1, What is Big Data? provides an overview of the principles and benefits of Big Data solutions, and the
differences between these and the more traditional database systems, to help you decide where, when, and
how you might benefit from adopting a Big Data solution. This chapter also discusses Windows Azure HDInsight,
and its place within the comprehensive Microsoft data platform.
Chapter 2, Getting Started with HDInsight, explains how you can implement a Big Data solution using
Windows Azure HDInsight. It contains general guidance for applying Big Data techniques by exploring in more
depth topics such as defining the goals, locating data sources, the typical architectural approaches, and more.
You will find much of the information about the design and development lifecycle of a Big Data solution useful,
even if you choose not to use HDInsight as the platform for your own solution.
Chapter 3, Collecting and Loading Data, explores the patterns and techniques for loading data into an
HDInsight cluster. It describes several different approaches such as handling streaming data, automating the
process, and source data compression.
Chapter 4, Performing Queries and Transformations, describes the tools you can use in HDInsight to query
and manipulate data in a cluster. This includes using the interactive console; and writing scripts and commands
that are executed from the command line. The chapter also includes advice on evaluating the results and
maximizing query performance.
Chapter 5, Consuming and Visualizing the Results, explores the ways that you can transfer the results from
HDInsight into analytical and visualization tools to generate reports and charts, and how you can export the
results into existing data stores such as databases, data warehouses, and enterprise BI systems.
Chapter 6, Debugging, Testing, and Automating HDInsight Solutions, explores how, if you discover a useful and
repeatable process that you want to integrate with your existing business systems, you can automate all or part
of the process. The chapter looks at techniques for testing and debugging queries, unit testing your code, and
using workflow tools and frameworks to automate your HDInsight solutions.
Chapter 7, Managing and Monitoring HDInsight, explores how you can manage and monitor an HDInsight
solution. It includes topics such as configuring the cluster, backing up the data, remote administration, and using
the logging capabilities of HDInsight.

The remaining chapters of the guide concentrate on specific scenarios for applying a Big Data solution, ranging
from analysis of social media data to integration with enterprise BI systems. Each chapter explores specific
techniques and solutions relevant to the scenario, and shows an implementation based on Windows Azure
HDInsight. The scenarios the guide covers are:
Chapter 8, Scenario 1: Iterative Data Exploration, demonstrates the way that you can use HDInsight to see
if there is any useful information to be found in unstructured data that you may already be collecting.
This could be data that refers to your products and services, such as emails, comments, feedback, and
social media posts, or even data that refers to a specific topic or market sector that you are interested in
investigating. In the example you will see how a company used data from Twitter to look for interesting
information about its services.
Chapter 9, Scenario 2: Extract, Transform, Load, explores the use of HDInsight as a tool for performing
ETL operations. It describes how you can capture data that is not in the traditional row and column
format (in this example, XML formatted data) and apply data cleansing, validation, and transformation
processes to convert the data into the form required by your database system. The scenario centers on
analyzing financial data. In addition to using the information to predict future patterns and make
appropriate business decisions, it also provides the capability to detect unusual events that may indicate
fraudulent activity.
Chapter 10, Scenario 3: HDInsight as a Data Warehouse, describes how you can use HDInsight as both
a commodity store for huge volumes of data, and as a simple data warehouse mechanism that can
power your organization's applications. The scenario uses geospatial data about extreme weather
events in the US, and shows how you can design and implement a data warehouse that can store,
process, enrich, and then expose this data to visualization tools.

Chapter 11, Scenario 4: Enterprise BI Integration covers one of the most common scenarios for Big
Data solutions: analyzing log files to obtain information about visitors or users, and visualizing this
information in a range of different ways. This information can help you to better understand trends and
traffic patterns; plan for the required capacity now and into the future; and discover which services,
products, or areas of a website are underperforming. In this example you will see how a company was
able to extract useful data and then integrate it with existing BI systems at different levels, and by using
different tools.

Finally, the appendices contain additional information related to the scenarios described in the guide:
Appendix A, Data Ingestion Patterns using the Twitter API, demonstrates the three main patterns for
loading data into HDInsight. These patterns are interactive data ingestion, automated data ingestion
using SQL Server Integration Services, and capturing streaming data using Microsoft StreamInsight.
Appendix B, An Overview of Enterprise Business Intelligence, provides additional background information about enterprise
data warehouses and BI that, if you are not familiar with these systems, you will find helpful when
reading Chapter 2 and the BI Integration scenarios described in the guide.
Using the Examples for this Guide
The scenarios you see described in this guide explore the techniques and technologies of HDInsight, Hadoop,
and many related products and frameworks. The chapters in the guide discuss the goals for each scenario and
the practical solutions to the challenges encountered, as well as providing an overview of the implementation.
If you want to try out these scenarios yourself, you can use the Hands-on Labs that are available separately for
this guide. These labs contain all of the source code, scripts, and other items used in all of the scenarios;
together with detailed step by step instructions on accomplishing each task in the scenarios.
To explore these labs, download them from http://wag.codeplex.com/releases/view/103405 and follow these
steps:
1. Save the download file to your hard drive in a temporary folder. The samples are packaged as a zip file.
2. Unzip the contents into your chosen destination folder. This must not be a folder in your profile. Use a
folder near the root of your drive with a short name, such as C:\Labs, to avoid issues with the length of
filenames.
3. See the document named Introduction and Setup in the root folder of the downloaded files for
instructions on installing the prerequisite software for the labs.

To experiment with HDInsight you must have a Windows Azure subscription. You will need a Microsoft account
to sign up for a Windows Azure subscription. You can then create an HDInsight cluster using the Windows Azure
Management Portal. For a detailed explanation of the other prerequisites for the scenarios described in this guide,
see the Setup and Installation document in the root folder of the Hands-on Labs associated with this guide.
Who's Who
This guide explores the use of Big Data solutions using Windows Azure HDInsight. A panel of experts comments
on the design and development of the solutions for the scenarios discussed in the guide. The panel includes a
business analyst, a data steward, a software developer, and an IT professional. The delivery of the scenario
solutions can be considered from each of these points of view. The following table lists these experts.


Bharath is a systems architect who specializes in Big Data solutions and BI integration. He ensures that
the solutions he designs will meet the organization's needs, work reliably and efficiently, and provide
tangible benefits. He is a cautious person, for good reasons.
"Designing Big Data solutions can be a challenge, but the benefits this technology offers can help you to
understand more about your company, products, and customers; and help you plan for the future."

Jana is a business analyst. She is responsible for planning the organization's business intelligence
systems, and managing the quality of the data held in the organization's databases and data warehouse.
She uses her deep knowledge of corporate data schemas to oversee the validation of the data, and
correction of errors such as removing duplication and enforcing common terminology, to ensure that it
provides the correct results.
"Business intelligence is a core part of all corporations today. It's vital that they provide accurate, timely,
and useful information, which means we must be able to trust the data we use."

Poe is an IT professional and database administrator whose job is to plan and manage the
implementation of all aspects of the infrastructure for his organization. He is responsible for implementing
the global infrastructure and database systems, including both on-premises and cloud based systems and
services, so that they support the data management, analytics, and business intelligence mechanisms the
organization requires.
"I need to implement and manage our systems in a way that provides the best performance, security, and
management capabilities possible while making sure our infrastructure, data storage systems, and
applications are reliable and meet the needs of our users."

Markus is a database and application developer. He is an expert in programming for database systems
and building applications, both in the cloud and in the corporate datacenter. He is analytical, detail-
oriented, and methodical; and focused on the task at hand, which is building successful business
solutions and analytics systems.
"For the most part, a lot of what we know about database programming and software development can be
applied to Big Data solutions. But there are always special considerations that are very important."

If you have a particular area of interest, look for notes provided by the specialists whose interests align with
yours.
Chapter 1: What is Big Data?
This chapter discusses the general concepts of Big Data. It describes the ways that Big Data solutions, and
Microsoft's HDInsight for Windows Azure, differ from traditional data management and business intelligence (BI)
techniques. The chapter also sets the scene for the guide and provides pointers to the subsequent chapters that
describe the implementation of Big Data solutions suited to different scenarios, and to meet a range of data
analysis requirements.
What Is Big Data?
The term Big Data is being used to encompass an increasing range of technologies and techniques. In essence,
Big Data is data that is valuable but, traditionally, was not practical to store or analyze due to limitations of cost
or the absence of suitable mechanisms. Typically it refers to collections of datasets that, due to their size and
complexity, are difficult to store, query, and manage using existing data management tools or data processing
applications.
You can also think of Big Data as data, often produced at fire hose rate, that you don't know how to analyze at
the moment but which may prove valuable in the future. Big Data solutions aim to provide data storage and
querying functionality for situations such as this. They offer a mechanism for organizations to extract
meaningful, useful, and often vital information from the vast stores of data they are collecting.
Big Data is often described as a solution to the three V's problem:
Variety: It's not uncommon for 85% of new data to not match any existing data schema. It may also be
semi-structured or unstructured data. This means that applying schemas to the data before or during
storage is no longer a practical option.
Volume: Big Data solutions typically store and query hundreds of Terabytes of data, and the total
volume is probably growing by ten times every five years. Storage must be able to manage this volume,
be easily expandable, and work efficiently across distributed systems.
Velocity: Data is being collected from many new types of devices from a growing number of users, and
an increasing number of devices and applications per user. The design and implementation of storage
must happen quickly and efficiently.

The quintessential aspect of Big Data is not the data itself; it's the ability to discover useful
information hidden in the data. It's really all about the analytics that a Big Data solution can
empower.
Many people consider Big Data solutions to be a new way to do data warehousing when the volume of data
exceeds the capacity or cost limitations for relational database systems. However, it can be difficult to fully grasp
what Big Data solutions really involve, what hardware and software they use, and how and when they are
useful. There are some basic questions, the answers to which will help you understand where Big Data solutions
are useful - and how you might approach the topic in terms of implementing your own solutions. We'll work
through these questions now.
Why Do I Need a Big Data Solution?
In the most simplistic terms, organizations need a Big Data solution to enable them to survive in a rapidly
expanding and increasingly competitive market. For example, they need to know how their products and
services are perceived in the market, what customers think of the organization, whether advertising campaigns
are working, and which facets of the organization are (or are not) achieving their aims.
Organizations typically collect data that is useful for generating BI reports, and to provide input for management
decisions. However, organizations are increasingly implementing mechanisms that collect other types of data
such as sentiment data (emails, comments from web site feedback mechanisms, and tweets that are related
to the organization's products and services), click-through data, information from sensors in users' devices (such
as location data), and website log files.
Successful organizations typically measure performance by discovering the customer value that each
part of their operation generates. Big Data solutions provide a way to help you discover value, which often
cannot be measured just through traditional business methods such as cost and revenue analysis.
These vast repositories of data often contain useful, and even vital, information that can be used for product
and service planning, coordinating advertising campaigns, improving customer service, or as an input to
reporting systems. This information is also very useful for predictive analysis such as estimating future
profitability in a financial scenario, or for an insurance company to predict the possibility of accidents and
claims. Big Data solutions allow you to store and extract all this information, even if you don't know when or
how you will use the data at the time you are collecting it.
In addition, many organizations need to handle vast quantities of data as part of their daily operations. Examples
include financial and auditing data, and medical data such as patients' notes. Processing, backing up, and
querying all of this data becomes more complex and time consuming as the volume increases. Big Data solutions
are designed to store vast quantities of data on distributed servers with automatic generation of replicas to
guard against data loss, together with mechanisms for performing queries on the data to extract the information
the organization requires.
What Problems Do Big Data Solutions Solve?
Big Data solutions were initially seen primarily as a way to resolve the limitations of traditional database
systems in handling the volume, variety, and velocity associated with Big Data; they are built to store and process
hundreds of Terabytes, or even Petabytes, of data.
This isn't to say that relational databases have had their day. Continual development of the hardware and
software for this core business function provides capabilities for storing very large amounts of data; for example
Microsoft Parallel Data Warehouse can store hundreds of Terabytes of data. However, Big Data solutions are
optimized for the storage of huge volumes of data in a way that can dramatically reduce storage cost, while still
being able to generate BI and comprehensive reports.

Big Data solutions typically target scenarios where there is a huge volume of unstructured or semi-
structured data that must be stored and queried to extract business intelligence. Typically, 85% of data currently
stored in Big Data solutions is unstructured or semi-structured.
In terms of the variety of data, organizations often collect unstructured data, which is not in a format that suits
relational database systems. Some data, such as web server logs and responses to questionnaires, may be
preprocessed into the traditional row and column format. However, data such as emails, web site comments
and feedback, and tweets are semi-structured or even unstructured data. Deciding how to store this data using
traditional database systems is problematic, and may result in loss of useful information if the data must be
constrained to a specific schema when it is stored.
Finally, the velocity of Big Data may make storage in an enterprise data warehouse problematic, especially
where formal preparation processes such as examining, conforming, cleansing, and transforming the data must
be accomplished before it is loaded into the data warehouse tables.
The combination of all these factors means that a Big Data solution may be a more practical proposition than a
traditional relational database system in some circumstances. However, as Big Data solutions have continued to
evolve, it has become clear that they can also be used in a fundamentally different context: to quickly get
insights into data, and to provide a platform for further investigation in a way that just isn't possible with
traditional data storage, management, and querying tools.
Figure 1 demonstrates how you might go from a semi-intuitive guess at the kind of information that might be
hidden in the data, to a process that incorporates that information into your business process.

Figure 1
Big Data solutions as an experimental data investigation platform
As an example, you may want to explore the postings by users of a social website to discover what they are
saying about your company and its products or services. Using a traditional BI system would mean waiting for
the database architect and administrator to update the schemas and models, cleanse and import the data, and
design suitable reports. But it's unlikely that you'll know beforehand if the data is actually capable of providing
any useful information, or how you might go about discovering it. By using a Big Data solution you can
investigate the data by asking any questions that may seem relevant. If you find one or more that provide the
information you need you can refine the queries, automate the process, and incorporate it into your existing BI
systems.
Big Data solutions aren't all about business topics such as customer sentiment or
web server log file analysis. They have many diverse uses around the world and across all
types of application. Police forces are using Big Data techniques to predict crime patterns,
researchers are using them to explore the human genome, particle physicists are using
them to search for information about the structure of matter, and astronomers are using
them to plot the entire universe. Perhaps the last of these really is a big Big Data solution!
How Is a Big Data Solution Different from Traditional Database Systems?
Traditional database systems typically use a relational model where all the data is stored using predetermined
schemas, and linked using the values in specific columns of each table. There are some more flexible
mechanisms, such as the ability to store XML documents and binary data, but the capabilities for handling these
types of data are usually quite limited.
Big Data solutions do not force a schema onto the stored data. Instead, you can store almost any type of
structured, semi-structured, or unstructured data and then apply a suitable schema only when you query this
data.
Big Data solutions are optimized for storing vast quantities of data using simple file formats and highly
distributed storage mechanisms. Each distributed node is also capable of executing parts of the queries that
extract information. Whereas a traditional database system would need to collect all the data from storage and
move it to a central location for processing, with the consequent limitations of processing capacity and network
latency, Big Data solutions perform the initial processing of the data at each storage node. This means that the
bulk of the data does not need to be moved over the network for processing (assuming that you have already
loaded it into the cluster storage), and it requires only a fraction of the central processing capacity to execute a
query.
Relational database systems require a schema to be applied when data is written, which may mean that
some information hidden in the data is lost. Big Data solutions store the data in its raw format and apply a
schema only when the data is read, which preserves all of the information within the data.
Figure 2 shows some of the basic differences between a relational database system and a Big Data solution in
terms of storing and querying data. Notice how both relational databases and Big Data solutions use a cluster of
servers; the main difference is where query processing takes place and how the data is moved across the
network.

Figure 2
Some differences between relational databases and Big Data solutions
Modern data warehouse systems typically use high speed fiber networks, and in-memory caching and indexes,
to minimize data transfer delays. However, in a Big Data solution only the results of the distributed query
processing are passed across the cluster network to the node that will assemble them into a final results set.
Performance during the initial stages of the query is limited only by the speed and capacity of connectivity to the
co-located disk subsystem, and this initial processing occurs in parallel across all of the cluster nodes.
The servers in a cluster are typically co-located in the same datacenter and connected over a
low-latency, high-bandwidth network. However, Big Data solutions can work well even
without a high capacity network, and the servers can be more widely distributed, because
the volume of data moved over the network is much less than in a traditional relational
database cluster.
The ability to work with highly distributed data and simple file formats also opens up opportunities for more
efficient and more comprehensive data collection. For example, services and applications can store data in any
of the predefined distributed locations without needing to preprocess it or execute queries that can absorb
processing capacity. Data is simply appended to the files in the data store. Any processing required on the data
is done when it is queried, without affecting the original data and risking losing valuable information.
The following table summarizes the major differences between a Big Data solution and existing relational
database systems.

Feature | Relational database systems | Big Data solutions
Data types and formats | Structured | Semi-structured and unstructured
Data integrity | High - transactional updates | Low - eventually consistent
Schema | Static - required on write | Dynamic - optional on read and write
Read and write pattern | Fully repeatable read/write | Write once, repeatable read
Storage volume | Gigabytes to terabytes | Terabytes, petabytes, and beyond
Scalability | Scale up with more powerful hardware | Scale out with additional servers
Data processing distribution | Limited or none | Distributed across cluster
Economics | Expensive hardware and software | Commodity hardware and open source software

What Technologies Do Big Data Solutions Encompass?
Having talked in general terms about Big Data, it's time to be more specific about the hardware and software
that it encompasses. The core of Big Data is an open source technology named Hadoop. Hadoop was developed
by a team at the Apache Software Foundation. It is commonly understood to contain three main assets:
The Hadoop kernel library that contains the common routines and utilities.
The distributed file system for storing data files and managing clusters; usually referred to as HDFS
(Hadoop Distributed File System).
The map/reduce library that provides the framework to run queries against the data.


In addition there are many other open source components and tools that can be used with Hadoop. The Apache
Hadoop website lists the following:
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
HCatalog: A table and storage management service for data created using Apache Hadoop.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper: A high-performance coordination service for distributed applications.

We'll be exploring and using some of these components in the scenarios described in subsequent chapters of
this guide. For more information about all of them, see the Apache Hadoop website.
However, this guide concentrates on Microsoft's implementation of a Big Data solution, named HDInsight, and
specifically on the cloud-based implementation of HDInsight in Windows Azure. You'll find a more detailed
description of HDInsight later in this chapter. You will also see some general discussion of Big Data and Hadoop
technologies and techniques throughout the rest of this guide.
How Does a Big Data Solution Work?
Big Data solutions are based on an essentially simple process: storing data as multiple replicated disk files on a
distributed set of servers, and executing a multistep query that runs on each server in the cluster to extract the
results from each one and then combine these into the final results set.
The Data Cluster
Big Data solutions use a cluster of servers to store and process the data. Each member server in the cluster is
called a data node, and contains a data store and a query execution engine. The cluster is managed by a server
called the name node, which has knowledge of all the cluster servers and the files stored on each one. The name
node server does not store any data, but is responsible for allocating data to the other cluster members and
keeping track of the state of each one by listening for heartbeat messages.
To store incoming data, the name node server directs the client to the appropriate data node server. The name
node also manages replication of data files across all the other cluster members, which communicate with each
other to replicate the data. The data is divided into chunks and three copies of each chunk are stored across
the cluster servers in order to provide resilience against failure and data loss (the chunk size and the number of
replicated copies are configurable for the cluster).
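
To make this more concrete, the following minimal Java sketch uses the standard Hadoop FileSystem API to write a small file into the cluster's distributed store. The file path is purely illustrative, and the two configuration properties shown are the commonly used Hadoop names for the replication factor and chunk (block) size, which can vary between Hadoop versions; on an HDInsight cluster these settings, and the underlying blob-based storage, are managed for you.

import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadSample {
    public static void main(String[] args) throws Exception {
        // The client asks the name node where to store the file; the data itself
        // is then streamed to the data nodes that the name node selects.
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");        // number of copies of each chunk (block)
        conf.set("dfs.block.size", "67108864");  // chunk size in bytes (64 MB here)

        FileSystem fs = FileSystem.get(conf);
        OutputStream out = fs.create(new Path("/data/input/sample.csv"));
        try {
            out.write("A,3\nB,5\nA,4\n".getBytes("UTF-8"));
        } finally {
            out.close();  // closing the stream finalizes the file in the distributed store
        }
    }
}
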

It's vital to maintain a backup of the name node server in a cluster. Without it, all of the cluster data may
be lost. Windows Azure HDInsight stores the data in blob storage to protect against this kind of disaster.
In some implementations there is only a single name node server, which results in a single point of failure for
the entire cluster. If the name node server fails, all knowledge of the location of the cluster servers and data
may be lost. To avoid this, Big Data implementations usually contain a replication or backup feature for the
name node server (often referred to as the secondary name node), which allows the cluster to survive a failure.
The Data Store
The data store running on each server in a cluster is a suitable distributed storage service such as HDFS or a
compatible equivalent; Windows Azure HDInsight provides a compatible service over Windows Azure blob
storage. A core principle of Big Data implementations is that data is co-located on the same server (or in the
same datacenter) as the query execution engine that will process it in order to minimize bandwidth use and
maximize query performance.
Windows Azure HDInsight (and some other virtualized environments) does not completely follow this principle
because the data is stored in the datacenter storage cluster that is co-located with virtualized servers.
The data store in a Big Data implementation is usually referred to as a NoSQL store, although this is not
technically accurate because some implementations do support a SQL-like query language. In fact, some people
prefer to use the term Not Only SQL for just this reason. There are more than 120 NoSQL data store
implementations available at the time of writing, but they can be divided into the following four basic
categories:
Key/value stores. These are data stores that hold data as a series of key/value pairs. The value may be a
single data item or a complex data structure. There is no fixed schema for the data, and so these types
of data store are ideal for unstructured data. An example of a key/value store is Windows Azure table
storage, where each row has a key and a property bag containing one or more values. Key/value stores
can be persistent or volatile.
Document stores. These are data stores optimized to store structured, semi-structured, and
unstructured data items such as JSON objects, XML documents, and binary data. An example of a
document store is Windows Azure blob storage, where each item is identified by a blob name within a
virtual folder structure.
Wide column or Column Family data stores. These are data stores that do use a schema, but the schema
can contain families of columns rather than actual columns. They are ideally suited to storing semi-
structured data where some columns can be predefined but others are capable of storing differing
elements of unstructured data. HBase running on HDFS is an example.
Graph data stores. These are data stores that hold the relationships between objects. They are less
common than the other types of data store, and tend to have only specialist uses.

NoSQL storage is typically much cheaper than relational storage, and usually supports a Write Once capability that only allows data to be appended. To update data in these stores you must drop and recreate the
relevant file. This limitation maximizes throughput; Big Data storage implementations are usually measured by
throughput rather than capacity because this is usually the most significant factor for both storage and query
efficiency.
Modern data handling techniques, such as Command Query Responsibility Segregation (CQRS) and other patterns, do not encourage updates to data. Instead, new data is added
and milestone records are used to fix the current state of the data at intervals. This
approach provides better performance and maintains the history of changes to the data.
For more information about CQRS see the patterns & practices guide CQRS Journey.
The Query Mechanism
Big Data queries are based on a distributed processing mechanism called MapReduce that provides optimum
performance across the servers in a cluster. Figure 3 shows a high-level overview of an example of this process
that sums the values of the properties of each data item that has a key value of A.


Figure 3
A high level view of the Big Data process for storing data and extracting information

For a detailed description of the MapReduce framework, see MapReduce.org and Wikipedia.
The query uses two components, usually written in Java, which implement algorithms that perform a two-stage
data extraction and rollup process. The Map component runs on each data node server in the cluster extracting
data that matches the query, and optionally applying some processing or transformation to the data to acquire
the required result set from the files on that server.
The Reduce component runs on one of the data node servers, and combines the results from all of the Map
components into the final results set.
Depending on the configuration of the query job, there may be more than one Reduce task running. The output
from each Map task is stored in a common buffer, sorted, and then passed to one or more Reduce tasks.
Intermediate results are stored in the buffer until the final Reduce task combines them all.
Although the core Hadoop engine requires the Map and Reduce components it executes to be written in Java,
you can use other techniques to create them in the background without writing Java code. For example, you can use the tools named Hive and Pig that are included in most Big Data frameworks to write queries in a SQL-like or a high-level language. You can also use the Hadoop streaming API to execute components written in other languages; see Hadoop Streaming on the Apache website for more details.
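To make the Map and Reduce stages more concrete, the following C# console application is a minimal sketch of the process shown in Figure 3, written for the Hadoop streaming interface mentioned above rather than as native Java components. It assumes the input is tab-delimited lines in the form key<TAB>value, and that the same executable is registered as both the mapper and the reducer (with a "map" or "reduce" argument) when the streaming job is defined; the job configuration itself is not shown.

    using System;

    // Hadoop streaming sketch: sums the values for each key (for example, all items with key "A").
    // Both stages read lines from stdin and write key<TAB>value lines to stdout.
    public class StreamingSum
    {
        public static void Main(string[] args)
        {
            if (args.Length > 0 && args[0] == "reduce") Reduce(); else Map();
        }

        // Map stage: emit a (key, value) pair for every input line that parses correctly.
        static void Map()
        {
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                var parts = line.Split('\t');
                double value;
                if (parts.Length >= 2 && double.TryParse(parts[1], out value))
                {
                    Console.WriteLine("{0}\t{1}", parts[0], value);
                }
            }
        }

        // Reduce stage: the streaming framework delivers the mapper output sorted by key,
        // so a running total can be written out each time the key changes.
        static void Reduce()
        {
            string currentKey = null;
            double total = 0;
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                var parts = line.Split('\t');
                if (parts.Length < 2) continue;
                if (currentKey != null && parts[0] != currentKey)
                {
                    Console.WriteLine("{0}\t{1}", currentKey, total);
                    total = 0;
                }
                currentKey = parts[0];
                total += double.Parse(parts[1]);
            }
            if (currentKey != null) Console.WriteLine("{0}\t{1}", currentKey, total);
        }
    }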
The following chapters of this guide contain more details of the way that Big Data solutions work, and guidance
on using them. For example, they discuss topics such as choosing a data store, loading the data, the common
techniques for querying data, and optimizing performance. The guide also describes the end-to-end process for
using Big Data frameworks, and patterns such as ETL (extract-transform-load) for generating BI information.
What Are the Limitations of Big Data Solutions?
Modern relational databases are typically optimized for fast and efficient query processing using techniques
such as Structured Query Language (SQL), whereas Big Data solutions are optimized for reliable storage of vast
quantities of data. The typically unstructured nature of the data, the lack of predefined schemas, and the
distributed nature of the storage in Big Data solutions often preclude major optimization for query performance.
Unlike SQL queries, which can use intelligent optimization techniques to maximize query performance, Big Data
queries typically require an operation similar to a table scan.
Therefore, Big Data queries are typically batch operations that, depending on the data volume and query
complexity, may take considerable time to return a final result. However, when you consider the volumes of
data that Big Data solutions can handle, the fact that queries run as multiple tasks on distributed servers does
offer a level of performance that cannot be achieved by other methods. Unlike most SQL queries used with
relational databases, Big Data queries are typically not executed repeatedly as part of an application's execution,
and so batch operation is not a major disadvantage.

Big Data queries are batch operations that may take some time to execute. It's possible to perform real-
time queries, but typically you will run the query and store the results for use within your existing BI tools and
analytics systems.
Will a Big Data Solution Replace My Relational Database?
The relational database systems we use today and the more recent Big Data solutions are complementary
mechanisms; Big Data is unlikely to replace the existing relational database. In fact, in the majority of cases, it
complements and augments the capabilities for managing data and generating BI.


Figure 4
Combining Big Data with a relational database
For example, it's common to use a Big Data query to generate a result set that is then stored in a relational
database for use in the generation of BI, or as input to another process. Big Data is also a valuable tool when you
need to handle data that is arriving very quickly, and which you can process later. You can dump the data into
the Big Data storage cluster in its original format, and then process it on demand using a Big Data query that
extracts the required result set and stores it in a relational database, or makes it available for reporting. Figure 4
shows this approach in schematic form.
Chapter 2 of this guide explores and demonstrates in more depth the integration of a Big Data solution with
existing data analysis tools and enterprise BI systems. It describes the three main stages of integration, and
explains the benefits of each approach in terms of maximizing the usefulness of your Big Data implementations.
Is Big Data the Right Solution for Me?
The first step in evaluating and implementing any business policy, whether it's related to computer hardware,
software, replacement office furniture, or the contract for cleaning the windows, is to determine the results that
you hope to achieve. Adopting a Big Data solution is no different.
The result you want from a Big Data solution will typically be better information that helps you to make data-
driven decisions about the organization. However, to be able to get this information, you must evaluate several
factors such as:
Where will the source data come from? Perhaps you already have the data that contains the
information you need, but you can't analyze it with your existing tools. Or is there a source of data you think will be useful, but you don't yet know how to collect it, store it, and analyze it?
What is the format of the data? Is it highly structured, in which case you may be able to load it into
your existing database or data warehouse and process it there? Or is it semi-structured or unstructured,
in which case a Big Data solution that is optimized for textual discovery, categorization, and predictive
analysis will be more suitable?
What are the delivery and quality characteristics of the data? Is there a huge volume? Does it arrive as
a stream or in batches? Is it of high quality or will you need to perform some type of data cleansing and
validation of the content?
Do you want to combine the results with data from other sources? If so, do you know where this data
will come from, how much it will cost if you have to purchase it, and how reliable this data is?
Do you want to integrate with an existing BI system? Will you need to load the data into an existing
database or data warehouse, or will you just analyze it and visualize the results separately?

The answers to these questions will help you decide whether a Big Data solution is appropriate. As you saw
earlier in this chapter, Big Data solutions are primarily suited to situations where:
You have large volumes of data to store and process.
The data is in a semi-structured or unstructured format, often as text files or binary files.
The data is not well categorized; for example, similar items are described using different terminology
such as a variation in city, country, or region names, and there is no obvious key value.
The data arrives rapidly as a stream, or in large batches that cannot be processed in real time, and so
must be stored efficiently for processing later as a batch operation.
The data contains a lot of redundancy or duplication.
The data cannot easily be processed into a format that suits existing database schemas without risking
loss of information.
You need to execute complex batch jobs on a very large scale, so that running the jobs in parallel is
necessary.
You want to be able to easily scale the system up or down on demand.
You don't actually know how the data might be useful, but you suspect that it will be, either now or in the future.

If you decide that you do need a Big Data solution, the next step is to evaluate and choose a platform. There are
several you can choose from, some of which are delivered as cloud services and some that you run on your own
on-premises or hosted hardware. In this guide you will see solutions implemented using Microsoft's cloud-hosted solution, Windows Azure HDInsight.
Choosing a cloud-hosted Big Data platform makes it easy to scale out or scale in your solution without incurring the cost of new hardware, or leaving existing hardware underused.
What Is Microsoft HDInsight?
Microsoft's Big Data solution, called HDInsight, is an implementation based on Hadoop. HDInsight is 100% Apache Hadoop compatible, is built on open source components in conjunction with Hortonworks, and is compatible with open source community distributions. All of the components are tested in typical scenarios to ensure that they work together correctly, and that there are no versioning or compatibility issues. Developments in HDInsight are fed back into the community through Hortonworks to maintain compatibility and to support the open source effort.
At the time this guide was written, HDInsight was available in two forms:
Windows Azure HDInsight. This is a service available to Windows Azure subscribers that uses Windows
Azure clusters, and integrates with Windows Azure storage. An ODBC connector is available to connect
the output from HDInsight queries to data analysis tools.
Microsoft HDInsight Developer Preview. This is a single node development environment framework
that you can install on most versions of Windows.

You can obtain a full multi-node version of Hadoop from Hortonworks that runs on Windows Server. For more
information see Microsoft + Hortonworks.
This guide is based on the May 2013 release of HDInsight on Windows Azure. Later releases
of HDInsight may differ from the version described in this guide. For more information, see
Microsoft Big Data. To sign up for the Windows Azure service, go to HDInsight service home
page.
Figure 5 shows the main component parts of HDInsight.


Figure 5
The main component parts of Microsoft HDInsight
For an up-to-date list of the components included with HDInsight and the specific versions, see What version of
Hadoop is in Windows Azure HDInsight?
HDInsight supports queries and MapReduce components written in C# and JavaScript; and also includes a
dashboard for monitoring clusters and creating alerts.
The locally installable package that you can use for developing, debugging, and testing your Big Data solutions
runs on a single Windows computer and integrates with Visual Studio. You can also download the Microsoft
.NET SDK For Hadoop from Codeplex or install it with NuGet. At the time of writing this contained a .NET
MapReduce implementation, LINQ to Hive, PowerShell cmdlets, web clients for WebHDFS and WebHCat, and clients for Oozie and Ambari.
For local development of HDInsight solutions you can download and install a single node version of
HDInsight Server that will run on most versions of Windows. This allows you to develop and test your solutions
locally before you upload them to Windows Azure HDInsight. You can download the Microsoft HDInsight
Developer Preview version from the Microsoft Download Center.
The Data Store and Execution Platform
Big Data solutions store data as a series of files located within a familiar folder structure on disk. In Windows
Azure HDInsight these files are stored in Windows Azure blob storage. In the developer preview version the files
are stored on the local computer's disk in a virtual store.
The execution platform is common across both Windows Azure HDInsight and the developer preview version,
and so working with it is a similar experience. The Windows Azure HDInsight portal and the developer preview
desktop provide an interactive environment in which you run tools such as Hive and Pig.
You can use the Sqoop connector to import data from a SQL Server database or Windows Azure SQL Database
into your cluster. Alternatively you can upload the data using the JavaScript interactive console, a command
window, or any other tools for accessing Windows Azure blob storage.
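As a simple illustration of the last of these options, the following C# sketch uses the Windows Azure storage client library to upload a local file to the blob container associated with the cluster, where it becomes available to your queries. The connection string, container name, and file paths are hypothetical placeholders; in practice you would use the storage account and container that you specified when the cluster was created.

    using System.IO;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;

    public class UploadSourceData
    {
        public static void Main()
        {
            // Hypothetical values; use the storage account and container linked to your cluster.
            var account = CloudStorageAccount.Parse("<your-storage-connection-string>");
            CloudBlobContainer container = account.CreateCloudBlobClient()
                                                  .GetContainerReference("<your-cluster-container>");

            // The blob name becomes the path that Hive, Pig, or map/reduce jobs will reference.
            CloudBlockBlob blob = container.GetBlockBlobReference("input/sensor-data.txt");
            using (FileStream stream = File.OpenRead(@"C:\data\sensor-data.txt"))
            {
                blob.UploadFromStream(stream);
            }
        }
    }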
Chapter 3 of this guide provides more details of how you can upload data to an HDInsight cluster.
The Query and Analysis Tools
HDInsight contains implementations of the Hive and Pig tools. Hive allows you to overlay a schema onto the data
when you need to run a query, and use a SQL-like language for these queries. For example, you can use the
CREATE TABLE command to build a table by splitting the text strings in the data using delimiters or at specific
character locations, and then execute SELECT statements to extract the required data.
Pig allows you to create schemas and execute queries using Java or by writing queries in a high level language
called Pig Latin. It supports a fluent interface in the JavaScript console, which means that you can write queries
such as from("input.txt", "date, product, qty, price").select("product, price").run() .
HDInsight also supports writing map and reduce components that directly execute within the Hadoop core. As
an alternative to using Java to create these components you can use the Hadoop streaming interface to execute
map and reduce components written in C# and F#, and you can use JavaScript in Pig queries.
You can, of course, use other open source tools such as Mahout with HDInsight, even though they are not
included as part of an HDInsight deployment. Mahout allows you to perform data mining queries that examine
data files to extract specific types of information. For example, it supports recommendation mining (finding
users' preferences from their behavior), clustering (grouping documents with similar topic content), and
classification (assigning new documents to a category based on existing categorization).
For more information about using HDInsight, a good place to start is the Microsoft TechNet library. You can see
a list of articles related to HDInsight by searching the library using this URL:
http://social.technet.microsoft.com/Search/en-US?query=hadoop.
Blogs and forums for HDInsight include http://prologika.com/CS/blogs/blog/archive/tags/Hadoop/default.aspx
and http://social.msdn.microsoft.com/Forums/en-US/hdinsight/threads.
The remaining chapters of this guide provide more details of how you can perform queries and data analysis.
HDInsight and the Microsoft Data Platform
It may appear from what you've seen so far that HDInsight is a separate, stand-alone system for storing and analyzing large volumes of data. However, this is definitely not the case; it is a significant component of the Microsoft data platform, and part of the overall data acquisition, management, and visualization strategy.

Figure 6
The Role of HDInsight as part of the Microsoft Data Platform
Figure 6 shows an overview of the Microsoft data platform, and the role HDInsight plays within this. This figure
does not include all of Microsoft's data-related products, and it doesn't attempt to show physical data flows. Instead, it illustrates the applications, services, tools, and frameworks that work together to allow you to capture data, store it, and visualize the information it contains. As an example, consider the case where you want to
simply load some unstructured data into Windows Azure HDInsight, combine it with data from external sources
such as Windows Data Market, and then analyze and visualize the results using Microsoft Excel and Power View.
Figure 7 shows this approach, and how it maps to elements of the Microsoft data platform.

Figure 7
Simple data querying, analysis, and visualization using Windows Azure HDInsight
However, you may decide that you want to implement deeper integration of the data you store and query in
HDInsight with your enterprise BI system. Figure 8 shows an example where streaming data collected from
device sensors is fed through StreamInsight for categorization and filtering, and then dumped into a Windows
Azure HDInsight cluster. The output from queries that run as periodic batch jobs in HDInsight is integrated
at the corporate data model level with a data warehouse, and ultimately delivered to users through SharePoint
libraries and web parts, making it available for use in reports and in data analysis and visualization tools such as
Microsoft Excel.

Figure 8
Integrating HDInsight with corporate models in an enterprise BI system
Alternatively, you may want to integrate HDInsight as a business data source for an existing data warehouse
system, perhaps to produce a set of specific management reports on a regular basis. Figure 9 shows how you
can take semi-structured or unstructured data and query it with HDInsight, validate and cleanse the results using
Data Quality Services, and store them in your data warehouse tables ready for use in reports.

Figure 9
Querying, validating, cleansing, and storing data for use in reports
These are just three examples of the countless permutations and capabilities of the Microsoft data platform and
HDInsight. Your own requirements will differ, but the combination of services and tools makes it possible to
implement almost any kind of Big Data solution using the elements of the platform. In the following chapters of
this guide you will see many examples of the way that these tools and services work together.
In the following chapters you will learn about the technical implementation of a Big Data
solution using HDInsight, including options for storing the data, the tools for performing
queries, and the technicalities of connecting the data to BI systems and visualization tools.
Where Do I Go Next?
The remainder of this guide explores Big Data and HDInsight from several different perspectives. You may wish
to read the chapters in order to see how to design, implement, manage, and automate a Big Data solution.
However, if you have a particular area of interest, you may prefer to jump directly to the appropriate chapter.
Use the following table to help you decide.
Architecture and design of a Big Data solution, and an overview of the development life cycle: see Chapter 2, Getting Started with HDInsight.
Patterns and implementations for loading data into a Big Data solution, including techniques for data compression: see Chapter 3, Collecting and Loading Data.
Using the query tools, map/reduce components, and user-defined functions; evaluating the results; and performance tuning your queries: see Chapter 4, Performing Queries and Transformations.
Loading and using the results of an HDInsight process in visualization and analysis tools, data stores, and BI systems: see Chapter 5, Consuming and Visualizing the Results.
Debugging and testing Big Data queries, and using tools and frameworks to automate execution of HDInsight processes: see Chapter 6, Debugging, Testing, and Automating HDInsight Solutions.
Using the Hadoop and HDInsight monitoring tools, and managing and automating HDInsight administrative tasks: see Chapter 7, Managing and Monitoring HDInsight.
Using HDInsight as a tool to explore data and find previously unknown information and insights: see Chapter 8, Scenario 1: Iterative Data Exploration.
Using HDInsight to perform Extract, Transform, and Load operations to validate and cleanse incoming data: see Chapter 9, Scenario 2: Extract, Transform, Load.
Using HDInsight as a commodity data store or as an online data warehouse to minimize on-premises costs: see Chapter 10, Scenario 3: HDInsight as a Data Warehouse.
Integrating an HDInsight solution with an enterprise BI system at each of the three typical integration levels: see Chapter 11, Scenario 4: Enterprise BI Integration.
Implementing the three common patterns for data ingestion in a Big Data solution: see Appendix A, Data Ingestion Patterns using the Twitter API.
Overview of the structure and usage of an enterprise BI system, as background to the scenarios in the guide: see Appendix B, Enterprise BI Integration.
Summary
In this introductory chapter you have discovered what a Big Data solution is, why you may want to implement
one, and how it is different from the use of traditional relational databases and business intelligence tools. You
have seen an overview of how Big Data solutions work, and a brief description of the different frameworks and
platforms you can choose from.
This guide focuses on Microsoft's cloud-based Big Data implementation, called Windows Azure HDInsight.
HDInsight uses a 100% compatible open source implementation of the core Hadoop library and tools, as well as
providing additional capabilities for writing queries in a range of languages.
Finally, the chapter introduced some of the typical scenarios for using a Big Data solution, and showed how Big
Data solutions that use HDInsight fit into the Microsoft data platform. The scenarios are covered in detail in
subsequent chapters of the guide. However, before then, the following chapter shows how to get started using
Windows Azure HDInsight.
More Information
The official site for Apache Big Data solutions and tools is the Apache Hadoop website.
For a detailed description of the MapReduce framework and programming model, see MapReduce.org and
Wikipedia.
For an overview and description of Microsoft HDInsight see Microsoft Big Data.
To sign up for the Windows Azure service, go to HDInsight Service.
You can download the HDInsight developer preview from the Microsoft Download Center.
The Microsoft .NET SDK For Hadoop is available from Codeplex or you can install it with NuGet.
The Microsoft TechNet library contains articles related to HDInsight. Search for these using the URL
http://social.technet.microsoft.com/Search/en-US?query=hadoop.
The official support forum for HDInsight is at http://social.msdn.microsoft.com/Forums/en-
US/hdinsight/threads.
There are also several popular blogs that cover Big Data and HDInsight topics:
http://prologika.com/CS/blogs/blog/archive/tags/Hadoop/default.aspx
http://dennyglee.com/
http://blogs.msdn.com/b/hpctrekker/


Chapter 2: Getting Started with HDInsight
In the previous chapter you learned how Big Data solutions such as Microsoft HDInsight can help you discover vital information that may otherwise have remained hidden in your data, or even lost forever. This information can help you to evaluate your organization's historical performance, discover new opportunities, identify operational efficiencies, increase customer satisfaction, and even predict likely outcomes for the future. It's not surprising that Big Data is generating so much interest and excitement in what some may see as the rather boring world of data management!
In this chapter the focus moves towards the practicalities of implementing a Big Data solution by considering in
more detail how you go about this. This means you need to think about what you hope to achieve from the
implementation, even if your aim is just to explore the kinds of information you might be able to extract from
available data, and how HDInsight fits into your existing business infrastructure. It may be that you just want to
use it alongside your existing business intelligence (BI) systems, or you may want to deeply integrate it with
these systems. The important point is that, irrespective of how you choose to use it, the end result is the same:
some kind of analysis of the source data and meaningful visualization of the results.
This chapter begins with an overview of the main stages of implementing a Big Data solution. It then explores
the decisions you should make before you start, such as identifying your goals and the source data you might
use, before moving on to describe the basic architectural approaches that you might choose to implement.

Many organizations already use data to improve decision making through existing BI solutions that
analyze data generated by business activities and applications, and create reports based on this analysis. Rather
than seeking to replace traditional BI solutions, HDInsight provides a way to extend the value of your investment
in BI by enabling you to incorporate a much wider variety of data sources that complement and integrate with
existing data warehouses, analytical data models, and business reporting solutions.
An Overview of the Big Data Process
Big Data implementation typically involves a common series of stages, irrespective of the type of data and the
ultimate aims for obtaining information from that data. The stages are:
Determine the analytical goals and source data. Before you start any data analysis project, it is useful to
be clear about what it is you hope to achieve. You may have a specific question that you need to answer
in order to make a critical business decision; in which case you must identify data that may help you
determine the answer, where it can be obtained from, and if there are any costs associated with
procuring it. Alternatively, you may already have some data that you want to explore to try to discern
useful trends and patterns. Either way, understanding your goals will help you design and implement a
solution that best supports those goals.

Plan and configure the infrastructure. This involves setting up the Big Data software you have decided
to use, or subscribing to an online service such as Windows Azure HDInsight. You may also consider
integrating your solution with an existing database or BI infrastructure.
Obtain the data and submit it to the cluster. During this stage you decide how you will collect the data
you have identified as the source, and how you will get it into your Big Data solution for processing.
Often you will store the data in its raw format to avoid losing any useful contextual information it
contains, though you may choose to do some pre-processing before storing it to remove duplication or
to simplify it in some other way.
Process the data. After you have started to collect and store the data, the next stage is to develop the
processing solutions you will use to extract the information you need. While you can usually use Hive
and Pig queries for even quite complex data extraction, you will occasionally need to create map/reduce
components to perform more complex queries against the data.
Evaluate the results. Probably the most important step of all is to ensure that you are getting the results
you expected, and that these results make sense. Complex queries can be hard to write, and difficult to
get right the first time. It's easy to make assumptions or miss edge cases that can skew the results quite considerably. Of course, it may be that you don't know what the expected result actually is (after all, the
whole point of Big Data is to discover hidden information from the data) but you should make every
effort to validate the query results before making business decisions based on them. In many cases, a
business user who is familiar enough with the business context can perform the role of a data steward
and review the results to verify that they are meaningful, accurate, and useful.
Tune the solution. At this stage, if the solution you have created is working correctly and the results are
valuable, you should decide whether you will repeat it in the future; perhaps with new data you collect
over time. If so, you should tune the solution by reviewing the log files it creates, the processing
techniques you use, and the implementation of the queries to ensure that they are executing in the
most efficient way. It's possible to fine-tune Big Data solutions to improve performance, reduce network
load, and minimize the processing time by adjusting some parameters of the query and the execution
platform, or by compressing the data that is transferred over the network.
Visualize and analyze the results. Once you are satisfied that the solution is working correctly and
efficiently, you can plan and implement the analysis and visualization approach you require. This may be
loading the data directly into an application such as Microsoft Excel, or exporting it into a database or
enterprise BI system for further analysis, reporting, charting, and more.
Automate the solution. At this point it will be clear whether the solution should become part of your organization's business management infrastructure, complementing the other sources of information
that you use to plan and monitor business performance and strategy. If this is the case, you should
consider how you might automate some or all of the solution to provide predictable behavior, and
perhaps so that it is executed on a schedule.

Administer the solution. Any business process on which your business depends should be monitored for
errors and failures, and a Big Data solution is no different. After a solution becomes an integral part of
your organization's business management infrastructure, you should put in place systems and
procedures that allow administrators to easily manage and monitor it.


It's easy to run queries that extract data, but it's vitally important that you make every effort to validate
the results before using them as the basis for business decisions. If possible you should try to cross reference the
results with other sources of similar information.
Figure 1 shows how these development stages map to the chapters of this guide.

Figure 1
The stages of Big Data solution development, and how they map to the chapters of this guide

Note that, in many ways, data analysis is an iterative process, and you should take this approach when building a Big Data solution. In particular, given the large volumes of data and correspondingly long processing times typically involved in Big Data analysis, it can be useful to start by implementing a proof of concept iteration in which a small subset of the targeted data is used to validate the processing steps and results before proceeding with a full analysis. This enables you to test your Big Data processing design on a small HDInsight cluster or a single-node on-premises solution before scaling out to accommodate production-level data volumes.

Integrating a Big Data Process with Your Existing Systems
If you find that the Big Data process you have implemented does produce a meaningful, valid, and useful result, the next stage is to consider whether you want to integrate it with your existing data management and reporting systems. Unless you intend to carry out the process manually each time, integration typically means automating some or all of the data loading, querying, and export stages. There are many tools, frameworks, and other techniques you can use to accomplish this. While the guide cannot cover every eventuality, you will find more information in Chapter 6 to help you to achieve this.
In addition, after a Big Data solution becomes an integral part of your internal information systems, you must
consider how you will monitor and manage it to ensure that it continues to work correctly and efficiently. For
example, you must consider how you will detect failures, and how you will ensure that the solution is secure and
reliable. Chapter 7 of this guide explores these concepts.
Determining Analytical Goals and Source Data
Before embarking on a Big Data project it is generally useful to think about what you hope to achieve, and
clearly define the analytical goals of the project. In some projects there may be a specific question that the
business wants to answer, such as "where should we open our new store?" In other projects the goal may be more open-ended; for example, to examine website traffic and try to detect patterns and trends in visitor
numbers. Understanding the goals of the analysis can help you make decisions about the design and
implementation of the solution, including the specific technologies to use and the level of integration with
existing BI infrastructure.
Historical and Predictive Data Analysis
Most organizations already collect data from multiple sources. These might include line of business applications
such as websites, accounting systems, and office productivity applications. Other data may come from
interaction with customers, such as sales transactions, feedback, and reports from company sales staff. The data
is typically held in one or more databases, and is often consolidated into a specially designed data warehouse
system.
Having collected this data, organizations typically use it to perform the following kinds of analysis and reporting.
Historical analysis and reporting, which is concerned with summarizing data to make sense of what
happened in the past. For example, a business might summarize sales transactions by fiscal quarter and
sales region, and use the results to create a report for shareholders. Additionally, business analysts
within the organization might explore the aggregated data by drilling down into individual months to
determine periods of high and low sales revenue, or drilling down into cities to find out if there are
marked differences in sales volumes across geographic locations. The results of this analysis can help to
inform business decisions, such as when to conduct sales promotions or where to open a new store.
Predictive analysis and reporting, which is concerned with detecting data patterns and trends to
determine what's likely to happen in the future. For example, a business might use statistics from
historical sales data and apply it to known customer profile information to predict which customers are
most likely to respond to a direct-mail campaign, or which products a particular customer is likely to
want to purchase. This analysis can help improve the cost-effectiveness of a direct-mail campaign, or
increase sales while building closer customer relationships through relevant targeted recommendations.

Both kinds of analysis and reporting involve taking source data, applying an analytical model to that data, and
using the output to inform business decision making. In the case of historical analysis and reporting, the model is
usually designed to summarize and aggregate a large volume of data to determine meaningful business
measures (for example the total sales revenue aggregated by various aspects of the business, such as fiscal
period and sales region).
For predictive analysis, the model is usually based on a statistical algorithm that categorizes clusters of similar
data or correlates data attributes that influence one another to cause trends (for example, classifying customers
based on demographic attributes, or identifying a relationship between customer age and the purchase of
specific products).
Databases are the core of most organizations' data processing, and in most cases the purpose is simply to
run the operation by, for example, storing and manipulating data to manage stock and create invoices.
However, analytics and reporting is one of the fastest-growing sectors in business IT as managers strive to learn
more about their business.
Analytical Goals
Although every project has its own specific requirements, Big Data projects generally fall into one of the
following categories:
One-time analysis for a specific business decision. For example, a company planning to expand by
opening a new physical store might use Big Data techniques to analyze demographic data for a shortlist
of proposed store sites to determine the location that is likely to result in the highest revenue for the
store. Alternatively, a charity planning to build water supply infrastructure in a drought-stricken area
might use a combination of geographic, geological, health, and demographic statistics to identify the
best locations to target.
Open blue sky exploration of interesting data. Sometimes the goal of Big Data analysis is simply to
find out what you don't already know from the available data. For example, a business might be aware
that customers are using Twitter to discuss its products and services, and want to explore the tweets to
determine if any patterns or trends can be found relating to brand visibility or customer sentiment.
There may be no specific business decision that needs to be made based on the data, but gaining a
better understanding of how customers perceive the business might inform business decision making in
the future.
Ongoing reporting and BI. In some cases the Big Data solution will be used to support ongoing reporting
and analytics, either in isolation or integrated with an existing enterprise BI solution. For example, a real
estate business that already has a BI solution, which enables analysis and reporting of its own property
transactions across time periods, property types, and locations, might be extended to include
demographic and population statistics data from external sources. Integration patterns for HDInsight
and enterprise BI solutions are discussed later in this chapter.

In many respects, data analysis is an iterative process. It is not uncommon for an initial project based on
open exploration of data to uncover trends or patterns that form the basis for a new project to support a specific
business decision, or to extend an existing BI solution.

The results of the analysis are typically consumed and visualized in the following ways.
Custom application interfaces. For example, a custom application might display the data as a chart or
generate a set of product recommendations for a customer.
Business performance dashboards. For example, you could use the PerformancePoint Services
component of SharePoint Server to display key performance indicators (KPIs) as scorecards and
summarized business metrics in a SharePoint Server site.
Reporting solutions such as SQL Server Reporting Services or Windows Azure SQL Reporting. For
example, business reports can be generated in a variety of formats and distributed automatically by
email or viewed on demand through a web browser.
Analytical tools such as Microsoft Excel. Information workers can explore analytical data models
through PivotTables and charts, and business analysts can use advanced Excel capabilities such as
PowerPivot and Power View to create their own personal data models and visualizations, or use add-ins
to apply predictive models to data and view the results in Excel.

You will see more about the tools for analytics and reporting later in this chapter. Before then you will explore
different ways of combining a Big Data mechanism such as HDInsight with your existing business processes and
data.
Source Data
In addition to determining the analytical goals of the project, you must identify sources of data that can be used
to meet those goals. Often you will already know which data sources need to be included in the analysis; for
example, if the goal is to analyze trends in sales for the past three years you can use historic sales data from
internal business applications or a data warehouse. However, in some cases you may need to search for data to
support the analysis you want to perform; for example, if the goal is to determine the best location in which to
open a new store you may need to search for useful demographic data that covers the locations under
consideration.
It is common in Big Data projects to combine data from multiple sources and create a mash up that enables
you to analyze many different aspects of the problem in a single solution. For example, you might combine
internal historic sales data with geographic data obtained from an external source to plot sales volumes on a
map. You may then overlay the map with demographic data to try to correlate sales volume with particular geo-
demographic attributes.

Common types of data source used in a Big Data solution include:
Internal business data from existing applications or BI solutions. Often this data is historic in nature or
includes demographic profile information that the business gathered from its customers. For example,
you might use historic sales records to correlate customer attributes with purchasing patterns, and then
use this information to support targeted advertising or predictive modeling of future product plans.
Log files. Applications or infrastructure services often generate log data that can be useful for analysis
and decision making with regard to managing IT reliability and scalability. Additionally, in some cases,
combining log data with business data can reveal useful insights into how IT services support the
business. For example, you might use log files generated by Internet Information Services (IIS) to assess
network bandwidth utilization, or to correlate web site traffic with sales transactions in an ecommerce
application.
Sensors. Increased automation in almost every aspect of life has led to a growth in the amount of data
recorded by electronic sensors. For example, RFID tags in smart cards are now routinely used to track
passenger progress through mass transit infrastructure, and sensors in plant machinery generate huge
quantities of data in production lines. This availability of data from sensors enables highly dynamic
analysis and real-time reporting.
Social media. The massive popularity of social media services such as Twitter and others is a major
factor in the growth of data volumes on the Internet. Many social media services provide application
programming interfaces (APIs) that you can use to query the huge amount of data shared by users of
these services, and consume that data for analysis. For example, a business might use Twitter's query API to find tweets that mention the name of the company or its products, and analyze the data to determine how customers feel about the company's brand.
Data feeds. Many web sites and services provide data as a feed that can be consumed by client
applications and analytical solutions. Common feed formats include RSS, ATOM, and industry-defined XML formats; the data sources themselves include blogs, news services, weather forecasts, and financial markets data (a short sketch showing one way to consume a feed appears after this list).
Government and special interest groups. Many government organizations and special interest groups
publish data that can be used for analysis. For example, the UK government publishes over 9000
downloadable datasets in a variety of formats, including statistics on population, crime, government
spending, health, and other factors of life in the UK. Similarly, the US government provides census data
and other statistics as downloadable datasets or in dBASE format on CD-ROM. Additionally, many
international organizations provide data free of charge; for example, the United Nations makes
statistical data available through its own website and in the Windows Azure DataMarket.
Commercial data providers. There are many organizations that sell data commercially, including
geographical data, historical weather data, economic indicators, and others. The Windows Azure
DataMarket provides a central service through which you can locate and purchase subscriptions to
these data sources.
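As an example of the data feeds option described in the list above, the following C# sketch uses the .NET syndication classes to read the items in an RSS or ATOM feed. The feed URL is a hypothetical placeholder, and in practice you would write the extracted values to a file ready for upload to your cluster rather than to the console.

    using System;
    using System.ServiceModel.Syndication;   // requires a reference to System.ServiceModel
    using System.Xml;

    public class ReadFeed
    {
        public static void Main()
        {
            // Hypothetical feed URL; substitute the feed you want to capture.
            using (XmlReader reader = XmlReader.Create("http://example.org/news/feed"))
            {
                SyndicationFeed feed = SyndicationFeed.Load(reader);
                foreach (SyndicationItem item in feed.Items)
                {
                    // Emit one tab-delimited record per item, ready for later processing.
                    Console.WriteLine("{0}\t{1}\t{2}",
                        item.PublishDate.ToString("u"),
                        item.Title.Text,
                        item.Links.Count > 0 ? item.Links[0].Uri : null);
                }
            }
        }
    }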

Just because data is available doesn't mean it is useful, or that the effort of using it will be viable. Think
about the value the analysis can add to your business before you devote inordinate time and effort to collecting
and analyzing it.

When planning data sources to use in your Big Data solution, consider the following factors:
Availability. How easy is it to find and obtain the data? You may have a specific analytical goal in mind,
but if the data required to support the analysis is difficult (or impossible) to find you may waste valuable
time trying to obtain it. When planning a Big Data project, it can be useful to define a schedule that
allows sufficient time to research what data is available. If the data cannot be found after an agreed
deadline you may need to revise the analytical goals.
Format. In what format is the data available, and how can it be consumed? Some data is available in
standard formats and can be downloaded over a network or Internet API. In other cases the data may
be available only as a real-time stream that you must capture and structure for analysis. Later in the
process you will consider tools and techniques for consuming the data from its source and ingesting it
into your cluster, but even during this early stage you should identify the format and connectivity
options for the data sources you want to use.
Relevance. Is the data relevant to the analytical goals? You may have identified a potential data source
and already be planning how you will consume it and ingest it into the analytical process. However, you
should first examine the data source carefully and ensure that the data it contains is relevant to the
analysis that needs to be performed.
Cost. You may determine the availability of a relevant dataset, only to discover that the cost of
obtaining the data outweighs the potential business benefit of using it. This can be particularly true if
the analytical goal is to augment an enterprise BI solution with external data on an ongoing basis, and
the external data is only available through a commercial data provider.

Planning and Configuring the Infrastructure
Having identified the data sources you need to analyze, you must design and configure the infrastructure
required to support the analysis. Considerations for planning an infrastructure for Big Data analysis include
deciding whether to use an on-premises or a cloud-based implementation, understanding how file storage
works, and how you can integrate the solution with an existing enterprise BI infrastructure if this is a
requirement.
This guide concentrates on Windows Azure HDInsight. However, the considerations for choosing an
infrastructure option are typically relevant irrespective of the Big Data mechanism you choose.
Choosing an Operating Location
Big Data solutions are available as both cloud-hosted services and on-premises installable solutions. For example, HDInsight is available as a Windows Azure service, but you can install the Hortonworks Hadoop distribution as
an on-premises solution on Windows Server. Consider the following guidelines when choosing between these
options:
A cloud-hosted mechanism such as HDInsight on Windows Azure is a good choice when:
You require the solution to be running for only a specific period of time.
You require the solution to be available for ongoing analysis, but the workload will vary, sometimes requiring a cluster with many nodes and sometimes not requiring any service at all.
You want to avoid the cost in terms of capital expenditure, skills development, and the time it takes to
provision, configure, and manage an on-premises solution.

An on-premises solution such as Hadoop on Windows Server is a good choice when:
You require ongoing services with a predictable and constant level of scalability.
You have the necessary technical capability and budget to provision, configure, and manage your own
cluster.
The data you plan to analyze must remain on your own servers for compliance or confidentiality
reasons.

If you want to run a full multi-node Hadoop installation on Windows Server you should consider the Hadoop distribution available from Hortonworks. This is the core Hadoop implementation used by HDInsight. For single
node development and testing of Hadoop solutions on Windows you can download the Microsoft HDInsight
Developer Preview and install it on most versions of Windows.
Data Storage in Windows Azure HDInsight
HDInsight on Windows Azure stores data in Azure Storage Vault (ASV) containers in Windows Azure blob
storage. However, it supports Hadoop file system commands and processes through a fully HDFS-compliant
layer over the ASV. As far as Hadoop is concerned, storage operates in exactly the same way as when using a
physical HDFS implementation. The only real difference is that you can access the storage in the ASV using
standard Windows Azure blob storage techniques as well as through the HDFS layer.
It may seem as though using a separate storage mechanism from the HDFS cluster instances defeats the object
of Hadoop, where one of the main tenets is that the data does not need to be moved because it is distributed
amongst and stored on each server in the cluster. However, Windows Azure datacenters use flat network
storage, which provides extremely fast, high bandwidth connectivity between storage and the virtual machines
that make up an HDInsight cluster.

Using Windows Azure blob storage provides several advantages:
Data storage costs are minimized because you can decommission a cluster when not performing
queries; data in Windows Azure blob storage is persisted when the HDInsight cluster is released and you
can build a new cluster on the existing source data in blob storage. Windows Azure blob storage is also
considerably cheaper than the cost of storage attached to the HDInsight cluster members.
Data transfer costs are minimized because you do not have to upload the data again over the Internet
when you recreate a cluster that uses the same data; there is no transfer cost when loading an HDFS
cluster from blob storage in the same datacenter.
By default, when you delete and recreate a cluster, you will lose all of the metadata associated with
the cluster such as Hive external table and index definitions. See the section Retaining Data and
Metadata for a Cluster in Chapter 7 for information about how you can preserve and reinstate this
metadata.
Data stored in Windows Azure blob storage can be accessed by and shared with other applications and
services, whereas data stored in HDFS can only be accessed by HDFS-aware applications that have
access to the cluster storage.
Data in Windows Azure blob storage is replicated across three locations in the datacenter, so it provides
a similar level of redundancy to protect against data loss as an HDFS cluster. Blob storage can be used to
store large volumes of data without being concerned about scaling out storage in a cluster, or changing
the scaling in response to changes in storage requirements.
The high speed datacenter network provides fast access between the virtual machines in the cluster, so
data movement that occurs between nodes during the reduce stage is very efficient.
Tests carried out by the Windows Azure team indicate that blob storage provides near identical
performance to HDFS when reading data. However, Windows Azure blob storage provides faster write
access than using HDFS to store data on the nodes in the cluster because write operations complete as
soon as the single write operation back to the blob has completed. When using HDFS with data stored in
the cluster, a write operation is not complete until all of the replicas (by default there are three) have
completed.

However, it's important to note that these advantages only apply when you locate the HDInsight cluster in the
same datacenter as your ASV. If you choose different datacenters you will severely hamper performance, and
you will also be charged for data transfer costs between datacenters.

For more information about the use of ASV instead of HDFS for data storage see the blog post Why use Blob
Storage with HDInsight on Azure. For details of the flat network storage see Windows Azure's Flat Network
Storage and 2012 Scalability Targets.

By default, when you configure an HDInsight cluster, it uses a storage account located in the same
datacenter. If this was not the case, your queries would be extremely inefficient and you would also be charged
data transfer costs! Consider turning off the default geo-replication feature for the storage account unless the
data is very difficult to collect or recreate and you need this redundancy. If the processing fails over to a remote
storage account, query efficiency will be severely affected and you will incur additional transfer costs moving the
data between datacenters.

Configuring an HDInsight Cluster
You access Windows Azure HDInsight through the Windows Azure Management Portal at
https://manage.windowsazure.com/. The first step is to create a new HDInsight cluster and, as with most other
Windows Azure services, you have two options:
Quick Create allows you to create a cluster simply by specifying a unique name for the cluster (which
will be available at your-name.azurehdinsight.net), the cluster size (the default is four nodes), a
password for the cluster (the default user name is admin), and an existing Windows Azure storage
account in which HDInsight will store the cluster data.
Custom Create allows you to specify a different user name from the default of admin, and select a
container within your storage account that holds the source data for the cluster.

You don't need to perform any other configuration before you start using the cluster. However, there are many configuration settings that you can change to modify the way that the cluster works and to fine-tune the performance of your queries. For example, you can specify the number of map and reduce tasks for a query and apply compression to the data in a script, by adding parameters to the command that runs the query, or by
editing the cluster configuration files.

For more information about setting configuration parameters see Chapter 7, Managing and Monitoring
HDInsight, of this guide.

Choosing an Architectural Style
Big Data solutions open up new opportunities for converting data into information. They can also be used to
extend existing information systems to provide additional insights through analytics and data visualization. Every
organization is different, and so there is no definitive list of ways that you can use a Big Data solution as part of
your own business processes. However, there are four general architectural models. Understanding these will
help you to start making decisions on how best to integrate Big Data solutions such as HDInsight with your
organization, and with your existing BI systems and tools. The four different models are:
As a standalone data analysis experimentation and visualization tool. This model is typically chosen for
experimenting with data sources to discover if they can provide useful information, and for handling
data that you cannot process using existing systems. For example, you might collect feedback from
customers through email, web pages, or external sources such as social media sites, then analyze it to
get a picture of user sentiment for your products. You might be able to combine this information with
other data, such as demographic data that indicates population density and characteristics in each city
where your products are sold.
As a data transfer, data cleansing, or ETL mechanism. Big Data solutions can be used to extract and
transform data before you load it into your existing databases or data visualization tools. Such solutions
are well suited to performing categorization and normalization of data, and for extracting summary
results to remove duplication and redundancy. This is typically referred to as an Extract, Transform, and
Load (ETL) process.
As a basic data warehouse or commodity storage mechanism. Big Data solutions allow you to store
both the source data and the results of queries executed over this data. You can also store schemas (or,
to be precise, metadata) for tables that are populated by the queries you execute. These tables can be
indexed, although there is no formal mechanism for managing key-based relationships between them.
However, you can create data repositories that are robust and reasonably low cost to maintain, which is
especially useful if you need to store and manage huge volumes of data.
Integration with an enterprise data warehouse and BI system. Enterprise-level data warehouses have
some special characteristics that differentiate them from OLTP database systems, and so there are
additional considerations for integrating with Big Data solutions. You can also integrate at different
levels, depending on the way that you intend to use the data obtained from your Big Data solution.

The four scenario chapters later in this guide, and the associated Hands-on Labs that you
can download from http://wag.codeplex.com/releases/view/103405, provide examples of
each of these architectural styles. Each scenario explores how a fictitious organization
implemented a solution using one of the architectural styles described in this chapter.
Each of these basic models encompasses several more specific use cases. For example, when integrating with an
enterprise data warehouse and BI system you need to decide at what level (report, corporate data model, or
data warehouse level) you will integrate the output from your Big Data solutions.
The following sections of this chapter explore the four models listed above in more detail, using HDInsight as the
Big Data mechanism. You will see advice on when to choose each one, how you can use it to provide the
information you need, and some of the typical use cases for each one.
Using HDInsight as a Standalone Experimentation and Visualization Tool
Traditional data storage and management systems such as data warehouses, data models, and reporting and
analytical tools provide a wealth of information on which to base business decisions. However, while traditional
BI works well for business data that can easily be structured, managed, and processed in a dimensional analysis
model, some kinds of analysis require a more flexible solution that can derive meaning from less obvious
sources of data such as log files, email messages, tweets, and others.
There's a great deal of useful information to be found in these less structured data sources, which often contain
huge volumes of data that must be processed to reveal key data points. This kind of data processing is what Big
Data solutions such as HDInsight were designed to handle. HDInsight provides a way to process extremely large
volumes of unstructured or semi-structured data, often by performing complex computation and transformation
of the data, to produce an output that can be visualized directly or combined with other datasets.
If you do not intend to reuse the information from the analysis, but just want to explore the data, you may
choose to consume it directly in an analysis or visualization tool such as Microsoft Excel. The standalone
experimentation and visualization model is typically suited to the following scenarios:
Handling data that you cannot process using existing systems, perhaps by performing complex
calculations and transformations that are beyond the capabilities of existing systems to complete in a
reasonable time.
Collecting stream data that arrives at a high rate or in large batches without affecting existing database
performance.
Collecting feedback from customers through email, web pages, or external sources such as social media
sites, then analyzing it to get a picture of customer sentiment for your products.
Combining information with other data, such as demographic data that indicates population density and
characteristics in each city where your products are sold.
Dumping data from your existing information systems so that you can work with it without interrupting
other business processes or risking corruption of the original data.
Trying out new ideas and validating processes before implementing them within the live system.


Combining your data with datasets available from Windows Azure Data Market or other commercial
data sources can reveal useful information that might otherwise remain hidden in your data.
Architecture Overview
Figure 2 shows an overview of the architecture for a standalone data analysis experimentation and visualization
solution using HDInsight. The source data files are loaded into the HDInsight cluster, processed by one or more
queries within HDInsight, and the output is consumed by the chosen reporting and visualization tools.


Figure 2
High-level architecture for the standalone experimentation and visualization model
Text files such as log files and compressed binary files can be loaded directly into the cluster storage, while
stream data will usually need to be handled by a suitable mechanism such as StreamInsight. The output data
may be combined with other datasets within your visualization and reporting tools to augment the information
and provide comparisons, as you will see later in this section of the chapter.
When using HDInsight as a standalone query mechanism, you will often do so as an interactive process. For
example, you might use the Data Explorer in Excel to submit a query to an HDInsight cluster and wait for the
results to be returned, usually within a few seconds or even a few minutes. You can then modify and
experiment with the query to optimize the information it returns.
However, keep in mind that all Big Data operations are batch jobs that are submitted to all of the servers in the
cluster for parallel processing, and queries can often take minutes or hours to complete. For example, you might
use Pig to process an input file, with the results returned in an output file some time later, at which point you
can perform the analysis by importing this file into your chosen visualization tool.

Data Sources for the Standalone Experimentation and Visualization Model
The input data for HDInsight can be almost anything, but for this model it typically includes the following:
Social data, log files, sensors, and applications that generate data files.
Datasets obtained from Windows Azure Data Market and other commercial data providers.
Internal data extracted from databases or data warehouses for experimentation and one-off analysis.
Streaming data filtered and pre-processed through SQL Server StreamInsight.

Notice that, as well as externally obtained data, Figure 2 includes data from within your organization's existing
database or data warehouse. HDInsight is an ideal solution when you want to perform offline exploration of
existing data in a sandbox. For example, you may join several datasets from your data warehouse to create large
datasets that act as the source for some experimental investigation, or to test new analysis techniques. This
avoids the risk of interrupting existing systems, affecting performance of your data warehouse system, or
accidentally corrupting the core data.
Often you need to perform more than one query on the data in HDInsight to get the results into the form
you need. It's not unusual to base queries on the results of a preceding query; for example, using one query to
select and transform the required data and remove redundancy, a second query to summarize the data returned
from the first query, and a third query to format the output as required. This iterative approach enables you to
start with a large volume of complex and difficult to analyze data, and get it into a structure that you can
consume directly from an analytical tool such as Excel, or, as you'll see later, use as the input into a managed
BI solution.
Output Targets for the Standalone Experimentation and Visualization Model
The results from your HDInsight queries can be visualized using any of the wide range of tools that are available
for analyzing data, combining it with other datasets, and generating reports. Typical examples for the
standalone experimentation and visualization model are:
HDInsight interactive reporting and charting.
Interactive analytical tools such as Excel, PowerPivot, Power View, and GeoFlow.
SQL Server Reporting Services using Report Builder.
Custom or third party analysis and visualization tools.

You will see more details of these tools in Chapter 5, Consuming and Visualizing the Results, of this guide.

Considerations for the Standalone Experimentation and Visualization Model
There are some important points to consider when choosing the standalone model for querying and visualizing
data with HDInsight.
This model is typically used when you want to:
Experiment with new types or sources of data.
Generate one-off reports or visualizations of external or internal data.
Monitor a data source using visualizations to detect changes or to predict behavior.
Combine the results with other data to generate comparisons or to augment the information.
You will usually choose this model when you do not want to persist the results of the query after
analysis, or after the required reports have been generated. It is typically used for one-off analysis tasks
where the results are discarded after use, and so differs from the other models described in this
chapter, in which the results are stored and reused.
Very large datasets are likely to preclude the use of an interactive approach due to the time taken for
the queries to run. However, after the queries are complete you can work interactively with the data to
perform different types of analysis or visualization.
Data arriving as a stream, such as the output from sensors on an automated production line or the data
generated by GPS sensors in mobile devices, cannot be analyzed directly by HDInsight. A typical
technique is to capture it using a stream processing technology such as StreamInsight, which may power
a real-time visualization or rudimentary analysis tool, and also feed it into an HDInsight cluster where it
can be queried interactively or as a batch job at suitable intervals.
You are not limited to running a single query on the source data. You can follow an iterative pattern in
which the data is passed through HDInsight multiple times, each pass refining the data until it is suitably
prepared for use in your analytical tool. For example, a large unstructured file might be processed using
custom MapReduce components to generate a smaller, more structured output file. This output could
then be used in turn as the input for a Hive query that returns aggregated data in tabular form, as
sketched below.
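As an illustration of this two-stage pattern, the sketch below assumes that a custom MapReduce job has already
written tab-delimited records to a folder in the cluster storage. A Hive external table (the table, column, and
folder names are hypothetical) is defined over that folder and then queried to produce a small, aggregated
result that is easy to consume in a tool such as Excel.

-- Expose the output of the earlier MapReduce pass as a Hive table.
CREATE EXTERNAL TABLE page_hits (page STRING, hit_date STRING, hits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/mapreduce-output/page-hits';

-- Second pass: aggregate the refined data into a compact, tabular result.
SELECT page, SUM(hits) AS total_hits
FROM page_hits
GROUP BY page
ORDER BY total_hits DESC;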

For an example of a scenario that uses the standalone experimentation and visualization
architectural style see Scenario 1, Iterative Data Exploration, in Chapter 8 of this guide.
Using HDInsight as a Data Transfer, Data Cleansing, or ETL Mechanism
In a traditional business environment the data to power your reporting mechanism will usually come from tables
in a database. However, it's increasingly necessary to supplement this with data obtained from outside your
organization. This data may come from commercially available datasets, such as those available from Windows
Azure Data Market and elsewhere, or from less structured sources such as social media, emails, log files, and more.
You will, in most cases, need to cleanse, validate, and transform this data before loading it into an existing
database. Extract, Transform, and Load (ETL) operations can use Big Data solutions such as HDInsight to perform
pattern matching, data categorization, de-duplication, and summary operations on unstructured or semi-
structured data to generate data in the familiar rows and columns format that can be imported into a database
table or a data warehouse.
The data transfer, data cleansing, or ETL model is typically suited to the following scenarios:
Extracting and transforming data before you load it into your existing databases or analytical tools.
Performing categorization and restructuring of data, and extracting summary results to remove
duplication and redundancy.
Preparing data so that it is in the appropriate format and has appropriate content to power other
applications or services.

ETL may not seem like the most exciting process in the world of IT, but it's the primary way that we can
ensure that the data is valid and contains correct values. ETL gets the data into the correct format for our
database tables, while data cleansing is the gatekeeper that protects our tables from invalid, duplicated, or
incorrect values.
Architecture Overview
Figure 3 shows an overview of the data transfer, data cleansing, or ETL model. Input data is transformed by
HDInsight to generate the appropriate output format and data content, and then imported into the target data
store, application, or reporting solution. Analysis and reporting can then be done against the data store, often by
combining the imported data with existing data in the data store. Alternatively, applications can use data in a
specific format for a variety of purposes.

Figure 3
High-level architecture for the data transfer, data cleansing, or ETL model
The transformation process in HDInsight may involve just a single query, but it is more likely to require a multi-
step process. For example, it might use custom Map and Reduce components or (more likely) Pig scripts,
followed by a Hive query in the final stage to generate the table format. However, the final format for import
into a data store may be something other than a table; for example, it may be a tab delimited file or some other
format suitable for import into the target data store.
Data Sources for the Data Transfer, Data Cleansing, or ETL Model
Data sources for this model are typically external data that can be matched on a key to existing data in your data
store so that it can be used to augment the results of analysis and reporting processes. Some examples are:
Social media data, log files, sensors, and applications that generate data files.
Datasets obtained from Windows Azure Data Market and other commercial data providers.
Streaming data filtered or processed through SQL Server StreamInsight.

Output Targets for the Data Transfer, Data Cleansing, or ETL Model
This model is designed to generate output that is in the appropriate format for the target data store. Common
types of data store are:
A database such as SQL Server or Windows Azure SQL Database.
A document or file sharing mechanism such as SharePoint server or other information management
systems.
A local or remote data repository in a custom format, such as JSON objects.
Cloud data stores such as Windows Azure table or blob storage.
Applications or services that require data to be processed into specific formats, or files that contain
specific types of information structure.

You may decide to use this approach even when you don't actually want to keep the
results of the HDInsight query. You can load it into your database, generate the reports
and analyses you require, and then delete the data from the database. You may need to
do this every time if the source data in HDInsight changes between reporting cycles, so
that simply adding new data is not appropriate.
Considerations for the Data Transfer, Data Cleansing, or ETL Model
There are some important points to consider when choosing the data transfer, data cleansing, or ETL model.
This model is typically used when you want to:
Load stream data or large volumes of semi-structured or unstructured data from external
sources into an existing database or information system.
Cleanse, transform, and validate the data before loading it.
Generate reports and visualizations that are regularly updated.
Power other applications that require specific types of data, such as using an analysis of previous
behavioral information to apply personalization to an application or service.
When the output is in tabular format, such as that generated by Hive, the data import process can use
the Hive ODBC driver. Alternatively, you can use Sqoop (which is included in the Hadoop distribution
installed by HDInsight) to connect a relational database such as SQL Server or Windows Azure SQL
Database to your HDInsight data store and export the results of a query into your database; a sketch of
a typical Sqoop export command is shown after this list.
Some other connectors for accessing Hive data are available from Couchbase, Jaspersoft, and Tableau
Software.
If the target for the data is not a database, you can generate a file in the appropriate format within the
HDInsight query. This might be a tab-delimited format, fixed-width columns, some other format for
loading into Excel or a third-party application, or a format suitable for loading into Windows Azure
storage through a custom data access layer that you create.
If the intention is to regularly update the target table or data store as the source data changes, you will
probably choose to use an automated mechanism to execute the query and data import processes.
However, if it is a one-off operation you may decide to execute it interactively only when required.
Windows Azure table storage can be used to store table formatted data using a key to identify each
row. Windows Azure blob storage is more suitable for storing compressed or binary data generated
from the HDInsight query if you want to store it for reuse.
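As mentioned above, Sqoop can export tabular results from the cluster storage directly into a relational
database table. The following command-line sketch shows the general shape of such an export; the connection
string, table name, and folder path are hypothetical placeholders, and the exact arguments may vary with the
versions of Sqoop and the JDBC driver installed on your cluster.

REM Export a tab-delimited results folder into an existing SQL Server table.
sqoop export --connect "jdbc:sqlserver://dbserver:1433;databaseName=SalesDW;user=loader;password=*****" ^
  --table SentimentSummary ^
  --export-dir /hive/warehouse/sentimentsummary ^
  --input-fields-terminated-by "\t"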

For an example of a scenario that uses the data transfer, data cleansing, or ETL architectural
style see Scenario 2, Extract, Transform, Load, in Chapter 9 of this guide.
Using HDInsight as a Basic Data Warehouse or Commodity Data Store
Big Data solutions such as HDInsight can provide a robust, high performance, and cost-effective data storage and
querying mechanism. Data is replicated in the storage system, and queries are distributed across the nodes for
fast parallel processing. HDInsight also saves the output from queries in the ASV replicated blob storage.
This combination of capabilities means that you can use HDInsight as a basic data warehouse. The low cost of
storage when compared to most relational database mechanisms that have the same level of reliability also
means that you can use HDInsight simply as a commodity storage mechanism for huge volumes of data, even if
you decide not to transform it into Hive tables.
If you need to store vast amounts of data, irrespective of the format of that data, a Big Data solution such
as HDInsight can reduce administration overhead and save money by minimizing the need for the high
performance database servers and storage clusters used by traditional relational database systems. You may
even choose to use a cloud hosted Big Data solution in order to reduce the requirements and running costs of on-
premises infrastructure.
The data warehouse or commodity data store model is typically suited to the following scenarios:
Exposing both the source data and the results of queries executed over this data in the familiar row and
column format, using a range of data types for the columns that includes both primitive types (including
timestamps) and complex types such as arrays, maps, and structures.
Storing schemas (or, to be precise, metadata) for tables that are populated by the queries you execute.
Creating indexes for tables, and partitioning tables based on a clustered index so that each has a
separate metadata definition and can be handled separately.
Creating views based on tables, and creating functions for use in both tables and queries.
Creating data repositories that are robust and reasonably low cost to maintain, which is especially
useful if you need to store and manage huge volumes of data.
Consuming data directly from HDInsight in business applications, through interactive analytical tools
such as Excel, or in corporate reporting platforms such as SQL Server Reporting Services.
Storing large quantities of data in a robust yet cost-effective way, for use now or in the future.
Architecture Overview
Figure 4 shows an overview of the architecture for this model. The source data files may be obtained from
external sources or as internal data generated by your business processes and applications. You might decide to
use this model instead of using a locally installed data warehouse based on a relational database model.
However, there are some limitations when using a Big Data solution such as HDInsight as a data warehouse.

Figure 4
High-level architecture for the data warehouse or commodity data store model
This model is also suitable for use as a data store where you do not need to implement the typical data
warehouse capabilities. For example, you may just want to minimize storage cost when saving large tabular
format data files for use in the future, large text files such as email archives or data that you must keep for legal
or regulatory reasons but do not need to process, or large quantities of binary data such as
images or documents. In this case you simply load the data into the Windows Azure blob storage associated with
the cluster, without creating Hive tables for it.
Big Data storage is robust when a cluster is running because the data is replicated at least three times
across the nodes of the cluster. When you use HDInsight, the source data in Windows Azure storage is replicated
three times as well, including geo-replication unless you specifically disable this option.
You might also consider storing partly processed data where you have performed some translation or summary
of the data but it is still in a relatively raw form that you want to keep in case it is useful in the future. For
example, you might use Microsoft StreamInsight to allocate incoming positional data from a fleet of vehicles
into separate categories or areas, and add some reference key to each item, then store the results ready for
processing at a later date. Stream data typically generates very large files, and may arrive in rapid bursts, so
using a solution such as HDInsight helps to minimize the load on your existing data management systems.
When you need to process the archived data, you create an HDInsight cluster based on the Windows Azure blob
storage container holding that data. When you finish processing the data you can tear down the cluster without
losing the original archived data.
See the section Retaining Data and Metadata for a Cluster in Chapter 7, and Scenario 3 in Chapter 10 of this
guide, for information about preserving or recreating metadata such as Hive table definitions when you tear
down and then recreate an HDInsight cluster.
Data Sources for the Data Warehouse or Commodity Data Store Model
Data sources for this model are typically data collected from internal and external business processes, but may
also include reference data and datasets obtained from other sources that can be matched on a key to existing
data in your data store so that it can be used to augment the results of analysis and reporting processes. Some
examples are:
Data generated by internal business processes, websites, and applications.
Reference data and data definitions used by business processes.
Datasets obtained from Windows Azure Data Market and other commercial data providers.

If you adopt this model simply as a commodity data store rather than a data warehouse, you might also load
data from other sources such as social media data, log files, and sensors; or streaming data filtered or processed
through SQL Server StreamInsight.
Output Targets for the Data Warehouse or Commodity Data Store Model
The main intention of this model is to provide the equivalent to an internal data warehouse system based on the
traditional relational database model, and expose it as Hive tables. You can use these tables in a variety of ways,
such as:
To combine the datasets for analysis, and to generate reports and business information.
To generate ancillary information such as related items or recommendation lists for use in
applications and websites.
For external access through web applications, web services, and other services.
To power information systems such as SharePoint server through web parts and the Business Data
Connector (BDC).

If you adopt this model simply as a commodity data store rather than a data warehouse, you might use the data
you store as an input for any of the models described in this chapter.
The data in an HDInsight data warehouse can be analyzed and visualized using any tools that can consume Hive
tables. Typical examples are:
SQL Server Reporting Services
SQL Server Analysis Services
Interactive analytical tools such as Excel, PowerPivot, Power View, and GeoFlow
Custom or third party analysis and visualization tools

You will see more details of the reporting tools in Chapter 5, Consuming and Visualizing the Results, of this
guide. The section Corporate Data Model Level Integration later in this chapter discusses SQL Server Analysis
Services. You can also download a case study that describes using SQL Server Analysis Services with Hive.
Considerations for the Data Warehouse or Commodity Data Store Model
There are some important points to consider when choosing the data warehouse or commodity data store
model for querying and visualizing data with HDInsight.
This model is typically used when you want to:
Create a central point for analysis and reporting of Big Data by multiple users and tools.
Store multiple datasets for use by internal applications and tools.
Host your data in the cloud to benefit from reliability and elasticity, minimize cost, and reduce
administration overhead.
Store both externally collected data and data generated by internal tools and processes.
Refresh the data at scheduled intervals or on demand.
You can use Hive in HDInsight to:
Define tables that have the familiar row and column format, with a range of data types for the
columns that includes both primitive types (including timestamps) and complex types such as
arrays, maps, and structures.
Load data from the file system into tables, save data to the file system from tables, and populate
tables from the results of running a query.
Create indexes for tables, and partition tables based on a clustered index so that each has a
separate metadata definition and can be handled separately.
Rename, alter and drop tables, and modify columns in a table as required.
Create views based on tables, and create functions for use in both tables and queries.
For more details of how to work with Hive tables, see Hive Data Definition Language on the Apache
Hive website.
The main limitation of Hive tables compared to a relational database is that you cannot create
constraints such as foreign key relationships that are automatically managed.
Incoming data may be processed by any type of query, not just Hive, to cleanse and validate the data
before converting it to table format.
You can store the Hive queries and views within HDInsight so that they can be used in much the same
way as the stored procedures in a relational database to extract data on demand. However, keep in
mind that the core Hadoop engine is designed for batch processing by distributing data and queries
across all the nodes in the cluster. To minimize response times you will probably need to pre-process
the data where possible using queries within HDInsight, and store these intermediate results in order to
reduce the time-consuming overhead of complex queries.
You can use the HDInsight ODBC connector in SQL Server to create linked servers. This allows you to
write Transact-SQL queries that join tables in a SQL Server database to tables stored in an HDInsight
data warehouse.
If you want to be able to delete and restore the cluster, as described in Chapter 7 of this guide, you
must create the Hive tables as external table definitions. See the section Managing Hive Table Data
Location and Lifetime in Chapter 4 of this guide for more information about creating external tables in
Hive. A minimal example of an external table definition is sketched after this list.
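As a minimal sketch (the table, column, and folder names are hypothetical), the following HiveQL defines a
partitioned external table whose data files remain in the cluster's blob storage even if the table, or the cluster
itself, is later dropped:

-- External table: dropping the table or the cluster leaves the data files in blob storage.
CREATE EXTERNAL TABLE sales (order_id STRING, amount DOUBLE)
PARTITIONED BY (sale_year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/warehouse/sales';

-- Register the folder that holds each year's data against its partition.
ALTER TABLE sales ADD PARTITION (sale_year=2013) LOCATION '/warehouse/sales/2013';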

For an example of a scenario that uses the basic data warehouse or commodity data store
architectural style, see Scenario 3, HDInsight as a Data Warehouse, in Chapter 10 of this
guide.
Integrating HDInsight with Enterprise Data Warehouses and BI
You saw in Chapter 1 that Microsoft offers a set of applications and services to support end-to-end solutions for
working with data. The Microsoft data platform includes all of the elements that make up an enterprise level BI
system for organizations of any size.
Organizations that already use an enterprise BI solution for business analytics and reporting can extend their
analytical capabilities by using HDInsight to add new sources of data to their decision making processes. This
model is typically suited to the following scenarios:
You have an existing enterprise data warehouse and BI system that you want to augment with data
from outside your organization.
You want to explore new ways to combine data to provide better insight into history and to predict
future trends.
You want to give users more opportunities for self-service reporting and analysis that combines
managed business data and Big Data from other sources.


The information in this section will help you to understand how you can integrate
HDInsight with an enterprise BI system. However, a complete discussion of the factors
related to the design and implementation of a data warehouse and BI solution is beyond
the scope of this guide. Appendix B, An Overview of Enterprise Business Intelligence
provides a high-level description of the key features commonly found in an enterprise BI
infrastructure. You will find this useful if you are not already familiar with such systems.

Architecture Overview
Enterprise BI is a topic in itself, and there are several factors that require special consideration when integrating
a Big Data solution such as HDInsight with an enterprise BI system. If you are not already familiar with enterprise
BI systems, it is worthwhile exploring Appendix B, An Overview of Enterprise Business Intelligence of this
guide. It will help you to understand the main factors that make an enterprise BI system different from
transactional data management scenarios.
Figure 5 shows an overview of a typical enterprise data warehouse and BI solution. Data flows from business
applications and other external sources through an ETL process and into a data warehouse. Corporate data
models are then used to provide shared analytical structures such as online analytical processing (OLAP) cubes
or data mining models, which can be consumed by business users through analytical tools and reports. Some BI
implementations also enable analysis and reporting directly from the data warehouse, enabling advanced users
such as business analysts and data scientists to create their own personal analytical data models and reports.


Figure 5
Overview of a typical enterprise data warehouse and BI implementation
There are many technologies and techniques that you can use to combine HDInsight and existing BI
technologies, but as a general guide there are three primary ways in which HDInsight can be used with an
enterprise BI solution:
1. Report level integration. Data from HDInsight is used in reporting and analytical tools to augment
data from corporate BI sources, enabling the creation of reports that include data from corporate BI
sources as well as HDInsight, and also enabling individual users to combine data from both solutions
into consolidated analyses. This level of integration is typically used for creating mashups, exploring
datasets to discover possible queries that can find hidden information, and for generating one-off
reports and visualizations.
2. Corporate data model level integration. HDInsight is used to process data that is not present in the
corporate data warehouse, and the results of this processing are then added to corporate data
models where they can be combined with data from the data warehouse and used in multiple
corporate reports and analysis tools. This level of integration is typically used for exposing the data in
specific formats to information systems, and for use in reporting and visualization tools.
3. Data warehouse level integration. HDInsight is used to prepare data for inclusion in the corporate
data warehouse. The HDInsight data that has been loaded is then available throughout the entire
enterprise BI solution. This level of integration is typically used to create standalone tables on the
same database hardware as the enterprise data warehouse, which provides a single source of
enterprise data for analysis, or to incorporate the data into a dimensional schema and populate
dimension and fact tables for full integration into the BI solution.

Figure 6 shows an overview of these three levels of integration.


Figure 6
Three levels of integration for HDInsight with an enterprise BI system
The following sections describe these three integration levels in more detail to help you understand the
implications of your choice. They also contain guidelines for implementing each one. However, keep in mind
that you don't need to make just a single choice for all of your processes; you can use a different approach for
each dataset that you extract from HDInsight, depending on the scenario and the requirements for that dataset.
Choose a suitable integration level for each dataset or data querying operation you carry out. You aren't
limited to using the same one for every task that interacts with your Big Data solution.
Report Level Integration
Using HDInsight in a standalone non-integrated way, as described earlier in this chapter, can unlock information
from currently unanalyzed data sources. However, much greater business value can be obtained by integrating
the results from HDInsight queries with data and BI activities that are already present in the organization.
In contrast to the rigidly defined reports created by BI developers, a growing trend is a self-service approach.
In this approach the data warehouse provides some datasets based on data models and queries defined there,
but the user selects the datasets and builds a custom report. The ability to combine multiple data sources in a
personal data model enables a more flexible approach to data exploration that goes beyond the constraints of a
formally managed corporate data warehouse. Users can augment reports and analysis of data from the
corporate BI solution with additional data from HDInsight to create a mashup solution that brings data from
both sources into a single, consolidated report.
You can use the following techniques to integrate output from HDInsight with enterprise BI data at the report
level.
Use the Data Explorer add-in to download the output files generated by HDInsight and open them in
Excel, or import them into a database for reporting.
Create Hive tables in HDInsight and consume them directly from Excel (including using PowerPivot or
GeoFlow) or from SQL Server Reporting Services (SSRS) by using the ODBC driver for Hive.
Use Sqoop to transfer the results from HDInsight into a relational database for reporting. For example,
copy the output generated by HDInsight into a Windows Azure SQL Database table and use Windows
Azure SQL Reporting Services to create a report from the data.
Use SQL Server Integration Services (SSIS) to transfer and, if required, transform HDInsight results to a
database or file location for reporting. If the results are exposed as Hive tables you can use an ODBC
data source in an SSIS data flow to consume them. Alternatively, you can create an SSIS control flow
that downloads the output files generated by HDInsight and uses them as a source for a data flow.

Microsoft SQL Server Connector for Apache Hadoop (SQL Server-Hadoop Connector) is a Sqoop-based connector
for data transfer between SQL Server 2008 R2 and Hadoop. You can download this from the Microsoft Download
Center.

Corporate Data Model Level Integration
Integration at the report level makes it possible for advanced users to combine data from HDInsight with
existing corporate data sources to create data mashups. However, there may be some scenarios where you
need to deliver combined information to a wider audience, or to users who do not have the time or ability to
create their own complex data analysis solutions.
Alternatively, you might want to take advantage of the functionality available in corporate data modeling
platforms such as SQL Server Analysis Services (SSAS) to add value to the Big Data insights gained from
HDInsight. By integrating data from HDInsight into corporate data models you can accomplish both of these
aims, and use the data as the basis for enterprise reporting and analytics.

Integrating the output from HDInsight with your corporate data models allows you to use tools such as
SQL Server Analysis Services to analyze the data and present it in a format that is easy to use in reports, or for
performing deeper analysis.

You must choose between the multidimensional and the tabular data mode when you install
SQL Server Analysis Services, though you can install two instances if you need both modes.
When installed in tabular mode, SSAS supports the creation of data models that include data
from multiple diverse sources, including ODBC-based data sources such as Hive tables.
You can use the following techniques to integrate HDInsight results into a corporate data model:
Create Hive tables in HDInsight and consume them directly from a SQL Server Analysis Services (SSAS)
tabular model by using the ODBC driver for Hive. SSAS in tabular mode supports the creation of data
models from multiple data sources and includes an OLE DB provider for ODBC, which can be used as a
wrapper around the ODBC driver for Hive.
Create Hive tables in HDInsight and then create a linked server in the instance of the SQL Server
database source used by an SSAS multidimensional data model so that the Hive tables can be queried
through the linked server and imported into the data model. SSAS in multidimensional mode can only
use a single OLE DB data source, and the OLE DB provider for ODBC is not supported.
Use Sqoop or SSIS to copy the data from HDInsight to a SQL Server database engine instance that can
then be used as a source for an SSAS tabular or multidimensional data model.


When installed in multidimensional mode, SSAS data models cannot be based on ODBC sources due to some
restrictions in the designers for multidimensional database objects. To use Hive tables as a source for a
multidimensional SSAS model, you must either extract the data from Hive into a suitable source for the
multidimensional model (such as a SQL Server database), or use the ODBC driver for Hive to define a linked
server in a SQL Server instance that provides pass through access to the Hive tables, and then use the SQL
Server instance as the data source for the multidimensional model. You can download a case study that
describes how to create a multidimensional SSAS model that uses a linked server in a SQL Server instance to
access data in Hive tables.

Data Warehouse Level Integration
Integration at the corporate data model level enables you to include data from HDInsight in a data model that is
used for analysis and reporting by multiple users, without changing the schema of the enterprise data
warehouse. However, in some cases you might want to use HDInsight as a component of the overall ETL solution
that is used to populate the data warehouse, in effect using a source of Big Data queried through HDInsight just
like any other business data source in the BI solution, and consolidating the data from all sources to an
enterprise dimensional model.
Full integration with a data warehouse at the data level allows HDInsight to become just another data
source that is consolidated into the system, and which can be used in exactly the same way as data from other
sources. In addition, as with other data sources, it's likely that the data import process from HDInsight into the
database tables will occur on a schedule, depending on the time taken to execute the queries and perform ETL
tasks prior to loading it into the database tables, so that the data is as up to date as possible.
You can use the following techniques to integrate data from HDInsight into an enterprise data warehouse.
Use Sqoop to copy data from HDInsight directly into database tables. These might be tables in a SQL
Server data warehouse that you do not want to integrate into the dimensional model of the data
warehouse; or they may be tables in a staging database where the data can be validated, cleansed, and
conformed to the dimensional model of the data warehouse before being loaded into the fact and
dimension tables. Any firewalls located between HDInsight and the target database must be configured
to allow the database protocols that Sqoop uses.
Use PolyBase for SQL Server to copy data from HDInsight directly into database tables. These might be
tables in a SQL Server data warehouse that you do not want to integrate into the dimensional model of
the data warehouse; or they may be tables in a staging database where the data can be validated,
cleansed, and conformed to the dimensional model of the data warehouse before being loaded into the
fact and dimension tables. Any firewalls located between HDInsight and the target database must be
configured to allow the database protocols that SQL Server uses.
Create an SSIS package that reads the output file from HDInsight, or uses the ODBC driver for Hive to
extract data from HDInsight, and then validates, cleanses, and transforms the data before loading it into
the fact and dimension tables in the data warehouse.

PolyBase is a component of SQL Server Parallel Data Warehouse (PDW) and is available only
on PDW appliances. For more information see the PolyBase page on the SQL Server website.
When reading data from HDInsight you must open port 1000 in the HDInsight management
portal. When reading data from SQL Server using a connector you must enable the
appropriate firewall rules. For more information see Configure the Windows Firewall to
Allow SQL Server Access.
Data Sources for the HDInsight Integration with Enterprise BI Model
The input data for HDInsight can be almost anything, but for the HDInsight integration with enterprise BI model
they typically include the following:
Social media data, log files, sensors, and applications that generate data files.
Datasets obtained from Windows Azure Data Market and other commercial data providers.
Streaming data filtered or processed through SQL Server StreamInsight.
Output Targets for the HDInsight Integration with Enterprise BI Model
The results from your HDInsight queries can be visualized using any of the wide range of tools that are available
for analyzing data, combining it with other datasets, and generating reports. Typical examples for the HDInsight
integration with enterprise BI model are:
SQL Server Reporting Services.
SharePoint Server or another information management system.
Business performance dashboards such as PerformancePoint Services in SharePoint Server.
Interactive analytical tools such as Excel, PowerPivot, Power View, and GeoFlow.
Custom or third party analysis and visualization tools.

You will see more details of these tools in Chapter 5, Consuming and Visualizing the Results, of this guide.
Considerations for the HDInsight Integration with Enterprise BI Model
There are some important points to consider when choosing the HDInsight integration with enterprise BI model
for querying and visualizing data with HDInsight.
This model is typically used when you want to:
Integrate external data sources with your enterprise data warehouse.
Augment the data in your data warehouse with external data.
Update the data at scheduled intervals or on demand.
ETL processes in a data warehouse usually execute on a scheduled basis to add new data to the
warehouse. If you intend to integrate HDInsight into your data warehouse so that it updates the
information stored there, you must consider how you will automate and schedule the tasks of executing
the HDInsight query and importing the results.
You must ensure that data imported from HDInsight contains valid values, especially where there are
typically multiple common possibilities (such as in street addresses and city names). You may need to
use a data cleansing process such as Data Quality Services to force such values to the correct leading
value.
Most data warehouse implementations use slowly changing dimensions to manage the history of values
that change over time. Different versions of the same dimension member have the same alternate key
but unique surrogate keys, and so you must ensure that data imported into the data warehouse tables
uses the correct surrogate key value. This means that you must either:
Use some complex logic to match the business key in the source data (which will typically be the
alternate key) with the correct surrogate key in the data model when you join the tables. If you
simply join on the alternate key, some loss of data accuracy may occur because the alternate
key is not guaranteed to be unique.
Load the data into the data warehouse and conform it to the dimensional data model, including
setting the correct surrogate key values.
For more information about leading values and surrogate keys in an enterprise BI system see Appendix
B, An Overview of Enterprise Business Intelligence.

One of the trickiest parts of full data warehouse integration is matching rows imported from a Big Data
solution to the correct members in the data warehouse dimension tables. You must use a combination of the
alternate key and the point in time to which the imported row relates to look up the correct surrogate key. This
key can differ based on the date that changes were made to the original entity. For example, a product may have
more than one surrogate key over its lifetime, and the imported data must match the correct version of this key.
The following table summarizes the primary considerations for deciding whether to adopt one of the three
integration approaches described in this chapter. Each approach has its own benefits and challenges.
Integration level: None
Typical scenarios: No enterprise BI solution currently exists. The organization wants to evaluate HDInsight and
Big Data analysis without affecting current BI and business operations. Business analysts in the organization
want to explore external or unmanaged data sources not included in the managed enterprise BI solution.
Considerations: Open-ended exploration of unmanaged data could produce a lot of business information, but
without rigorous validation and cleansing the data might not be completely accurate. Continued experimental
analysis may become a distraction from the day-to-day running of the business, particularly if the data being
analyzed is not related to core business data.

Integration level: Report Level
Typical scenarios: A small number of individuals with advanced self-service reporting and data analysis skills
need to augment corporate data that is available from the enterprise BI solution with Big Data from other
sources. A single report combining corporate data and some external data is required for one-time analysis.
Considerations: It can be difficult to find common data values on which to join data from multiple sources to
gain meaningful comparisons. Integrated data in a report can be difficult to share and reuse, though the ability
to share PowerPivot workbooks in SharePoint Server may provide a solution for small to medium sized groups
of users.

Integration level: Corporate Data Model Level
Typical scenarios: A wide audience of business users must consume multiple reports that rely on data from
both the corporate data warehouse and HDInsight. These users may not have the skills necessary to create
their own reports and data models. Data from HDInsight is required for specific business logic in a corporate
data model that is used for reports and dashboards.
Considerations: It can be difficult to find common data values on which to join data from multiple sources.
Integrating data from HDInsight into tabular SSAS models can be accomplished easily through the ODBC driver
for Hive. Integration with multidimensional models is more challenging.

Integration level: Data Warehouse Level
Typical scenarios: The organization wants a complete, managed BI platform for reporting and analysis that
includes business application data sources as well as Big Data sources. The desired level of integration between
data from HDInsight and existing business data, and the required tolerance for data integrity and accuracy
across all data sources, necessitates a formal dimensional model with conformed dimensions.
Considerations: Data warehouses typically have demanding data validation requirements. Loading data into a
data warehouse that includes slowly changing dimensions with surrogate keys can require complex ETL
processes.

For an example of a scenario that uses the enterprise data warehouse and BI integration
architectural style see Scenario 4, Enterprise BI Integration, in Chapter 11 of this guide.
Summary
In this chapter you have seen the general use cases for using HDInsight as part of your business information and
data management systems. There are, of course, many specific use cases that fall within those identified in this
chapter. However, all are subcases of the models you have seen in this chapter, which are:
As a data analysis experimentation and visualization tool for handling data that you cannot process
using existing systems.
As a data transfer, data cleansing, or ETL mechanism to extract and transform data before you load it
into your existing databases or data visualization tools.
As a basic data warehouse or commodity storage mechanism to store both the source data and the
results of queries executed over this data, or as a data repository that is robust and reasonably low cost
to maintain.
Integration with an enterprise data warehouse and BI system at different levels, depending on the way
that you intend to use the data obtained from HDInsight.

Each model has different advantages and suits different scenarios. There is no single approach that is best, and
you must decide based on your own circumstances, the existing tools you use, and the ultimate aims you have
for your Big Data implementation. There's nothing to prevent you from using it in all of the ways you have seen
in this chapter!
You have also seen how analytics and data visualization are vitally important parts of most organizations' IT
systems. Organizations typically require their databases to power the daily processes of taking orders,
generating invoices, paying employees, and a myriad other tasks. However, accurate reporting information and
analysis of the data the organization holds and collects from external sources is becoming equally vital to
maintain competitiveness, minimize costs, and successfully exploit business opportunities.
More Information
For more information about HDInsight, see the Windows Azure HDInsight web page.
For more details of how HDInsight uses Windows Azure blob storage, see Using Windows Azure Blob Storage
with HDInsight on the Windows Azure HDInsight website.
For more information about Microsoft SQL Server data warehouse solutions see Data Warehousing on the SQL
Server website.
The document Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS) contains a wealth of
useful information about using SSIS with HDInsight.

Chapter 3: Collecting and Loading Data
This chapter explores how you can get data into your Big Data solution. It focuses primarily on Windows Azure
HDInsight, but many of the techniques you will see described here are equally relevant to Big Data solutions
built on other Big Data frameworks and implementations.
The chapter starts with an exploration of several different but typical data ingestion patterns that are generally
applicable to any Big Data solution. These patterns include ways to handle streaming data and techniques for
automating the ingestion process. The chapter then describes how you might implement these patterns when
using Windows Azure HDInsight, including details of how you can apply compression to the source data before
you load it.
Subsequent chapters of this guide explore techniques for executing queries after you have loaded the data, and
consuming the results of your queries.
Challenges
The first stage in using a Big Data solution to analyze your data is to load that data into the cluster storage. In
some cases, such as when the data source is an internal business application or database, extracting the data
into a text file that can be consumed by your solution is relatively straightforward. In the case of external data
obtained from sources such as governments and commercial data providers, the data is often available for
download in text format. However, in many cases you may need to extract data through a web service or other
API.
As you saw in the section Data Storage in Windows Azure HDInsight in Chapter 2, HDInsight for Windows
Azure uses an HDFS-compatible storage implementation that stores the data in Windows Azure blob storage as
an Azure Storage Vault (ASV), instead of on each server in the cluster. The data for an HDInsight cluster is
transferred from blob storage to each data node in the cluster when you execute a query.
Chapter 2 explains why Windows Azure HDInsight uses blob storage for the cluster data. One advantage
is that you can upload data to HDInsight before you define and initialize your cluster, or when the cluster is not
running, which helps to minimize the runtime costs of the cluster.
Generally, you submit source data in the form of a text file, often delimited but sometimes completely
unstructured. Source files for data processing in HDInsight are uploaded to folders in Windows Azure blob
storage, from where HDInsight map/reduce jobs can load and process them. Usually this is the ASV container
associated with the cluster, although you can upload the files first and then associate a new cluster with the
blob container.
There are several challenges that you face when loading the data into cluster storage. First, unless you are
simply experimenting with small datasets, you must consider how you will handle very large volumes of data.
While small volumes of data can be copied into Windows Azure blob storage interactively by using the HDInsight
interactive console, Visual Studio, or other tools, you will need to build a more robust solution capable of
handling large files when you move beyond the experimentation stage.
You may also need to load data from a range of different data sources such as websites, RSS feeds, custom
applications and APIs, relational databases, and more. It's vital to ensure that you can submit this data efficiently
and accurately to cluster storage, including performing any preprocessing that may be required on the data to
convert it into a suitable form.
For example, you may want to split the data into smaller files for easier selection and uploading, such as creating
separate files for each month in the data, which can reduce the processing and file handling required for some
queries. However, using small files can reduce performance, and so if you have a lot of small files you may want
to combine these into larger files to make better use of the splitting mechanisms that Hadoop uses when reading,
sorting, and writing data, which can improve performance in some cases.
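For example, if small daily log files have already been uploaded to a folder in the cluster storage, you could use
the standard Hadoop file system commands to merge them into a single monthly file before processing. The
folder and file names below are illustrative only.

REM Merge the small daily files into one local file, then upload the combined file.
hadoop fs -getmerge asv:///input/daily-logs C:\temp\logs-2013-03.txt
hadoop fs -copyFromLocal C:\temp\logs-2013-03.txt asv:///input/monthly/logs-2013-03.txt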
You may also need to format individual parts of the data by, for example, combining fields in an address,
removing duplicates, converting numeric values to their text representation, or changing date strings to
standard numerical date values. The operations that you need to carry out might depend on the original data
source and format, the type of query that you intend to perform, and the format you require for the results.
You may even want to perform some automated data validation and cleansing by using a technology such as
SQL Server Data Quality Services before submitting the data to cluster storage. For example, you might need to
convert different versions of the same value into a single leading value (such as changing "NY" and "Big Apple"
into "New York").
However, keep in mind that many of these data preparation tasks may not be practical, or even possible, when
you have very large volumes of data. They are more likely to be possible when you stream data from your data
sources, or extract it in small blocks on a regular basis. Where you have large volumes of data to process you will
probably perform these preprocessing tasks within your Big Data solution as the initial steps in a series of
transformations and queries.
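Where this preprocessing is performed inside the cluster, it can often be expressed as the first query in the
sequence. The following HiveQL sketch (the table and column names are hypothetical) shows a simple
normalization step that maps variant city names to a single leading value:

-- First pass over the raw data: normalize city names before further processing.
INSERT OVERWRITE TABLE cleansed_visits
SELECT visit_id,
       CASE WHEN city IN ('NY', 'Big Apple') THEN 'New York' ELSE city END AS city,
       visit_date
FROM raw_visits;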
Be careful when removing information from the data; if possible keep a copy of the
original files. You may subsequently find the fields you removed are useful as you refine
queries, or if you use the data for a completely different analytical task.
You must also consider security and efficiency, both of the upload process and the storage itself. You can use SSL
when uploading data to protect it on the wire. If you need to encrypt the data during the upload process, and
perhaps also while in storage, you must consider how, when, and where you will decrypt it, and how you will
encrypt the results if this is also a requirement. In addition to encrypting the data, you may also need to
compress it to minimize storage use, and to minimize data transfer time and cost.
Finally, you must consider how the input data will arrive for processing. It may be in the form of files that are
ready to be uploaded (after any pre-processing that is required), but in some cases you may need to handle data
that arrives as a stream. This can impose challenges because Hadoop is a batch processing job engine. It means
that you must be able to convert and buffer the incoming data so that it can be processed in batches.

This chapter addresses the challenges described here. It covers:
Common patterns for loading data into a Big Data solution such as HDInsight.
Tools and technologies for HDInsight data ingestion.
Considerations for uploading data to HDInsight, such as compression and the use of custom upload
tools.

Common Data Load Patterns for Big Data Solutions
This section describes some common use cases for obtaining source data and uploading it to your cluster
storage. The patterns are:
Interactive data ingestion. This approach is typically used for small volumes of data and when
experimenting with datasets.
Automated batch upload. If you decide to integrate a Big Data query process with your existing systems
you will typically create an automated mechanism for uploading the data into the cluster storage.
Real-time data stream capture. If you need to process data that arrives as a stream you must
implement a mechanism that loads this data into cluster storage.
Loading relational data. Sometimes you might need to load data into a Big Data solution from a
relational database. There are several tools you can use to achieve this.
Loading web server log files. Big Data solutions are often used to analyze server log files. You must
decide how you will transfer these files into cluster storage for processing.

The following sections of this chapter describe each of these patterns in more detail. A suggested architectural
implementation is described for each pattern. Practical implementations of the first three patterns in this list,
based on collecting and loading data from Twitter, are described in Appendix A of this guide.
For information about how you can upload data directly into Hive external tables, see the section Processing
Data with Hive in Chapter 4.
Interactive Data Ingestion
An initial exploration of data with a Big Data solution often involves experimenting with a relatively small
volume of data. In this case there is no requirement for a complex, automated data ingestion solution, and the
tasks of obtaining the source data, preparing it for processing, and uploading it to cluster storage can be
performed interactively.
There are several ways you can upload data to your HDInsight cluster. However, creating automated
processes to do this is probably only worthwhile when you know you will repeat the operation on a regular basis.
If the source data is not already available as an appropriately formatted file, you can use Excel to extract a
relatively small volume of tabular data from a data source, reformat it as required, and save it as a delimited text
file. Excel supports a range of data sources, including relational databases, XML documents, OData feeds, and
the Windows Azure Data Market. You can also use Excel to import a table of data from any website, including an
RSS feed. In addition to the standard Excel data import capabilities, you can use add-ins such as the Data
Explorer to import and transform data from a wide range of sources.
After you have used Excel to prepare the data and save it as a text file, you can upload it using the hadoop dfs -
copyFromLocal [source] [destination] command at the Hadoop command line, or upload it to an HDInsight
cluster by using the fs.put() command in the interactive console. This capability is useful if you are just
experimenting or working on a proof of concept. If you need to upload larger files you can use a
command line tool such as AzCopy to upload data to ASV storage, a UI-based tool such as Windows Azure
Storage Explorer, a script that uses the Windows Azure PowerShell Cmdlets, or Server Explorer in Visual Studio.
This approach for HDInsight is shown in Figure 1.

Figure 1
Interactive Data Ingestion
Automated Batch Upload
When a Big Data solution grows beyond the initial exploration stages it is common to automate data ingestion
so that data can be extracted from sources and loaded into the cluster on a scheduled basis without human
intervention. SQL Server Integration Services (SSIS) provides a platform for building automated Extract,
Transform, and Load (ETL) workflows that perform the necessary tasks to extract source data, apply
transformations, and upload the resulting data to the cluster.
An SSIS solution consists of one or more packages, each containing a control flow to perform a sequence of
tasks. Tasks in a control flow can include calls to web services, FTP operations, file system tasks, command line
tasks, and others. In particular, a control flow usually includes one or more data flow tasks that encapsulate an
in-memory, buffer-based pipeline of data from a source to a destination, with transformations applied to the
data as it flows through the pipeline.
You can create an SSIS package that extracts data from a source and uploads it to the cluster, and then
automate execution of the package using the SQL Server Agent service. The package usually consists of a data
flow to extract the data from its source through an appropriate source adapter, and then apply any required
transformations to the data. To get the data into the appropriate destination folder you can use one of the
following options:
Procure or implement a custom destination adapter that programmatically writes the data to the
cluster, and add it to the data flow.
Use a task in the data flow in the SSIS package to save the data to a local file, and then use a
command line task in the control flow to automate a file upload utility that uploads the file.

As an alternative you can use scripts, tools, or utilities to perform the upload, perhaps driven by a scheduler
such as Windows Scheduled Tasks. Some options for this approach are:
Use the Hadoop .Net HDFS File Access library that is installed in the C:\Apps\dist\contrib\WinLibHdfs
folder of HDInsight.
Use a third party utility or library that can upload files to Windows Azure blob storage.
Use the Windows Azure blob storage API. For more details, see How to use the Windows Azure Blob
Storage Service in .NET.
Create a custom utility using the .NET SDK for Hadoop. For more details, see the section Using the
WebHDFSClient in the Microsoft .NET SDK for Hadoop later in this chapter.

Figure 2 shows an automated batch upload architecture for HDInsight.

Figure 2
Automated batch upload with SSIS
For more information about creating ETL solutions with SSIS, see SQL Server Integration Services in SQL Server
Books Online. There is also a set of sample components available to simplify creating SSIS packages for
uploading data to Windows Azure blob storage. For more information, see Azure Blob Storage Components for
SSIS Sample.

If you already use SQL Server Integration Services as part of your BI system you should consider creating
packages that automate the uploading of data to HDInsight for regular analytical processes.

Real-Time Data Stream Capture
In some scenarios the data that must be analyzed is available only as a real-time stream of events; for example,
from a streaming web feed or electronic sensors. Microsoft StreamInsight is a complex event processing (CEP)
engine with a framework API for building applications that consume and process event streams. Big Data
solutions are optimized for batch processing of large volumes of data, but you can use StreamInsight to capture
events that are redirected to a real-time visualization solution and are also captured for submission to the Big
Data solution storage for historic analysis.
StreamInsight is available as an on-premises solution, and is under development as a Windows Azure service.
Regardless of the implementation you decide to use, a StreamInsight application consists of a standing query
that uses LINQ syntax to extract events from a data stream. Events are implemented as classes or structs, and
the properties defined for the event class provide the data values for visualization and analysis. You can create a
StreamInsight application by using the Observable/Observer pattern, or by implementing input and output
adapters from base classes in the StreamInsight framework.

Big Data solutions cannot handle streaming data sources; they work by querying across distributed files
stored on disk. StreamInsight is one solution for converting incoming stream data into files that can be included
in a batch process such as an HDInsight query, although you can create your own mechanism for capturing and
storing data from a stream.
When using StreamInsight to capture events for analysis in a Big Data solution you can implement code to write
the captured events to a local file for batch upload to the cluster storage, or you can implement code that writes
the event data directly to the cluster storage. Your code can append each new event to an existing file, create a
new file for each individual event, or periodically create a new file based on temporal windows in the event
stream.
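The following minimal sketch illustrates the last of these options: periodically starting a new output file based on a fixed-length time window. It does not use the StreamInsight API itself; the EventCapture class, its member names, and the choice of window length are illustrative assumptions, and the class could be called from a StreamInsight observer or output adapter, or from any other code that receives the stream.
C#
using System;
using System.IO;

// Writes each captured event to a local file, starting a new file for each
// fixed-length time window so that completed files can be uploaded to
// cluster storage as part of a batch process.
public class EventCapture
{
    private readonly string outputFolder;
    private readonly TimeSpan window;
    private StreamWriter writer;
    private DateTime windowStart;

    public EventCapture(string outputFolder, TimeSpan window)
    {
        this.outputFolder = outputFolder;
        this.window = window;
    }

    public void WriteEvent(string eventData)
    {
        var now = DateTime.UtcNow;
        if (this.writer == null || now - this.windowStart >= this.window)
        {
            // Close the file for the previous window; it is now ready for upload.
            if (this.writer != null) this.writer.Dispose();
            this.windowStart = now;
            var fileName = string.Format("events_{0:yyyyMMdd_HHmmss}.txt", now);
            this.writer = new StreamWriter(Path.Combine(this.outputFolder, fileName));
        }
        this.writer.WriteLine(eventData);
    }

    // Flushes and closes the file for the current window.
    public void Close()
    {
        if (this.writer != null) this.writer.Dispose();
        this.writer = null;
    }
}
Each completed file can then be uploaded to cluster storage using any of the techniques described in this chapter, such as a scheduled script or a custom upload utility.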

Figure 3 shows a real-time data stream capture architecture for HDInsight.

Figure 3
Real-time data stream capture with StreamInsight
For an example of using StreamInsight, see Appendix A of this guide. For more information about developing
StreamInsight applications, see Microsoft StreamInsight on MSDN.
Loading Relational Data
Source data for analysis is often obtained from a relational database. To support this common scenario you can
use Sqoop to extract the data you require from a table, view, or query in the source database and save the
results as a file in your cluster storage.
In HDInsight this approach makes it easy to transfer data from a relational database when your infrastructure
supports direct connectivity from the HDInsight cluster to the database server, such as when your database is
hosted in Windows Azure SQL Database or a virtual machine in Windows Azure.
Approaches for loading relational data into HDInsight are shown in Figure 4.


Figure 4
Loading relational data into HDInsight
It's quite common to use HDInsight to query data extracted from a relational database.
For example, you may use it to search for hidden information in data exported from your
corporate database or data warehouse as part of a sandboxed experiment, without
consuming resources from the database or risking damage to the existing data.
Loading Web Server Log Files
Analysis of web server log data is another common use case for Big Data solutions, and requires that log files be
uploaded to the cluster storage. Flume is an open source project for an agent-based framework to copy log files
to a central location, and is often used to load web server logs to HDFS for processing in Hadoop.
Flume is not included in HDInsight, but can be downloaded from the Flume project web site and installed on
Windows. As an alternative to using Flume, you can use SSIS to implement an automated batch upload solution,
as described earlier.
Options for uploading web server log files in HDInsight are shown in Figure 5.

Figure 5
Loading web server logs into HDInsight
Uploading Data to HDInsight
The preceding sections of this chapter describe the general patterns for uploading data to the cluster storage of
a Big Data solution. The interactive data ingestion approach described earlier in this chapter is fine for
experimentation and carrying out one-off analysis of datasets. However, when you have a large number of files
to upload, or you want to implement a repeatable solution, using the interactive console is not practical.
Instead, you may want to take advantage of third party tools, or create your own custom mechanisms to upload
data. In addition, in some situations you may need to encrypt the data in storage, or automatically compress it
as part of an automated upload process. HDInsight for Windows Azure uses Windows Azure blob storage to host
the data. In this section you will see some of the practical considerations and implementation guidance for
automating data upload to Windows Azure blob storage. You will find more details about compressing the data
later in this chapter.
Tools and Technologies for HDInsight Data Ingestion
You can create custom applications or scripts to extract source data, or use off-the-shelf tools and platforms.
Common tools on the Microsoft platform for extracting data from a source include Excel, SQL Server Integration
Services (SSIS), and StreamInsight. To upload the data to HDInsight you can use a variety of file upload tools such
as AzCopy (which can be used as a standalone command line tool or as a library from your own code), or you can
build a custom solution.
The Hadoop ecosystem includes tools that are specifically designed to simplify the loading of particular kinds of
data into cluster storage. Sqoop provides a mechanism for transferring data between a relational database and
cluster storage, and Flume provides a framework for copying web server log files to an HDFS location in cluster
storage.
For a useful list of third party tools, see How to Upload Data to HDInsight on the Windows
Azure website. This page also demonstrates how to use the Windows Azure Storage Explorer.
When experimenting with HDInsight, remember that you can access Windows Azure storage
from Server Explorer after you install the Windows Azure SDK for Visual Studio. You can also
use the Windows Azure PowerShell Cmdlets in your scripts to upload data to Windows Azure
blob storage.

Consider the following points when you are planning your data upload strategy:
Many of the HDInsight tools (such as Hive and Pig) use the complete contents of a folder as a data
source, rather than an individual file, so you can split your data across multiple files in the same folder
and still be able to process it in a single query. However, as a general rule it is usually more efficient to
consolidate your data into larger files before uploading them if you have a very large number of small
files, or append new data to existing files. The default block size that HDInsight uses to distribute data
for processing in parallel across multiple cluster nodes is 256 MB, although this can be changed by
setting the dfs.block.size property. In some cases (such as when processing image files) it may not be
possible to combine small files.
If possible, choose a utility that can upload data in parallel using multiple threads to reduce upload
time; a minimal sketch of this approach is shown at the end of this section. Some utilities may be able to
upload multiple blocks or small files in parallel and then combine them into larger files after they have
been uploaded. The optimum number of threads is typically between 8 and 16, though you should choose a
number based on the number of logical processors and the bandwidth available. An excessive number of
concurrent operations will have an adverse effect on overall system performance due to thread pool
demands and frequent context switching.
You can often improve the efficiency and reduce the runtime of jobs that use large volumes of data by
compressing the data you upload. You can do this before you upload the data, or within a custom utility
as part of the upload process.
Consider using SSL for the connection to your storage account to protect the data on the wire. The
asvs:// URI scheme provides SSL-encrypted access, which is recommended.
If you are handling sensitive data that must be encrypted you will need to write custom serializer and
deserializer classes and install these in the cluster for use as the SerDe parameter in Hive statements, or
create custom map/reduce components that can manage the serialization and deserialization.

Combining small files into larger ones, or appending data to existing files, can help to reduce the work
HDInsight needs to do when managing the files and moving them between the cluster nodes. However, this may
not be practical for some types of files.
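The following minimal sketch shows the parallel upload approach mentioned in the list above. It assumes a helper method named UploadFile that transfers a single file using whichever tool or library you choose; the folder path and the bound on the number of concurrent operations are illustrative.
C#
using System.IO;
using System.Threading.Tasks;

public static class ParallelUploader
{
    // Uploads all of the files in a local folder using a bounded number of
    // concurrent operations to reduce the overall upload time.
    public static void UploadFolder(string localFolder, int maxConcurrentUploads)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxConcurrentUploads };
        Parallel.ForEach(Directory.GetFiles(localFolder), options, file =>
        {
            UploadFile(file);  // assumed helper that uploads a single file
        });
    }

    private static void UploadFile(string path)
    {
        // Replace this with a call to your chosen upload library or utility.
    }
}
For example, calling ParallelUploader.UploadFolder(@"C:\data", 8) would upload the files using up to eight concurrent operations, which is within the range suggested above.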
Implementing a Custom Data Upload Utility
You may need to create your own custom upload utility if you have special requirements, such as uploading
from sources other than local files, which cannot be met by third-party tools. If so, consider the following points:
Implement your solution as a library that can be called programmatically from scripts or from custom
components in an SSIS data flow or a StreamInsight application. This will ensure that the solution
provides flexibility for use in future Big Data projects as well as the current one.
Create a command line utility that uses your library so that it can be used interactively or called from an
SSIS control flow.
Include a robust connection retry strategy to maximize reliability. The storage client provided in the
Windows Azure SDK uses a built-in retry mechanism. Alternatively, you can use a library such as the
Enterprise Library Transient Fault Handling Application Block to implement a comprehensive
configurable approach if you prefer.
If you need to upload very large volumes of data you must be aware of Windows Azure blob storage
limitations such as the maximum blob, container, and storage account sizes, and the maximum data writing speed.
Investigate the capabilities of existing libraries such as the Windows Azure PowerShell Cmdlets, the
Windows Azure storage client library (a simple example is shown below), and the .NET SDK for Hadoop.
An example solution named BlobManager is included in the Hands-on Labs you can download for this
guide. You can use it as a starting point for your own custom utilities. Download the labs from
http://wag.codeplex.com/releases/view/103405. The source code is in the Lab 4 - Enterprise BI
Integration\Labfiles\BlobUploader subfolder.
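As a simple illustration of using the Windows Azure storage client library, the following sketch uploads a single local file as a block blob. It assumes version 2.x of the storage client library; the connection string, container name, and blob path are assumptions that you must replace with your own values. If you upload to the container used by your HDInsight cluster, the blob path corresponds to the path that jobs see in ASV storage.
C#
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class BlobUploader
{
    // Uploads a local file to a Windows Azure blob storage container so that
    // it can be consumed by an HDInsight cluster that uses the same container.
    public static void Upload(string connectionString, string containerName,
                              string blobPath, string localFile)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        var blobClient = account.CreateCloudBlobClient();  // uses the library's built-in retry mechanism by default

        var container = blobClient.GetContainerReference(containerName);
        container.CreateIfNotExists();

        var blob = container.GetBlockBlobReference(blobPath);
        using (var stream = File.OpenRead(localFile))
        {
            blob.UploadFromStream(stream);
        }
    }
}
A production version would typically add logging, parallel uploads, and a configurable retry strategy such as the Transient Fault Handling Application Block mentioned earlier.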

For information about the limitations of Windows Azure blob storage, see Understanding Block Blobs and Page
Blobs on MSDN.
Using the WebHDFSClient in the Microsoft .NET SDK for Hadoop
The Microsoft .NET SDK for Hadoop provides the WebHDFSClient client library, which enables you to create
custom solutions for uploading and manipulating files in the HDFS file system. By using this library you can
implement managed code that integrates processes of obtaining data from a source and uploading it to the
HDFS file system of a Hadoop cluster, which is implemented as a Windows Azure blob storage container in
HDInsight. Figure 6 shows the relevant parts of the .NET SDK for Hadoop.


Figure 6
The WebHDFSClient is part of the .NET SDK for Hadoop
To use the WebHDFSClient class you must use NuGet to import the Microsoft.Hadoop.WebClient package in
the Package Manager PowerShell console as shown here.
PowerShell
install-package Microsoft.Hadoop.WebClient
With the package and all its dependencies imported, you can create a custom application that automates HDFS
tasks in the cluster. For example, the following code sample shows how you can use the WebHDFSClient class to
upload a data file to the cluster.
C#
namespace MyWebHDFSApp
{
    using System;
    using Microsoft.Hadoop.WebHDFS;
    using Microsoft.Hadoop.WebHDFS.Adapters;

    public class Program
    {
        private const string HadoopUser = "your-user-name";
        private const string AsvAccount = "your-asv-account-name";
        private const string AsvKey = "your-storage-key";
        private const string AsvContainer = "your_container";

        static void Main(string[] args)
        {
            var localFile = @"C:\data\data.txt";
            var storageAdapter = new BlobStorageAdapter(AsvAccount, AsvKey, AsvContainer, true);
            var hdfsClient = new WebHDFSClient(HadoopUser, storageAdapter);

            // upload data using WebHDFSClient
            Console.WriteLine("Creating folder and uploading data...");
            hdfsClient.CreateDirectory("/mydata").Wait();
            hdfsClient.CreateFile(localFile, "/mydata/data.txt").Wait();

            Console.WriteLine("\nPress any key to continue.");
            Console.ReadKey();
        }
    }
}
Compressing the Source Data
One technique that can, in many cases, improve query performance and reduce execution time for large source
files is to compress the source data before, or as you upload it to your cluster storage. This reduces the time it
takes for each node in the cluster to load the data from storage into memory by reducing I/O and network
usage. However, it does increase the processing overhead for each node, and so compression cannot be
guaranteed to reduce execution time.
Compressing the source data before, or as you load it into HDInsight can reduce query execution times
and improve query efficiency. If you intend to use the query regularly you should test both with and without
compression to see which provides the best performance, and retest over time if the volume of data you are
querying grows. However, before you experiment with compressing the source data for a job, apply compression
to the mapper and reducer outputs. This is easy to do and it may provide a larger gain in performance. It might
be all you need to do in order to achieve acceptable performance.
You must experiment with and without compression, and perhaps combine the tests with other changes to the
configuration, to determine the optimum solution. For example, changing the number of mappers or the total
number of cluster nodes, and enforcing compression within the process as well as compressing the source data,
may combine to provide the optimum results. Compression typically provides the best performance gains for jobs that are not
CPU intensive.
For information about maximizing runtime performance of queries, see the section Tuning the Solution in
Chapter 4. For information about how you can set configuration values in a query, see the section Configuring
HDInsight Clusters and Jobs in Chapter 7.
Compression Libraries Available in HDInsight
At the time of writing, HDInsight contains three compression codecs, though you can install others if you wish.
The following table shows the class name of each codec, the standard file extension for files compressed with
the codec, and whether the codec supports split file compression and decompression.
Format Codec Extension Splittable
DEFLATE org.apache.hadoop.io.compress.DefaultCodec .deflate No
GZip org.apache.hadoop.io.compress.GzipCodec .gz No
BZip2 org.apache.hadoop.io.compress.BZip2Codec .bz2 Yes
A codec that supports splittable compression and decompression allows HDInsight to decompress the data in
parallel across multiple mapper and node instances, which typically provides better performance. However,
splittable codecs are less efficient at runtime, so there is a trade-off between the two types of codec. There is
also a difference in the size reduction (compression rate) that each codec can achieve. For the same data, BZip2
tends to produce a smaller file than GZip but takes longer to perform the decompression.
Considerations for Compressing the Source Data
Consider the following points when you are deciding whether to compress the source data:
Compression may not produce any improvement in performance with small files. However, with very
large files (for example, files over 100 GB) compression is likely to provide dramatic improvement. The
gains in performance also depend on the contents of the file and the level of compression that was
achieved.
When optimizing a job, enable compression within the process using the configuration settings to
compress the output of the mappers and the reducers before experimenting with compression of the
source data. Compression within the job stages often provides a more substantial gain in performance
compared to compressing the source data.
Consider using a splittable algorithm such as BZip2 for very large files so that they can be decompressed
in parallel by multiple tasks.
Use the default file extension for the files if possible. This allows HDInsight to detect the file type and
automatically apply the correct decompression algorithm. If you use a different file extension you must
set the io.compression.codec property for the job or for the cluster to indicate the codec used.

For more information about choosing a compression codec and compressing data see the Microsoft White Paper
Compression in Hadoop. For more information about using the compression codecs in Hadoop see the
documentation for the CompressionCodecFactory and CompressionCodec classes on the Apache website.
Techniques for Compressing Source Data
To compress your data, you can usually use the tools provided by the codec supplier. For example, the
downloadable libraries for both GZip and BZip2 include tools that run on Windows and can help you apply
compression. For more details see the distribution sources for GZip and BZip2 on Source Forge.
You can also use the classes in the .NET Framework to perform GZip and DEFLATE compression on your source
files, perhaps by writing command line utilities that are executed as part of an automated upload and processing
sequence. For more details see the System.IO.Compression.GZipStream and
System.IO.Compression.DeflateStream reference sections on MSDN.
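For example, the following minimal sketch uses the GZipStream class to compress a file before upload. The file paths are illustrative; giving the output the standard .gz extension allows HDInsight to detect the codec automatically, as described earlier in this chapter.
C#
using System.IO;
using System.IO.Compression;

public static class FileCompressor
{
    // Compresses a source file using GZip, producing a .gz file that
    // HDInsight can decompress automatically based on its file extension.
    public static void GZipFile(string sourcePath, string destinationPath)
    {
        using (var source = File.OpenRead(sourcePath))
        using (var destination = File.Create(destinationPath))
        using (var gzip = new GZipStream(destination, CompressionMode.Compress))
        {
            source.CopyTo(gzip);
        }
    }
}
For example, calling FileCompressor.GZipFile(@"C:\data\data.txt", @"C:\data\data.txt.gz") produces a compressed copy that can be uploaded in place of the original file.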
Another alternative is to use the tools included with Hadoop. For example, you can create a query job that is
configured to write output in compressed form using one of the built-in codecs, and then execute the job
against existing uncompressed data in storage so that it selects all or some part of the source data and writes it
back to storage in compressed form. For an example of using Hive to do this see the Microsoft White Paper
Compression in Hadoop.
Summary
This chapter described the typical patterns and techniques you can use to load data into a Big Data solution,
focusing particularly on loading a cluster in Windows Azure HDInsight, and the tools available to do this. It
described interactive and automated approaches, real-time data stream capture, and some options for loading
relational data and server log files.
The chapter also explored some techniques for creating your own custom utilities to upload data, using third
party tools and libraries, and additional considerations such as compressing the source data.
In the next chapter you will see how you can execute queries and transformations on the data you have loaded
into your cluster.
More Information
For more information about HDInsight, see the Windows Azure HDInsight web page.
For a guide to uploading data to HDInsight, and some of the tools available to help, see How to Upload Data to
HDInsight on the Windows Azure HDInsight website.
For more details of how HDInsight uses Windows Azure blob storage, see Using Windows Azure Blob Storage
with HDInsight on the Windows Azure HDInsight website.
For more information about the Microsoft .NET SDK for Hadoop, see the Codeplex project site.

Chapter 4: Performing Queries and
Transformations
After you have loaded your data into HDInsight you can begin generating the query results for analysis. This
chapter describes the techniques and tools that you can use in Windows Azure HDInsight to create queries and
transformations, and gives a broad overview of how they work and how you can use them. It will help you to
understand the options you have, and to choose the most appropriate ones. The chapter also explores the two
subsequent solution development stages of evaluating the results and tuning the solution.
Challenges
Big Data frameworks offer a huge range of tools that you can use with the Hadoop core engine, and choosing
the most appropriate can be difficult. Matching compatible versions that will work well together is particularly
challenging. However, Windows Azure HDInsight simplifies the process because all of the tools it includes are
guaranteed to be compatible and work correctly together. That doesn't mean you can't incorporate other tools
and frameworks in your solution, but doing so may involve considerable effort installing them and integrating
them with your process.
A second common challenge is in deciding which query tools available in HDInsight are the best choices for your
specific tasks. At the time of writing, early in the evolution stage of HDInsight, the most popular query tool was
Hive; around 80% of queries were written in HiveQL. No doubt this is partly due to familiarity with SQL-like
languages. However, there are good reasons, as you will discover in this chapter, to consider using Pig or writing
your own map and reduce components in some circumstances.
Of course, creating the query is only one part of an HDInsight solution. You must consider what you want to do
with the data you extract, and decide on the appropriate format for this data. How will you analyze it? What
tools will you use to consume it? Do you need to load it into another database system for further processing, or
save it to Windows Azure blob storage for use in another HDInsight query? The answers to these questions will
help you to select the appropriate output format, and also contribute to the decision on which HDInsight query
tool to use.
The challenges don't end with simply writing and running a query. It's vitally important to check that the results
are realistic, valid, and useful before you invest a lot of time and effort (and cost) in developing and extending
your solution. As you saw in Chapter 1, a common use of Big Data techniques is simply to experiment with data
to see if it can offer insights into previously undiscovered information. As with any investigational or
experimental process, you need to be convinced that each stage is producing results that are both valid
(otherwise you gain nothing from the answers) and useful (in order to justify the cost and effort).
HDInsight is a great tool for experimenting with data to find hidden insights, but you must be sure that
the results you get from it are accurate and valid, as well as being useful. After you are convinced of the validity
of the process and the results you can consider incorporating the process into your organization's BI systems.
Finally, when you are convinced that your queries are producing results that are both valid and useful, you must
decide whether you want to repeat the process on a regular basis. At this point you should consider whether
you need to fine tune the queries to maximize performance, to minimize load on the cluster, or simply to reduce
the time they take to run. You may even decide to fine tune your queries during the experimentation stage if
they are taking a long time to run, even with a subset of data, and you need the results more quickly.
This chapter addresses all of the challenges described here. It covers:
Processing the data.
Evaluating the results.
Tuning the solution.

Processing the Data
As you saw in Chapter 1, all HDInsight processing consists of a map job in which multiple cluster nodes process
their own subset of the data, and a reduce job in which the results of the map jobs are consolidated to create a
single output. You can implement map/reduce code to perform the specific processing required for your data, or
you can use a query interface such as Pig or Hive to write queries that are translated to map/reduce jobs and
executed.
Hive is the most commonly used query tool in HDInsight. The SQL-like syntax of queries is familiar and the
table format of the output is ideal for use in most scenarios. However, for complex querying processes you might
need to execute a Pig script or map/reduce components first and then layer a Hive table on top of the results.
While Hive was the most commonly used tool in HDInsight when this guide was being written, many HDInsight
processing solutions are actually incremental in nature; they consist of multiple queries, each operating on the
output of the previous one.
These queries may use different query tools; for example, you might first use a custom map/reduce job to
summarize a large volume of unstructured data, and then create a Pig script to restructure and group the data
values produced by the initial map/reduce job. Finally, you might create Hive tables based on the output of the
Pig script so that client applications such as Excel can easily consume the results. Alternatively, you may create a
query that consists just of a series of cascading custom map/reduce jobs.

In this section you will see descriptions of these three HDInsight query techniques in order of increasing
complexity, and simple examples of using them. The three approaches are:
Using Hive by writing queries in HiveQL.
Using Pig by writing queries in Pig Latin.
Writing custom map and reduce components, including using the streaming interface and writing
components in .NET languages.


Typically, you will use the simplest of these approaches that can provide the results you require. It may be that
you can achieve these results just by using Hive, but for more complex scenarios you may need to use Pig or
even write your own map and reduce components. You may also decide, after experimenting with Hive or Pig,
that custom map and reduce components can provide better performance by optimizing the processing. You will
see more on this topic later in the chapter.
You'll also see information in the following sections about:
How to choose the appropriate query type for different scenarios.
How you can create and use a user-defined function (UDF) in a query.
How you can unify and stabilize query jobs by using HCatalog.


Processing Data with Hive
Fundamentally, all HDInsight processing is performed by running map/reduce batch jobs. Hive is an abstraction
layer over map/reduce that provides a query language called HiveQL, which is syntactically very similar to SQL
and supports the ability to create tables of data that can be accessed remotely through an ODBC connection.
In effect, Hive enables you to create an interface layer over map/reduce that can be used in a similar way to a
traditional relational database. Business users can use familiar tools such as Microsoft Excel and SQL Server
Reporting Services to consume data from HDInsight in a similar way as they would from a database system such
as SQL Server. Installing the ODBC driver for Hive on a client computer enables users to connect to an HDInsight
cluster and submit HiveQL queries that return data to an Excel worksheet, or to any other client that can
consume results through ODBC.
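For example, a custom .NET client can use the standard ADO.NET ODBC classes to execute a HiveQL query through the Hive ODBC driver. The following minimal sketch assumes that a system DSN named HiveDSN has already been configured for the driver, and uses an illustrative table and query; the available connection string attributes depend on the driver version you install.
C#
using System;
using System.Data.Odbc;

public class HiveOdbcExample
{
    static void Main()
    {
        // Assumes a system DSN named "HiveDSN" configured for the Hive ODBC driver.
        using (var connection = new OdbcConnection("DSN=HiveDSN"))
        {
            connection.Open();
            var hiveQL = "SELECT col1, SUM(col2) AS total FROM mytable GROUP BY col1";
            using (var command = new OdbcCommand(hiveQL, connection))
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0}: {1}", reader["col1"], reader["total"]);
                }
            }
        }
    }
}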
Hive uses tables to impose a schema on data, and to provide a query interface for client applications. The key
difference between Hive tables and those in traditional database systems, such as SQL Server, is that Hive
adopts a "schema on read" approach that enables you to be flexible about the specific columns and data types
that you want to project onto your data. You can create multiple tables with different schema from the same
underlying data, depending on how you want to use that data. You can also create views and indexes over Hive
tables, and partition tables to improve query performance.
In addition to its more usual role as a querying mechanism, Hive can be used to create a simple data
warehouse containing table definitions applied to data that you have already processed into the appropriate
format. Windows Azure storage is relatively inexpensive, and so this is a good way to create a commodity
storage system when you have huge volumes of data.
Hive is a good choice for data processing when:
The source data has some identifiable structure, and can easily be mapped to a tabular schema.
The processing you need to perform can be expressed effectively as HiveQL queries.
You want to create a layer of tables through which business users can easily query source data, and data
generated by previously executed map/reduce jobs or Pig scripts.
You want to experiment with different schemas for the table format of the output.

You can use the Hive command line tool on the HDInsight cluster to work with Hive tables, or you can use the
Hive interface in the HDInsight interactive console. You can also use the Hive ODBC driver to connect to Hive
from any ODBC-capable client application, and build an automated solution that includes Hive queries by using
the .NET SDK for HDInsight, and with a range of Hadoop-related tools such as Oozie and WebHCat.
For more information about automating an HDInsight solution, see Chapter 6 of this guide.
Creating Tables with Hive
You create tables by using the HiveQL CREATE TABLE statement, which in its simplest form looks similar to the
equivalent statement in Transact-SQL. You specify the schema in the form of a series of column names and
types, and the type of delimiter in the data that Hive will use to identify each column value. You can also specify
the format for the files the table data will be stored in if you do not want to use the default format (where data
files are delimited by an ASCII code 1 (Octal \001) character, equivalent to Ctrl + A). For example, the following
code sample creates a table named mytable and specifies that the data files for the table should be tab-
delimited.
HiveQL
CREATE TABLE mytable (col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Hive supports a sufficiently wide range of data types to suit almost any requirement. The
primitive data types you can use for columns in a Hive table are TINYINT, SMALLINT, INT,
BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, DECIMAL (though the last
three of these are not available in all versions of Hive). In addition to these primitive types
you can define columns as ARRAY, MAP, STRUCT, and UNIONTYPE. For more information
about Hive data types, see Hive Data Types in the Apache Hive language manual.
Managing Hive Table Data Location and Lifetime
Hive tables are simply metadata definitions imposed on data in underlying files. By default, Hive stores table
data in the user/hive/warehouse/table_name path in storage (the default path is defined in the configuration
property hive.metastore.warehouse.dir), so the previous code sample will create the table metadata definition
and an empty folder at user/hive/warehouse/mytable. When you delete the table by executing the DROP
TABLE statement, Hive will delete the metadata definition from the Hive database and it will also remove the
user/hive/warehouse/mytable folder and its contents.
However, you can specify an alternative path for a table by including the LOCATION clause in the CREATE TABLE
statement. The ability to specify a non-default location for the table data is useful when you want to enable
other applications or users to access the files outside of Hive. This allows data to be loaded into a Hive table
simply by copying data files of the appropriate format into the folder. When the table is queried, the schema
defined in its metadata is automatically applied to the data in the files.
An additional benefit of specifying the location is that this makes it easy to create a table for data that already
exists in the folder (perhaps the output from a previously executed map/reduce job or Pig script). After creating
the table, the existing data in the folder can be retrieved immediately with a HiveQL query.
However, one consideration for using an existing location is that, when the table is deleted, the folder it
references will also be deleted even if it already contained data files when the table was created. If you want
to manage the lifetime of the folder containing the data files separately from the lifetime of the table, you must
use the EXTERNAL keyword in the CREATE TABLE statement to indicate that the folder will be managed
externally from Hive, as shown in the following code sample.
HiveQL
CREATE EXTERNAL TABLE mytable (col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/mydata/mytable';

This ability to manage the lifetime of the table data separately from the metadata definition of the table means
that you can create several tables and views over the same data, but each can have a different schema. For
example, you may want to include fewer columns in one table definition to reduce the network load when you
transfer the data to a specific analysis tool, but have all of the columns available for another tool.
As a general guide you should:
Use INTERNAL tables (the default, commonly referred to as managed tables) when you want Hive to
manage the lifetime of the table or when the data in the table is temporary; for example, when you are
running experimental or one-off queries over the source data.
Use INTERNAL tables and also specify the LOCATION for the data files when you want to access the data
files from outside of Hive; for example, if you want to upload the data for the table directly into the
Azure Storage Vault (ASV).
Use EXTERNAL tables when you want to manage the lifetime of the data, when data is used by
processes other than Hive, or if the data files must be preserved when the table is dropped. However,
notice that you cannot use EXTERNAL tables when you implicitly create the table by executing a SELECT
query against an existing table.

The important point to remember is that a Hive table is simply a metadata schema that is imposed on the
existing data in underlying files. Depending on how you declare the table, you can have Hive manage the lifetime
of the data or you can elect to manage it yourself. Therefore it's important to understand how the LOCATION and
EXTERNAL parameters in a Hive query affect the way that the data is stored, and how its lifetime is linked to that
of the table definition.
Loading Data into Hive Tables
In addition to simply uploading data into the location specified for a Hive table you can use the HiveQL LOAD
statement to load data from an existing file into a Hive table. When the LOCAL keyword is included, the LOAD
statement moves the file from its current location to the folder associated with the table.
An alternative is to use a CREATE TABLE statement that includes a SELECT statement to query an existing table.
The results of the SELECT statement are used to create data files for the new table. When you use this
technique, the new table must be an internal table. Similarly, you can use an INSERT statement to insert the
results of a SELECT query into an existing table.
You can use the OVERWRITE keyword with both the LOAD and INSERT statements to replace any existing data
with the new data being inserted.
As a general guide you should:
Use the LOAD statement when you need to create a table from the results of a map/reduce job or a Pig
script. These scripts generate log and status files alongside the output file, and using the LOAD method
enables you to easily add the output data to a table without having to deal with the additional files that
you do not want to include in the table.
Use the INSERT statement when you want to load data from an existing table into a different table. A
common use of this approach is to upload source data into a staging table that matches the format of
the source data (for example, tab-delimited text). Then, after verifying the staged data, compress and
load it into a table for analysis, which may be in a different format such as SEQUENCEFILE.
Use a SELECT query in a CREATE TABLE statement to generate the table dynamically when you just want
simplicity and flexibility. You do not need to know the column details to create the table, and you do
not need to change the statement when the source data or the SELECT statement changes. You cannot,
however, create an EXTERNAL table this way and so you cannot control the data lifetime of the new
table separately from the metadata definition.

To compress data as you insert it from one table to another, you must set some Hive parameters to specify that
the results of the query should be compressed, and specify the compression algorithm to be used. For example,
the following HiveQL statement loads compressed data from a staging table into another table.
HiveQL
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.intermediate=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

FROM stagingtable s
INSERT INTO TABLE mytable
SELECT s.col1, s.col2;

For more information about creating tables with Hive, see Hive Data Definition Language on the Apache Hive
site. For a more detailed description of using Hive see Hive Tutorial.
Querying Tables with HiveQL
After you have created tables and loaded data files into the appropriate locations you can query the data by
executing HiveQL SELECT statements against the tables. As with all data processing on HDInsight, HiveQL queries
are implicitly executed as map/reduce jobs to generate the required results. HiveQL SELECT statements are
similar to SQL, and support common operations such as JOIN, UNION, and GROUP BY. For example, you could
use the following code to query the mytable table described earlier.
HiveQL
SELECT col1, SUM(col2) AS total
FROM mytable
GROUP BY col1;
When designing an overall data processing solution with HDInsight, you may choose to perform complex
processing logic in custom map/reduce components or Pig scripts and then create a layer of Hive tables over the
results of the earlier processing, which can be queried by business users who are familiar with basic SQL syntax.
However, you can use Hive for all processing, in which case some queries may require logic that is not possible
to define in standard HiveQL functions. In addition to common SQL semantics, HiveQL supports the inclusion of
custom map/reduce scripts embedded in a query through the MAP and REDUCE clauses, as well as custom user-
defined functions (UDFs) that are implemented in Java, or that call Java functions available in the existing
installed libraries. This extensibility enables you to use HiveQL to perform complex transformations on data as it
is queried. UDFs are discussed in more detail later in this chapter.

Hive is easy to use, and yet is surprisingly powerful when you take advantage of its ability to use
functions in Java code libraries, user-defined functions, and custom map and reduce components.
To help you decide on the right approach, consider the following guidelines:
If the source data must be extensively transformed using complex logic before being consumed by
business users, consider using custom map/reduce components or Pig scripts to perform most of the
processing, and create a layer of Hive tables over the results to make them easily accessible from client
applications.
If the source data is already in an appropriate structure for querying, and only a few specific but
complex transforms are required by an experienced analyst with the necessary programming skills,
consider using map/reduce scripts embedded in HiveQL queries to generate the required results.
If queries will be created mostly by business users, but some complex logic is still regularly required to
generate specific values or aggregations, consider encapsulating that logic in custom UDFs because
these will be simpler for business users to include in their HiveQL queries than a custom map/reduce
script.

For more information about selecting data from Hive tables, see Language Manual - Select on the Apache Hive
website.
Maximizing Query Performance in Hive
There are several techniques you can use to maximize query performance when using Hive. These techniques
fall into two distinct areas:
The design of the tables. For example, partitioning the data files so that you minimize the amount of
data that a query must handle, and using indexes to reduce the time taken for lookups required by a
query.
The design of the queries that you execute against the tables.

The following sections of this chapter explore these topics in more detail. Also see the section Tuning the
Solution later in this chapter that describes a range of techniques for optimizing query performance through
configuration.
Improving Query Performance by Partitioning the Data
Advanced options when creating a table include the ability to partition, skew, and cluster the data across
multiple files and folders:
You can use the PARTITIONED BY clause to create a subfolder for each distinct value in a specified
column (for example, to store a file of daily data for each date in a separate folder). Partitioning can
improve query performance because Hive will only scan relevant partitions in a filtered query.
You can use the SKEWED BY clause to create separate files for each row where a specified column value
is in a list of specified values. Rows with values not listed are stored together in a separate single file.
You can use the CLUSTERED BY clause to distribute data across a specified number of subfolders
(described as buckets) based on hashes of the values of specified columns.
Partitioning your Hive tables is an easy way to improve query performance, especially if the data
can be distributed evenly across the partitions (or buckets). It is particularly useful when you need to join
tables, and the joining column is one of the partition columns.
When you partition a table, the partitioning columns are not included in the main table schema part of the
CREATE TABLE statement; instead they must be included in a separate PARTITIONED BY clause. The partitioning
columns can still, however, be referenced in SELECT queries. For example, the following HiveQL statement
creates a table in which the data is partitioned by a string value named partcol1.
HiveQL
CREATE EXTERNAL TABLE mytable (col1 STRING, col2 INT)
PARTITIONED BY (partcol1 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/mydata/mytable';
When data is loaded into the table, subfolders are created for each partition column value. For example, you
could load the following data into the table.
col1 col2 partcol1
ValueA1 1 A
ValueA2 2 A
ValueB1 2 B
ValueB2 3 B
After this data has been loaded into the table, the /mydata/mytable folder will contain a subfolder named
partcol1=A and a subfolder named partcol1=B, and each subfolder will contain the data files for the values in
the corresponding partitions.
When you need to load data into a partitioned table you must include the partitioning column values. If you are
loading a single partition at a time, and you know the partitioning value, you can specify explicit partitioning
values as shown in the following HiveQL INSERT statement.
HiveQL
FROM staging_table s
INSERT INTO mytable PARTITION(partcol1='A')
SELECT s.col1, s.col2
WHERE s.col3 = 'A';
Alternatively, you can use dynamic partition allocation so that Hive creates new partitions as required by the
values being inserted. To use this approach you must enable the non-strict option for the dynamic partition
mode, as shown in the following code sample.
HiveQL
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition = true;
FROM staging_table s
INSERT INTO mytable PARTITION(partcol1)
SELECT s.col1, s.col2, s.col3;
Improving Query Performance with Indexes
If you intend to use HDInsight as a Hive-based data warehouse for Big Data analysis and reporting you can
improve the performance of HiveQL queries by implementing indexes in the Hive tables. Hive supports a
pluggable architecture for index handlers, enabling developers to create custom indexing solutions. By default, a
single reference implementation of an index handler named CompactIndexHandler is included in HDInsight, and
you can use this to index commonly queried columns in your Hive tables.
The CompactIndexHandler reference implementation creates an index that stores the
HDFS block address for each indexed value, and it improves the performance of queries
when a table contains multiple adjacent rows with the same indexed value.
To create an index you use the HiveQL CREATE INDEX statement and specify the index handler to be used, as
shown in the following code sample.
HiveQL
CREATE INDEX idx_myindex ON TABLE mytable(col1)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
The DEFERRED REBUILD clause in the previous code causes the index to be created, but it does not populate it
with values. To build the index you use the ALTER INDEX statement, as shown in the following code sample. You
should rebuild the index after each data load.
HiveQL
ALTER INDEX idx_myindex ON TABLE mytable REBUILD;
After the index has been built you can view the index files in the /hive/warehouse folder, where a folder is
created for each index. By default, indexes are partitioned using the same partitioning columns as the tables on
which they are based, and the folder structure for the index reflects this partitioning.
For an example of using partitioned tables and indexes, see Scenario 4: Enterprise BI
Integration in Chapter 11 of this guide.
Improving Performance by Optimizing the Query
Hive will automatically attempt to generate the most efficient code for the map and reduce components that
are executed when you run the query. However, there are some ways that you can help Hive to generate the
most performant code:
Avoid using a GROUP BY statement in your queries if possible. This statement forces Hadoop to use a
single instance of the reducer, which can cause a bottleneck in the query processing. Instead, if the
target for the results is a visualization and analysis tool (such as Microsoft Excel) use the features of that
tool to group and sort the results. Alternatively, run a second query that just groups the data on the
results of the original query, so that it operates on a smaller dataset and you can then accurately
determine the time the group operation takes and decide whether to include it or not. If you must use
GROUP BY, consider using the mapred.reduce.tasks configuration setting to increase the number of
reducers, which can offset the performance impact of the grouping operation.
If you need to join tables in the query, consider if you can use a MAP JOIN to minimize searches across
different instances of the mapper. For more details, see the section Resolving Performance Limitations
by Optimizing the Query later in this chapter.
During development, prefix the query with the EXPLAIN keyword. Hive will generate and display the
execution plan for the query that shows the abstract dependency tree, the dependency graph, and
details of each map/reduce stage in the query. This can help you to discover where optimization may
improve performance.

Processing Data with Pig
Pig is a query interface that provides a workflow semantic for processing data in HDInsight. Pig provides a
powerful abstraction layer over map/reduce, and enables you to perform complex processing of your source
data to generate output that is useful for analysis and reporting.
As well as using it to call custom map/reduce code, you can use Pig statements to execute implicit map/reduce
jobs that are generated by the Pig interpreter on your behalf. Pig statements are expressed in a language named
Pig Latin, and generally involve defining relations that contain data, either loaded from a source file or as the
result of a Pig Latin expression on an existing relation. Relations can be thought of as result sets, and can be
based on a schema (which you define in the Pig Latin statement used to create the relation) or can be
completely unstructured.
Pig Latin syntax has some similarities to SQL, and encapsulates many functions and expressions that make it easy
to create a sequence of complex map/reduce jobs with just a few lines of simple code. However, the syntax of
Pig Latin can be complex for non-programmers to master.
Pig Latin is not as familiar or as easy to use as HiveQL, but Pig can achieve some tasks that are difficult,
or even impossible, when using Hive. Pig Latin is great for creating relations and manipulating sets, and for
working with unstructured source data. You can always create a Hive table over the results of a Pig Latin query if
you want table format output.
You can run Pig Latin statements interactively in the interactive console or in a command line Pig shell named
Grunt. You can also combine a sequence of Pig Latin statements in a script that can be executed as a single job,
and use user-defined functions you previously uploaded to HDInsight. The Pig Latin statements are used to
generate map/reduce jobs by the Pig interpreter, but the jobs are not actually generated and executed until you
call either a DUMP statement (which is used to display a relation in the console, and is useful when interactively
testing and debugging Pig Latin code) or a STORE statement (which is used to store a relation as a text file in a
specified folder).
You can build an automated solution that includes Pig queries by using the .NET SDK for
HDInsight, and with a range of Hadoop-related tools such as Oozie and WebHCat. For more
information about automating queries, see Chapter 6 of this guide.
Pig scripts generally save their results as text files in storage, where they can easily be viewed interactively; for
example, in the HDInsight interactive JavaScript console. However, the results can be difficult to consume from
client applications or processes unless you copy the output files and import them into client tools such as
Microsoft Excel.
Pig is a good choice when you need to:
Restructure source data by defining columns, grouping values, or converting columns to rows.
Use a workflow-based approach to process data as a sequence of operations, which is often a logical
way to approach many map/reduce tasks.

Executing a Pig Script
As an example of using Pig, suppose you have a tab-delimited text file containing source data similar to the
following example.
Data
Value1 1
Value2 3
Value3 2
Value1 4
Value3 6
Value1 2
Value2 8
Value2 5
You could process the data in the source file with the following simple Pig Latin script.
Pig Latin
A = LOAD '/mydata/sourcedata.txt' USING PigStorage('\t')
AS (col1, col2:long);
B = GROUP A BY col1;
C = FOREACH B GENERATE group, SUM(A.col2) as total;
D = ORDER C BY total;
STORE D INTO '/mydata/results';
This script loads the tab-delimited data into a relation named A, imposing a schema that consists of two
columns: col1, which uses the default byte array data type, and col2, which is a long integer. The script then
creates a relation named B in which the rows in A are grouped by col1, and then creates a relation named C in
which the col2 value is aggregated for each group in B.
After the data has been aggregated, the script creates a relation named D in which the data is sorted based on
the total that has been generated. D is then stored as a file in the /mydata/results folder, which contains the
following text.
Data
Value1 7
Value3 8
Value2 16
For more information about Pig Latin syntax, see Pig Latin Reference Manual 2 on the Apache Pig website. For a
more detailed description of using Pig see Pig Tutorial.
Maximizing Query Performance in Pig
Pig supports a range of optimization rules. Some are mandatory and cannot be turned off, but you can include
certain information in your code to adjust the behavior and fine tune the query performance. Some of the ways
that you can help Pig to generate the most performant code are:
Reduce the amount of data in the query pipeline by filtering rows and projecting only the columns you need as early as possible.
Consider how you can minimize the volume of output from the map phase, such as removing null values
and values that are not required from the output.
Specify the most appropriate data types in your query. For example, if you do not specify a data type for
a numeric value Pig will use the Double type. You may be able to reduce the size of the dataset by using
an Integer or Long data type instead.
The Pig optimizer will attempt to combine operations into fewer statements, but it's good practice to do
this yourself as you write the query. For example, avoid generating a set of values in one statement and
passing the set to the next statement if you can generate the required values within the first statement.
Also avoid creating a set of partial strings from a column in one statement and then passing the set to a
FOREACH statement if you can calculate the value inside the FOREACH loop instead.
When joining datasets using a default join (not a replicated, merge, or skewed join) put the dataset
containing the most items last in the JOIN statement. The data in the last dataset listed in a JOIN
statement is streamed instead of being loaded into memory, which can reduce record spills to disk.
Use the LIMIT statement to reduce the number of items to be processed if you do not need all of the
results. The query optimizer automatically applies the LIMIT operation at the earliest possible stage in
the processing.
Use the DISTINCT statement in preference to GROUP BY and GENERATE where possible because it is
much more efficient when extracting unique values from a dataset.
Consider using the multi-query feature of Pig to split the data into branches that can be processed in
parallel as non-blocking tasks. This can reduce the amount of data that must be stored, and also allows
you to generate multiple outputs from a single job. For more information, see Multi-query Performance
on the Apache Pig website.
Consider specifying the number of reducers for a task using a PARALLEL clause. The default is one, and
you may be able to improve performance by specifying a larger number, as shown in the sketch after this list.

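The following minimal sketch illustrates three of these techniques together; it assumes two hypothetical
tab-delimited files named small.txt and large.txt. The larger relation is listed last in the JOIN statement so
that it is streamed rather than loaded into memory, the PARALLEL clause requests more than one reducer, and
LIMIT restricts the number of items that flow on to the DUMP.
Pig Latin
-- the last (larger) relation in the JOIN is streamed rather than held in memory
S = LOAD '/mydata/small.txt' USING PigStorage('\t') AS (key:chararray, val1:long);
L = LOAD '/mydata/large.txt' USING PigStorage('\t') AS (key:chararray, val2:long);
J = JOIN S BY key, L BY key PARALLEL 4;
T = LIMIT J 100;
DUMP T;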
A full discussion of optimization techniques is outside the scope of this guide, but useful lists are available in the
Pig Cookbook and in the section Performance and Efficiency on the Apache Pig website. Also see the section
Tuning the Solution later in this chapter, which describes a range of techniques for optimizing query
performance through configuration.
Writing Map/Reduce Code
Map/reduce code consists of two functions: a mapper and a reducer. The mapper is run in parallel on multiple
cluster nodes, each node applying it to its own subset of the data. The output from the mapper function on each
node is then passed to the reducer function, which collates and summarizes the results of the mapper function.
You can implement map/reduce code in Java (which is the native language for map/reduce jobs in all Hadoop
distributions) or in a number of other supported languages, including JavaScript, Python, C# and F#, through
Hadoop Streaming. You can also build an automated solution that includes custom map/reduce jobs by using the .NET SDK
for HDInsight, and with a range of Hadoop-related tools such as Oozie and WebHCat.
You don't need to know Java to use HDInsight. The high-level abstractions such as Hive and Pig will create
the appropriate mapper and reducer components for you, or you can write map and reduce components in other
languages that you are more familiar with.
In most HDInsight processing scenarios it is simpler and more efficient to use a higher-level abstraction such as
Pig or Hive to generate the required map/reduce code on your behalf. However, you might want to consider
creating your own map/reduce code when:
You want to process data that is completely unstructured by parsing it and using custom logic to obtain
structured information from it.
You want to perform complex tasks that are difficult (or impossible) to express in Pig or Hive without
resorting to creating a UDF; for example, using an external geocoding service to convert latitude and
longitude coordinates or IP addresses in the source data to geographical location names.
You already have an algorithm that you want to apply to the code, or you want to reuse some of the
same code (such as the same mapper or reducer component) in different queries.
You want to reuse your existing .NET, Python, or JavaScript code in map/reduce components. This
requires you to use the Hadoop streaming interface (described later in this chapter).
You want to be able to refine and exert full control over the query execution process, such as using a
combiner in the map phase to reduce the size of the map process output.

For more information about using languages other than Java to write map/reduce components, and using the
Hadoop Streaming interface, see the section Using Hadoop Streaming and .NET Languages later in this
chapter. For more information about automating map/reduce queries, see Chapter 6 of this guide.
Creating Map and Reduce Components
The following code sample shows a commonly referenced JavaScript map/reduce example that counts the
words in a source that consists of unstructured text data.
JavaScript
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
The map function in this script splits the contents of the text input value into an array of strings (using anything
that is not an alphabetic character as a delimiter). Each string in the array is then used as a key with an
associated value of 1.
Each key generated by the map function is passed to the reduce function along with its values, which are added
together to determine the total number of times the key string was included in the source data. In this way, the
two functions work together to create a list of the distinct strings in the source data with the total count of each
string, as shown here.
Data
Aardvark 2
About 7
Above 12
Action 3
...
For more information about writing map/reduce code, see MapReduce Tutorial on the Apache Hadoop website.
Using Hadoop Streaming and .NET Languages
The Hadoop core within HDInsight supports a technology called Hadoop Streaming that allows you to interact
with the map/reduce process and run your own code outside of the Hadoop core. This code replaces the
map/reduce functionality that would usually be implemented using Java components. You can use streaming
when executing a custom map/reduce job or when executing jobs in Hive and Pig.
Figure 1 shows an overview of the way that streaming works.

Figure 1
High level overview of Hadoop Streaming
When using Hadoop Streaming, each node passes the data for the map part of the process to the specified code
or component as a stream, and accepts the results from the code as a stream, instead of internally invoking a
mapper component written in Java. In the same way, the node(s) that execute the reduce process pass the data
as a stream to the specified code or component, and accept the results from the code as a stream, instead of
internally invoking a Java reducer component.
Streaming has the advantage of decoupling the map/reduce functions from the Hadoop core, allowing almost
any type of components to be used to implement the mapper and the reducer. The only requirement is that the
components must be able to read from and write to the standard input and output (stdin and stdout).
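As an illustration only, the following sketch shows the general shape of a streaming job submission; the path
to the hadoop-streaming.jar file, and the mapper and reducer executable names, are placeholders that depend on
your cluster and on your own code.
Hadoop command line
hadoop jar lib\hadoop-streaming.jar -input /example/input -output /example/output ^
  -mapper mymapper.exe -reducer myreducer.exe -file mymapper.exe -file myreducer.exe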
You might want to avoid using Hadoop Streaming unless it is absolutely necessary due to the inherent
reduction in job performance that it causes. At the time of writing, this means that you may decide to avoid using
non-Java languages to create map/reduce code unless there is no alternative.
The main disadvantages with using the streaming interface are the effect it has on performance, and the fact
that the data must be in text form. The additional movement of the data over the streaming interface can slow
overall query execution time quite considerably, and so this approach is typically used only when there is no
realistic alternative. Streaming tends to be used mostly to enable the creation of map and reducer components
in languages other than Java. This includes using .NET languages such as C# and F# with HDInsight.
For more details see Hadoop Streaming on the Apache website. For information about writing HDInsight
map/reduce jobs in languages other than Java, see Hadoop on Azure C# Streaming Sample Tutorial and Hadoop
Streaming Alternatives.
Using Hadoop Streaming in HDInsight
If you decide that you must write your map/reduce code using a .NET language such as C#, you can simplify the
task by taking advantage of the .NET SDK for Hadoop. This contains a set of helper classes for creating
map/reduce components, and simplifies the work required to connect all the required components with
HDInsight when executing a job. There is also a set of classes in the MapReduce library that you can use to
abstract many parts of the process.
Figure 2 shows the relevant parts of the .NET SDK for Hadoop.

Figure 2
Using the .NET SDK for Hadoop to create non-Java map/reduce code
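As a minimal sketch, the word count example shown earlier might be written with the SDK as follows. This
assumes the MapperBase and ReducerCombinerBase base classes and the EmitKeyValue method exposed by the
Microsoft.Hadoop.MapReduce namespace; the exact class and member names may vary between versions of the SDK.
C#
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

// Mapper: emits each word in the input line with a count of 1.
public class WordCountMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        var words = inputLine.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            context.EmitKeyValue(word.ToLowerInvariant(), "1");
        }
    }
}

// Reducer: sums the counts emitted for each distinct word.
public class WordCountReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Sum(v => long.Parse(v)).ToString());
    }
}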
For information about using the SDK see Using the Hadoop .NET SDK with the HDInsight Service.
Maximizing Map/Reduce Code Performance
Many techniques for maximizing the runtime performance of map/reduce components are the same as for any
application you create. These techniques include good practice use of logical operators, comparison statements,
and execution flow to create efficient code. Some of the other ways that you can maximize performance are:
Use Hadoop Streaming only when necessary, such as when writing map/reduce components using shell
scripts or in languages other than Java. Hadoop Streaming typically reduces the overall performance of
the job, though this may be an acceptable tradeoff if you actually need to use a language other than
Java.
Minimize storage I/O, memory use, and network bandwidth load by controlling the volume of data that
is generated at each stage during the map and reduce operations. For example, consider generating the
minimum required volume of output from each mapper, such as removing null values or values that are
not required from the output.
Where possible create a combiner component that can be run on every map node to perform a
preliminary aggregation of the results from that map process. You may be able to use the same reducer
component as both a combiner and in the reduce phase of the job.
Choose the most appropriate data types in your code. Avoid using Text types where possible, and use the
most compact numeric types to suit your data (for example, use integer types instead of long types where
appropriate).
If you are using Sequence Files for your jobs, consider using the built-in Hadoop data types instead of
the language-specific types. Hadoop types such as IntWritable and LongWritable typically provide
better performance with Sequence Files, and the numeric types VIntWritable and VLongWritable
automatically persist the values using the least number of bytes.
Try to avoid creating new instances of objects or types (including the basic data types) within a loop or
repeated section of code. Instead, create the instance outside of the repeated section and reuse it in all
iterations of that section, as shown in the sketch after this list.
Avoid string manipulation and number formatting if possible because these operations are relatively
slow in Hadoop. Use the StringBuffer class instead of string concatenation.

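As an example of the object reuse technique described in the list above, the following Java sketch shows a
hypothetical mapper (using the org.apache.hadoop.mapreduce API) that parses tab-delimited input. The Text and
LongWritable output instances are created once as fields and reused each time the map method runs, rather than
being allocated inside the method for every record.
Java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ProductSalesMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Created once and reused for every input record, rather than inside map().
    private final Text outKey = new Text();
    private final LongWritable outValue = new LongWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length >= 2) {
            outKey.set(fields[0]);
            outValue.set(Long.parseLong(fields[1]));
            context.write(outKey, outValue);
        }
    }
}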
Also see the section Tuning the Solution later in this chapter, which describes a range of techniques for
optimizing query performance through configuration.
Choosing a Query Mechanism
The previous sections described how Hive, Pig, and custom map and reduce components can be used. The
following table summarizes the advantages and considerations for each one, and will help you to choose the
approach most appropriate for your requirements.
Query mechanism: Hive using HiveQL
Advantages:
It uses a familiar SQL-like syntax.
It can be used to produce persistent tables of data that can easily be partitioned and indexed.
Multiple external tables and views can be created over the same data.
It supports a simple data warehouse implementation.
Considerations:
It requires the source data to have at least some identifiable structure.
Even though it is highly optimized, it might not produce the most performant map/reduce code.
It might not be able to carry out some types of complex processing tasks.

Query mechanism: Pig using Pig Latin
Advantages:
An excellent solution for manipulating data as sets, and for restructuring data by defining columns, grouping values, or converting columns to rows.
It can use a workflow-based approach as a sequence of operations on data.
It allows you to fine tune the query to maximize performance or minimize server and network load.
Considerations:
Pig Latin is less familiar and more difficult to use than HiveQL.
The default output is usually a text file, rather than the table-format file generated by Hive, and so is more difficult to use with visualization tools such as Excel.
Not generally suited to ETL operations unless you layer a Hive table over the output.

Query mechanism: Custom map/reduce components
Advantages:
It provides full control over the map and reduce phases and execution.
It allows queries to be optimized to achieve maximum performance from the cluster, or to minimize the load on the servers and the network.
Map and reduce components can be written in a range of widely known languages with which developers are likely to be familiar.
Considerations:
It is more difficult than using Pig or Hive because you must create your own map and reduce components.
Processes that require the joining of sets of data are more difficult to achieve.
Even though test frameworks are available, debugging the code is more complex than for a normal application because it runs as a batch job under the control of the Hadoop job scheduler.
The following table shows some of the more general suggestions that will help you make the appropriate choice
depending on the task requirements.
Requirement Appropriate technologies
Table or dataset joins, or manipulating nested data. Pig and Hive.
Ad-hoc data analysis. Pig, Hadoop Streaming.
SQL-like data analysis and data warehousing. Hive.
Working with binary data or Sequence Files. Java map/reduce components.
Working with existing Java or map/reduce libraries. Java map/reduce components.
Maximum performance for large or recurring jobs. Well-designed Java map/reduce components.
Using scripting or non-Java languages. Hadoop Streaming.
User Defined Functions
Developers often find that they reuse the same code in several locations, and the typical way to optimize this is
to create a user-defined function (UDF) that can be imported into other projects when required. Often a series
of UDFs that accomplish related functions are packaged together in a library so that the library can be imported
into a project, and code can take advantage of any of the UDFs it contains.
You can create individual UDFs and UDF libraries for use with HDInsight queries and transformations. Typically
these are written in Java and compiled, although you can write UDFs in Python to use with Pig (but not with Hive
or custom map/reduce code) if you prefer. The compiled UDF is uploaded to HDInsight and can be referenced
and used in a Hive or Pig script, or in your map/reduce code; however UDFs are most commonly used in Hive
and Pig scripts.
Python can be used to create UDFs but these can only be used in Pig, and the process for creating and
using them differs from using UDFs written in Java.
UDFs can be used not only to centralize code for reuse, but also to perform tasks that are difficult (or even
impossible) in the Hive and Pig scripting languages. For example, a UDF could perform complex validation of
values, concatenation of column values based on complex conditions or formats, aggregation of rows,
replacement of specific values with nulls to prevent errors when processing bad records, and much more.
Creating and Using UDFs in Hive Scripts
In Hive, UDFs are often referred to as plugins. You can create three different types of Hive plugin. A standard UDF
is used to perform operations on a single row of data, such as transforming a single input value into a single
output value. A table-generating function (UDTF) is used to perform operations on a single row, but can produce
multiple output rows. An aggregating function (UDAF) is used to perform operations on multiple rows and
outputs a single row.
Each type of UDF extends a specific Hive class in Java. For example, the standard UDF must extend the built-in
Hive class UDF or GenericUDF and accept text string values (which can be column names or specific text strings).
As a simple example, you can create a UDF named your-udf-name as shown in the following code.
Java
package your-udf-name.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class your-udf-name extends UDF {
    public Text evaluate(final Text s) {
        // implementation here
        // return the result
    }
}
The body of the UDF, the evaluate function, accepts one or more string values. Hive passes the values of
columns in the dataset to these parameters at runtime, and the UDF generates a result. This might be a text
string that is returned within the dataset; or a Boolean value if the UDF performs a comparison test against the
values in the parameters. The arguments must be types that Hive can serialize.
After you create and compile the UDF, you upload it to HDInsight using the interactive console or a script with
the following command.
Interactive console command
add jar /path/your-udf-name.jar;
Next you must register it using a command such as the following.
Hive script
CREATE TEMPORARY FUNCTION local-function-name
AS 'package-name.function-name';
You can then use the UDF in your Hive query or transformation. For example, if the UDF returns a text string
value you can use it as shown in the following code to replace the value in the specified column of the dataset
with the value generated by the UDF.
Hive script
SELECT local-function-name(column-name) FROM your-data;
If the UDF performs a simple task such as reversing the characters in a string, the result would be a dataset
where the value in the specified column of every row would have its contents reversed.
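For illustration, a complete UDF of that kind might look like the following sketch; the package and class names
here are hypothetical, and you would substitute your own.
Java
package com.example.hive.udf;   // hypothetical package name

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Returns the input string with its characters reversed; a null input returns null.
public final class ReverseString extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) {
            return null;
        }
        return new Text(new StringBuilder(s.toString()).reverse().toString());
    }
}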
However, the registration only makes the UDF available for the current session, and you will need to re-register
it each time you connect to Hive.
For more information about creating and using standard UDFs in Hive, see HivePlugins on the Apache website.
For more information about creating different types of Hive UDF, see User Defined Functions in Hive, Three
Little Hive UDFs: Part 1, Three Little Hive UDFs: Part 2, and Three Little Hive UDFs: Part 3 on the Oracle website.
Creating and Using UDFs in Pig Scripts
In Pig you can also create different types of UDF. The most common is an evaluation function that extends the
class EvalFunc. The function must accept an input of type Tuple, and return the required result (which can be
null). The following outline shows a simple example.
Java
package your-udf-package;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class your-udf-name extends EvalFunc<String>
{
    public String exec(Tuple input) throws IOException {
        // implementation here
        // return the result
    }
}
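As a concrete illustration, a UDF that returns the lower-cased equivalent of its input (the example used in the
rest of this section) might be implemented as shown in the following sketch; the package and class names are
hypothetical.
Java
package myudfs;   // hypothetical package name

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Returns the lower-cased equivalent of the first field in the input tuple.
public class ToLower extends EvalFunc<String>
{
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toLowerCase();
    }
}
After the JAR file containing this class is registered, the function is invoked in Pig Latin by its fully qualified
class name; for example, myudfs.ToLower(column1).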
As with a Hive UDF, after you create and compile the UDF you upload it to HDInsight using the interactive
console or a script with the following command.
Interactive console command
add jar /path/your-udf-name.jar;
Alternatively, you can just copy your compiled UDF into the pig/bin folder on the cluster through a Remote
Desktop connection. The REGISTER command at the start of a Pig script will then make the UDF available.
You can now use the UDF in your Pig queries and transformations. For example, if the UDF returns the lower-
cased equivalent of the input string, you can use it as shown in this query to generate a list of lower-cased
equivalents of the text strings in the first column of the input data.
Pig script
REGISTER 'your-udf-name.jar';
A = LOAD 'your-data' AS (column1: chararray, column2: int);
B = FOREACH A GENERATE your-udf-package.your-udf-name(column1);
DUMP B;
A second type of UDF in Pig is a filter function that you can use to filter data. A filter function must extend the
class FilterFunc, accept one or more values as Tuples, and return a Boolean value. The UDF can then be used to
filter rows based on values in a specified column of the dataset. For example, if a UDF named IsShortString
returns true for any input value less than five characters in length, you could use the following script to remove
any rows where the first column has a value less than five characters.
Pig script
REGISTER 'your-udf-name.jar';
A = LOAD 'your-data' AS (column1: chararray, column2: int);
B = FILTER A BY NOT IsShortString(column1);
DUMP B;
For more information about creating and using UDFs in Pig, see the Pig UDF Manual on the Apache Pig website.
Unifying and Stabilizing Query Jobs with HCatalog
So far in this chapter you have seen how to use Hive, Pig, and custom map/reduce code to process data in an
HDInsight cluster. In each case you use code to project a schema onto data that is stored in a particular location,
and then apply the required logic to filter, transform, summarize, or otherwise process the data to generate the
required results.
The code must load the source data from wherever it is stored, and convert it from its current format to the
required schema. This means that each script must include assumptions about the location and format of the
source data. These assumptions create dependencies that can cause your scripts to break if an administrator
chooses to change the location, format, or schema of the source data.
Additionally, each processing interface (Hive, Pig, or custom map/reduce) requires its own definition of the
source data, and so complex data processes that involve multiple steps in different interfaces require consistent
definitions of the data to be maintained across multiple scripts.
HCatalog provides a tabular abstraction layer that helps unify the way that data is interpreted across processing
interfaces, and provides a consistent way for data to be loaded and stored; regardless of the specific processing
interface being used. This makes it easier to create complex, multi-step data processing solutions that enable
you to operate on the same data by using Hive, Pig, or custom map/reduce code without having to handle
storage details in each script.

HCatalog doesn't change the way scripts and queries work. It just abstracts the details of the data file
location and the schema so that your code becomes less fragile because it has fewer dependencies, and the
solution is also much easier to administer.

An overview of the use of HCatalog as storage abstraction layer is shown in Figure 3.


Figure 3
Unifying different processing mechanisms with HCatalog

Understanding the Benefits of Using HCatalog
To understand the benefits of using HCatalog, consider the typical scenario shown in Figure 4.

Figure 4
An example solution that benefits from using HCatalog
In the example, Hive scripts create the metadata definition for two tables that have different schemas. The table
named mydata is created over some source data uploaded as a file to HDInsight, and this defines a Hive location
for the table data (step 1 in Figure 4). Next, a Pig script reads the data defined in the mydata table, summarizes
it, and stores it back in the second table named mysummary (steps 2 and 3). However, in reality, the Pig script
does not access the Hive tables (which are just metadata definitions). It must access the source data file in
storage, and write the summarized result back into storage, as shown by the dotted arrow in Figure 4.
In Hive, the path or location of these two files is (by default) denoted by the name used when the tables were
created. For example, the following HiveQL shows the definition of the two Hive tables in this scenario, and the
code that loads a data file into the mydata table.
HiveQL
CREATE TABLE mydata (col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/mydata/data.txt' INTO TABLE mydata;

CREATE TABLE mysummary (col1 STRING, col2 BIGINT);
Hive scripts can now access the data simply by using the table names mydata and mysummary. Users can use
HiveQL to query the Hive table without needing to know anything about the underlying file location or the data
format of that file.
However, the Pig script that will group and aggregate the data, and store the results in the mysummary table,
must know both the location and the data format of the files. Without HCatalog, the script must specify the full
path to the mydata table source file, and be aware of the source schema in order to apply an appropriate
schema (which must be defined in the script). In addition, after the processing is complete, the Pig script must
specify the location associated with the mysummary table when storing the result back in storage, as shown in
the following code sample.
Pig Latin
A = LOAD '/mydata/data.txt'
USING PigStorage('\t') AS (col1, col2:long);
...
...
...
STORE X INTO '/mysummary/data.txt';
The file locations, the source format, and the schema are hard-coded in the script, creating some potentially
problematic dependencies. For example, if an administrator moves the data files, or changes the format by
adding a column, the Pig script will fail.
Using HCatalog removes these dependencies by enabling Pig to use the Hive metadata for the tables. To use
HCatalog with Pig you must specify the useHCatalog parameter when starting the grunt client interface tool,
and the path to the HCatalog installation files must be registered as an environment variable named
HCAT_HOME. For example, you could use the following command line statements to launch the grunt interface.
Command Line to start Pig
SET HCAT_HOME=C:\apps\dist\hcatalog-0.4.1
Pig -useHCatalog
With the HCatalog support loaded you can now use the HCatLoader and HCatStorer objects in the Pig script,
enabling you to access the data through the Hive metadata instead of requiring direct access to the data file
storage.
Pig Latin
A = LOAD 'mydata'
USING org.apache.hcatalog.pig.HCatLoader();
...
...
...
STORE X INTO 'mysummary'
USING org.apache.hcatalog.pig.HCatStorer();

HCatalog is built on top of the Hive metastore and uses Hive data definition language
(DDL) to define tables that can be read and written to by Pig and map/reduce code
through the HCatLoader and HCatStorer classes.
The script stores the summarized data in the location denoted for the mysummary table defined in Hive, and so
it can be queried using HiveQL as shown in the following example.
HiveQL
SELECT * FROM mysummary;

Deciding Whether to Adopt HCatalog
The following factors will help you decide whether to incorporate HCatalog in your HDInsight solution:
It makes it easy to abstract the data storage location, format, and schema from the code used to
process it.
It minimizes fragile dependencies between scripts in complex data processing solutions where the same
data is processed by multiple tasks.
It is easy to incorporate into solutions that include Pig scripts, requiring very little extra code.
Additional effort is required to use HCatalog in custom map/reduce components because you must
create your own custom load and store functions.
If you use only Hive scripts and queries, or you are creating a one-shot solution for experimentation
purposes and do not intend to use it again, HCatalog is unlikely to provide any benefit.

Evaluating the Results
After you have used HDInsight to process the source data, you can use the data for analysis and reporting
which forms the foundation for business decision making. However, before making critical business decisions
based on the results you must carefully evaluate them to ensure that they are:
Meaningful. The results when combined and analyzed make sense, and the data values relate to one
another in a meaningful way.
Accurate. The results are correct, with no errors introduced by calculations or transformations applied
during processing.
Useful. The results are applicable to the business decision they will support, and provide relevant
metrics that help inform the decision making process.

You will often need to employ the services of a business user who intimately understands the business context
for the data to perform the role of a data steward and sanity check the results to determine whether or not
they fall within expected parameters. Depending on the complexity of the processing, you might also need to
select a number of data inputs for spot-checking and trace them through the process to ensure that they
produce the expected outcome.
When you are planning to use HDInsight to perform predictive analysis, it can be useful to evaluate the process
against known values. For example, if your goal is to use demographic and historical sales data to determine the
likely revenue for a proposed retail store, you can validate the processing model by using appropriate source
data to predict revenue for an existing store and comparing the resulting prediction to the actual revenue value.
If the results of the data processing you have implemented vary significantly from the actual revenue, then it
seems unlikely that the results for the proposed store will be reliable!
It may not be possible to validate all of the source data for a query, especially if it is collected from
external sources such as social media sites. A good way to measure the validity of the result is to perform the
same query and analysis to predict an already known outcome, such as sales trends over an existing sales region,
and then compare the result with reality. If there is a reasonable match, you can generally assume that your
hypothesis is valid.

Viewing Results
To evaluate the results of your HDInsight data processing solution, you must be able to view them. The way that
you do this depends on the kind of result produced by your queries. For example:
If the result is a dataset containing rows with multiple values, such as the output from a Hive query, you
will probably need to import the data into a suitable visualization tool such as Microsoft Excel in order
to check the validity of the result. Chapter 5 of this guide explains the many options available for
visualizing and analyzing datasets such as this.
If the result is a single value, such as a count of some specific criterion in the source data, it is easy to
view by using the #cat command in the interactive console, or by opening the result file generated
by your queries in a text editor.
If the result is a set of key/value pairs, you can create a simple chart within the interactive console by
loading the file data and using the JavaScript graph library to visualize it. The following code shows how
you can create a pie chart using this approach.
Interactive JavaScript
file = fs.read("./results/data")
data = parse(file.data, "col1, col2:long")
graph.pie(data);

These commands read the contents of the files in the ./results/data folder into a variable, parse the
variable into two columns (the first using the default chararray data type, the second containing a long
integer), and display the data as a pie chart in the console; as shown in Figure 5.

Figure 5
Viewing Data in the Interactive Console

Tuning the Solution
After you have built an HDInsight solution for Big Data processing, you can monitor its performance and start to
find ways in which you can tune the solution to reduce query processing time or hardware resource utilization.
Irrespective of the way that you construct a query, the language and tools you use, and the way that you submit
the query to HDInsight, the result is execution within the Hadoop core of two Java components: the mapper and
the reducer. Figure 6 shows the stages of the overall process that occur on a data node in the cluster when it
executes both the map and reduce phase of the query (not all servers will execute the reduce phase).

Figure 6
The stages of the map and reduce phase on a data node in an HDInsight cluster
The stages shown in Figure 6 are as follows.
The Hadoop engine on each data node server in the cluster loads the data from its local store into
memory and splits it into segments.
The mapper component executes on each data node server. It scans the data segments and generates
output that it stores in another memory buffer.
As the map process on each data node server completes, the Hadoop engine sorts and partitions these
intermediate results and writes them to storage.
The name node server determines which nodes will execute the reduce algorithm. On each of the
selected servers the Hadoop engine loads the intermediate results from storage into a memory buffer,
combines them with the intermediate results from other servers in the cluster that are not executing
the reduce phase, shuffles and sorts them, and stores them in another memory buffer.
The reducer component on the selected data node servers executes the reduce algorithm, and then
writes the results to an output file.

It's clear from this that the performance of the cluster when executing a query depends on several factors:
The efficiency of the storage system and its connectivity to the server. This affects the rate at which
data can be read from storage into memory, and results written back to storage.
Memory availability. The process requires a number of in-memory buffers to hold intermediate results.
If there is not enough physical memory available, rows are spilled to disk for temporary storage while
the query executes.
The speed of the CPU. The rate at which the mapper and reducer components, and the Hadoop engine,
can execute affects the total processing time for the query.
Network bandwidth and latency between the data node servers in the cluster. During the reduce phase
the output from the mappers must be moved to the server(s) that are executing the reducer
component.

These four factors can be summarized as storage I/O, memory, computation, and network bandwidth. Any of
these could cause a bottleneck in the execution of a query, though typically storage I/O is the most common.
However, you cannot assume this is the case. If you intend to execute a query on a repeatable basis, and the
performance is causing concern, you should attempt to tune the performance of the query by taking advantage
of the comprehensive logging and monitoring capabilities of Hadoop and HDInsight.

Keep in mind that Hadoop is not designed to be a high performance query engine. Its
strength is in its scalability and the consequent ability to handle huge datasets. Unlike
traditional query mechanisms, such as relational database systems, the design of Hadoop
allows almost infinite scale out with near linear gain in performance as nodes are added.

Typically, tuning a query is an iterative process:
Use the log files generated by HDInsight to discover how long each stage of the process takes, and
identify the part of the process that is most likely to cause a bottleneck.
Take steps to remove or minimize the impact of the bottleneck you find. You should attempt to rectify
only a single bottleneck at a time; otherwise you may not be able to measure the effects of the specific
change you make.
Run the query again and use the log files to determine if the change you made has removed or reduced
the impact of the bottleneck. If not, try making a different change and repeat the process.
If the changes you made have removed or minimized the bottleneck, look in the log files for any other
areas that are limiting the efficiency of the process. If you identify another bottleneck, repeat the entire
process to remove or minimize this one.
Continue until you achieve acceptable query processing performance, or until you have minimized all of
the bottlenecks that you can identify.


There is no point in optimizing or performance tuning a query if it's a one-off that you don't intend to
execute again, or if the current performance is acceptable. Optimization is a trade-off between the business
value of a gain in performance against the cost and time it takes to carry out the tuning. You will probably find
that a small amount of effort brings large initial gains (the "low hanging fruit"), but additional effort gives
increasingly less performance gain with each iteration.
Locating Bottlenecks in Query Execution
Performance limitations can arise in any of the four main areas of storage I/O, memory, computation, and
network bandwidth. In addition, each of these areas comes into play at several stages of the query process, as
you saw in Figure 6. For example, it's possible that insufficient memory in the server could cause rows to be
spilled to disk rather than all held in memory during the map phase. However, the moderate reduction of
storage I/O delay provided by using faster disks with a larger onboard cache will probably provide much less
performance gain compared to adding more memory to the server so that rows do not need to be spilled.
It's also possible that the existing hardware can execute the query much more quickly if you can optimize the
map and reduce components so that, for example, fewer intermediate rows need to be buffered during the map
phase, which would remove the requirement to spill rows to disk with the existing memory capacity of the
servers. You'll see more information about optimizing query design later in this chapter.
After you identify a bottleneck, and before you start modifying configuration properties or extending the
hardware by adding more nodes, think about how you might be able to optimize the query to remove or
minimize the bottleneck. For example, if you write the query in a high-level language such as HiveQL you may be
able to boost performance by changing the query or by splitting it into two separate queries.
Accessing Performance Information
Most of the performance information you will use to monitor query performance is available from two portals
that are part of the Hadoop system. The Job Tracker portal is accessed on port 50030 of your cluster's head
node. The Health and Status Monitor is accessed on port 50070 of your cluster's head node.
The easiest way to use them is to open a Remote Desktop connection to the cluster from the main Windows
Azure HDInsight portal. The link named Hadoop MapReduce Status located on the desktop of the remote virtual
server provides access to the Job Tracker portal, while the link named Hadoop Name Node Status opens the
Health and Status Monitor portal.
To view the counters for a job in the Job Tracker portal scroll to the Completed Jobs section of the Map/Reduce
Administration page and select the job ID. This opens the History Viewer page for that job, which displays a
wealth of information. It includes the number of mapper and reducer tasks; and full details of the job counters,
input and output bytes, file system reads and writes, and details of the map/reduce process. For any job you
select in the History Viewer page you can view all the configuration settings that job used by selecting the link to
the job configuration XML file.
In addition to the combined information from all of the cluster nodes for a job, you can view history details for a
job on each individual cluster node. In the Map/Reduce Administration page of the Job Tracker select the link in
the Nodes column of the table that contains the Cluster Summary, and then select the node you are interested
in. This opens the local Job Tracker portal on that cluster node.
For more information about the configuration files in HDInsight, and how you set configuration property values,
see the section Configuring HDInsight Clusters and Jobs in Chapter 7 of this guide.
Identifying Storage I/O Limitations
Storage I/O performance on data node servers is often the most likely candidate for limiting query performance,
due simply to the physical nature of the storage mechanism. Data must be loaded from and written to storage in
each map phase, together with any spilled rows, and this is usually where problems arise when there is a huge
volume of data to process. Data is also loaded from and written to storage during each reduce phase, although
this is usually a much smaller volume of data.
Storage I/O can also be an issue on the head node of a cluster if cluster configuration and management data
must be repeatedly read from disk, and updates written to disk. This may occur when there is a shortage of RAM
on the head node, or there are a great many data file segments to manage. Windows Azure HDInsight uses
solid-state drives (SSDs) on the head node to maximize cluster management performance.
To detect when storage I/O performance might slow query execution, examine the values of the
ASV_BYTES_READ and HDFS_BYTES_READ counters in the job counters log for the map phase (in the History
Viewer page for the job in the Job Tracker portal). These counters show the number of bytes loaded into
memory at the start of the map phase, and indicate the amount of data that was loaded by the cluster at the
start of query execution.
The values of the FILE_BYTES_WRITTEN counter for the map phase and the FILE_BYTES_READ counter for the
reduce phase indicate the amount of data that was written to disk at the end of the map phase and loaded from
disk at the start of the reduce phase.
The values of the HDFS_BYTES_WRITTEN counters in the job counters log for the reduce phase indicate the
amount of data that was written to disk after the query completed.
To minimize the effect of storage I/O limitations, consider optimizing the query, compressing the data, and
adjusting the memory configuration of the cluster servers. See the later sections of this chapter for more details.
Identifying Memory Limitations
When you configure an HDInsight cluster, the amount of RAM available to Hadoop on each server is set to a
default value that you do not need to explicitly specify. However, a shortage of RAM can slow query execution
by causing records to be spilled to disk when the memory buffers are full. The additional storage I/O operations
are likely to severely affect query execution times. In addition, the head node must have sufficient RAM to hold
the lookup tables and other data associated with managing the cluster and the query processes on each server.
To detect when memory shortage might slow query execution examine the values of the counters
FILE_BYTES_READ, FILE_BYTES_WRITTEN, and Spilled Records in the job counters log (in the History Viewer
page for the job in the Job Tracker portal). The value for the counter FILE_BYTES_WRITTEN for the map phase
should be approximately equal to the value of FILE_BYTES_READ during the reduce phase, indicating that there
was no overflow of the memory buffer.
It's less common for spilled records to occur from a reduce task, but it may happen if the input to the reduce
phase is very large. The same job counters as you use for the map phase provide values for the reduce phase as
well.
The Windows performance counters for memory usage on each server provide a useful indication of the
memory limitations as the query executes. You may also see OutOfMemory errors generated as the query runs,
and in the results log, when there is an extreme shortage of memory.
To minimize the effect of memory limitations, consider adjusting the configuration of the cluster. See the later
sections of this chapter for more details.
Identifying Computation Limitations
Modern CPUs are much faster than peripheral equipment such as disk storage and (to a lesser extent)
memory, so Hadoop query execution speed is unlikely to be limited by CPU overload. It's more likely that CPUs on
the cluster servers are underutilized, though this obviously depends on the actual CPUs in use.
It's possible to configure the number of map and reduce task instances that Hadoop will execute for a query by
setting the mapred.map.tasks and mapred.reduce.tasks configuration properties. If there are no other
bottlenecks, such as storage I/O or memory, that are limiting the performance of the query you may be able to
reduce the execution time by specifying additional instances of map and reduce tasks within the cluster.
The Map/Reduce Administration page of the Job Tracker portal shows the total number of slots available for
instances of map and reduce tasks. The History Viewer page for each job shows the actual number of instances
of each task. Note that, if you are using Hive to execute the query and that query contains an ORDER BY clause,
only one reducer task instance will be executed by default.
Underutilization of the CPU can also occur when you use more than one reducer task instance and the
distribution of data between them is uneven. This occurs when some map tasks create much larger intermediate
results sets than others. To identify this condition view the job counters for the job in the History Viewer page
on each individual cluster node, and compare the values of the Reduce input records counters.
To minimize the effect of computation limitations, consider adjusting the configuration of the cluster and
increasing the number of data nodes in the cluster. See the later sections of this chapter for more details.
Identifying Network Bandwidth Limitations
Hadoop must pass data and information between nodes as the query executes. For example, the head node
passes instructions to each data node to manage and synchronize the data movement operations that occur on
each server, and to initiate execution of the map and reduce phases. In addition, the intermediate data from the
map phase on each server must pass to the selected server(s) that will execute the reduce phase.
You can determine the number of input records for the reduce phase on each server in the cluster by examining
the value of the Reduce input records counter for each job (in the History Viewer page for the job in the Job
Tracker portal).
To minimize the effect of network bandwidth limitations, consider compressing the data and optimizing the
query. See the following sections of this chapter for more details.
Overview of Tuning Techniques
The following table summarizes the major performance issues that you might detect for a query, and provides a
pointer to the type of solution you can apply to resolve these issues. The subsequent sections of this chapter
provide more information on how you can apply the techniques shown in the table.
Bottleneck or issue: Massive I/O caused by a large volume of source data.
Tuning techniques: Compress the source data.

Bottleneck or issue: Massive I/O caused by a large number of spilled records during the map phase.
Tuning techniques: Refine the query to minimize the map output, or allocate additional memory for the buffer used for sorting records in the map phase.

Bottleneck or issue: Massive I/O or network traffic caused by a large volume of data generated by the map phase.
Tuning techniques: Refine the query to minimize the map output, compress the map output, or implement a combiner that reduces the volume of output from the map phase.

Bottleneck or issue: Massive I/O and network traffic caused by a large volume of data output from the final reduce phase.
Tuning techniques: Compress the job output or reduce the replication factor for the cluster (the number of times data is replicated). Additionally, for a Hive query, compress the intermediate output from the map phase.

Bottleneck or issue: Insufficient running tasks caused by improper configuration.
Tuning techniques: Increase the number of map/reduce tasks and slots to take full advantage of the available slots on each data node.

Bottleneck or issue: Insufficient memory causing the job to hang.
Tuning techniques: Adjust the memory allocation for the tasks.

Bottleneck or issue: Uneven distribution of load on the reducers, causing one or more to run much longer than the rest.
Tuning techniques: Refine the query to reduce the skewing of the results from the map phase on each machine to provide more even data distribution. For a Hive query, enable the automatic Map-Join feature.

Bottleneck or issue: A Hive query containing an ORDER BY clause forces use of a single reducer.
Tuning techniques: Avoid global sorting by removing the ORDER BY clause and then sorting the data using another query, or write the query using Pig Latin (which is much more efficient for global sorting).

Resolving Performance Limitations by Optimizing the Query
Performance tuning a query is often as much about the way that the query is written as it is about the
configuration and capabilities of the cluster software and hardware. It's possible, if the query is shown to be
overloading parts of the hardware, that you can split it into two or more queries and persist the intermediate
results between them to minimize memory usage.
You may even be able to store summarized or filtered intermediate results in a form that allows them to be
reused by more than one query, giving an overall gain in performance over several queries even if one takes a
little longer due to storage I/O limitations.
If you are using custom map and reduce components, you might consider implementing a combiner class as part
of the reducer component. The combiner aggregates the output from the map phase before it passes to the
reducer, which means that the reducer will operate more efficiently because the dataset is smaller. You can also
use the reducer as a combiner. This means that you effectively perform the reduce phase on each data node
after the map operation completes, before Hadoop collects the output from each data node and submits them
to the reducer on the selected data node(s) for the final reduce phase.
The values of the Combine input records and Combine output records job counters (in the History Viewer page
for the job in the Job Tracker portal) show the number of records passed into and out of the combiner, which
indicates how efficient the combiner is in reducing the size of the dataset passed to the reducer function.
For information about creating a combiner, see Hadoop Map/Reduce Tutorial in the Apache online
documentation.
In most HiveQL queries the map job is performed by multiple mappers, and the reduce job is performed by a
single reducer. However, you can control the number of reducers by setting the mapred.reduce.tasks property
to the number of reducers you want to use, or by setting the hive.exec.reducers.bytes.per.reducer property to
control the maximum number of bytes that can be processed by a single reducer.
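For example, the following minimal sketch (with a purely illustrative value) caps the amount of data handled by
a single reducer, leaving Hive to choose the number of reducers accordingly.
HiveQL
-- 256 MB per reducer (illustrative value, in bytes)
SET hive.exec.reducers.bytes.per.reducer=268435456;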
If you are using a HiveQL query that contains an ORDER BY clause, consider how you might be able to remove
this clause to allow Hadoop to execute more than one reducer instance. You may be able to sort the results
using a separate HDInsight query, or export the data and then sort it afterwards.
HiveQL also supports joins between tables, but these can be very slow for large datasets. Instead, if one of the
tables is relatively small (such as a lookup table), you can use a MAPJOIN operation to perform the join in each
mapper instance so that the join no longer takes place in the reduce phase. You can also enable the Auto Map-
Join feature by setting the value of the Hadoop configuration property hive.auto.convert.join to true, and
specifying the size of the largest table that will be treated as a small table in the hive.smalltable.filesize
property. You can set these properties dynamically for a specific query as shown in the following code sample.
HiveQL
SET hive.auto.convert.join=true;
SET hive.smalltable.filesize=25000000;

SELECT b.col2, s.col2
FROM bigtable b JOIN smalltable s
ON b.col1 = s.col1;
Alternatively, you can edit these properties in the configuration file hive-site.xml in the apps\dist\hive-
[version]\conf\ folder.
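You can also request the map-side join explicitly for a single query by including a MAPJOIN hint that names the
small table; the following sketch uses the same hypothetical tables as the previous example.
HiveQL
SELECT /*+ MAPJOIN(s) */ b.col2, s.col2
FROM bigtable b JOIN smalltable s
ON b.col1 = s.col1;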
For more information about the MAPJOIN operator see Hive Joins in the Apache documentation. For
information about editing configuration settings, see the section Configuring HDInsight Clusters and Jobs
earlier in this chapter.
The format of the files you create for use in your queries can also have an impact on the performance of jobs. By
using Sequence Files instead of text files you can often improve the efficiency and reduce the execution time for
jobs. Sequence Files support the built-in Hadoop data types such as VIntWritable and VLongWritable that
minimize memory usage and provide good write performance. Sequence Files are also splittable, which better
supports parallel processing on multiple map nodes, and they work well with block-level compression between
the stages of a map/reduce job.
If you have a chain of jobs you can typically output the data from the first job and any intermediate jobs in
Sequence File format, and then generate the final output for storage in an appropriate format such as a text file.
Resolving Performance Limitations by Compressing the Data
Bottlenecks caused by storage I/O limitations can be minimized in two ways. You can optimize the query to
reduce the amount of data being written to and read from disk in the map and reduce phases, as described in
the previous section of this chapter. However, if this is not possible, consider compressing the data to reduce
the number of bytes that must be written and read.
Data compression will increase CPU load as the data is decompressed and compressed at each stage, but
this is typically not an issue as storage I/O limitations are more likely than computation limitations on modern
servers.
You can compress the source data, the map output data, and the job output data (the output from the reduce
phase). You can compress data that is used by a query by applying a codec. HDInsight provides three different
codecs: DEFLATE, GZip, and Bzip2.
The largest gains in performance are likely to come from compressing the source data (which also reduces the
storage space used) and the map output data because these are the biggest datasets. Compressing the map
output data can also reduce the load on the network if investigation of the job counters and Windows
performance counters indicate a network bandwidth limitation is slowing query execution at the start of the
reduce phase.
For more information about the codecs provided with Hadoop, and how you can compress the source data in
HDInsight, see the section Compressing the Source Data in Chapter 3.
To enable compression for data within the query process you can use the following configuration properties:
Set the mapred.compress.map.output property to true to enable compression of the data output to
disk at the end of the map phase, and set the mapred.map.output.compress.codec property to the
name of the codec you want to use.
For very complex HiveQL operations, which can generate multiple map operations on each server, set
the hive.exec.compress.intermediate property to true so that data passed between each of these map
operations is compressed.
Set the mapred.output.compress property to true to enable compression of the data output from the
reduce stage (the job output data), and set the mapred.output.compress.codec property to the name
of the codec you want to use.
The default compression type is RECORD, which bases the compression algorithm on individual records.
For a higher compression rate, but with an increased impact on performance, set the
mapred.output.compression.type property to BLOCK so that the compression algorithm uses entire
blocks of data.

You can set these properties dynamically in a query. For example, the following HiveQL query compresses the
intermediate query results and, when applied to large volumes of data (typically over 1.5 GB), can improve
query performance significantly.
HiveQL
SET mapred.compress.map.output=true;
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

SELECT col1, SUM(col2) total
FROM mytable
GROUP BY col1;
Alternatively, you can add the mapred.compress.map.output, mapred.output.compress, mapred.map.output.compression.codec, and mapred.output.compression.type properties to the conf\mapred-site.xml configuration file; and add the hive.exec.compress.intermediate property to the hive-site.xml configuration file in the apps\dist\hive-[version]\conf\ folder.

You can change the compression settings when you submit a query by using a SET statement
or by including -D parameters in the Hadoop command line. For information about changing
configuration settings, see the section Configuring HDInsight Clusters and Jobs in Chapter 7
of this guide.

Resolving Performance Limitations by Adjusting the Cluster Configuration
When executing a query, the number of instances of the map and reduce tasks significantly affects performance.
You can change these settings to maximize the performance of a query, but it's vital to determine when this is
appropriate. Specifying too many can reduce performance just as much as specifying too few. You should use
the values in the job counters and other logs to decide if the number should be changed, and to measure the
effects of the changes you make.
The two Hadoop properties mapred.map.tasks and mapred.reduce.tasks specify the number of map and
reduce task instances. In most cases you should set these dynamically at runtime, though you can also add them
to the conf\mapred-site.xml configuration file.
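For example, the following sketch (the values are purely illustrative) sets the task counts for a single Hive job at runtime:
HiveQL
-- Request eight reduce tasks for the job generated by the next query.
SET mapred.reduce.tasks=8;
-- This value is only a hint; the number of input splits usually determines
-- how many map tasks actually run.
SET mapred.map.tasks=16;

SELECT col1, COUNT(*) AS row_count FROM mytable GROUP BY col1;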
You can also change the number of map and reduce slots available in the cluster, which gives Hadoop more
freedom to automatically allocate additional map and reduce tasks to a query, by setting the
mapred.tasktracker.reduce.tasks.maximum and mapred.tasktracker.map.tasks.maximum property values.

If you need to change the memory configuration settings for the servers in a cluster, perhaps to reduce the number of spilled records, you can set the following configuration properties, as shown in the example after this list:
The io.sort.mb property. This property specifies the size of memory in megabytes of the buffer used
when sorting the output of the map operations. The default in the src\mapred\mapred-default.xml file
is 100.
The io.sort.record.percent property. This property specifies the proportion of the memory buffer
defined in io.sort.mb that is reserved for storing record boundaries of the map outputs (typically 16
bytes per record). The remaining space is used for the map output records. The default in the
src\mapred\mapred-default.xml file is 0.05 (5%).
The io.sort.spill.percent property. This property specifies the proportion of the sort buffer that can fill before Hadoop starts to spill the data to disk. It applies to both the map output memory buffer and the record boundaries index. The default in the src\mapred\mapred-default.xml file is 0.80 (80%), and you may gain some additional performance by setting it to 1.
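The following sketch sets two of these properties at runtime for a single query; the values are illustrative, and this assumes the cluster configuration allows these properties to be overridden on a per-job basis:
HiveQL
-- Increase the map-side sort buffer from the default 100 MB to 200 MB.
SET io.sort.mb=200;
-- Allow the buffer to fill completely before spilling to disk.
SET io.sort.spill.percent=1.0;

SELECT col1, SUM(col2) AS total FROM mytable GROUP BY col1;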

Another factor that can affect performance through storage I/O and network load is the replication of the
output from a query, which occurs to provide redundancy should a cluster node fail. The default is to create
three replicas of the output. If you do not need the output to be replicated you can set the value of the
dfs.replication property at runtime, or edit it in the conf\hdfs-site.xml configuration file.
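For example, the following sketch (the folder and table names are hypothetical) reduces the replication factor for the output of a single job, assuming the loss of redundancy is acceptable for that output:
HiveQL
-- Write a single copy of the job output instead of the default three.
SET dfs.replication=1;

INSERT OVERWRITE DIRECTORY '/results/summary'
SELECT col1, SUM(col2) FROM mytable GROUP BY col1;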
Resolving Performance Limitations by Adding More Servers to the Cluster
As a last resort, if you cannot achieve acceptable performance using any of the methods previously described,
you may need to consider adding more servers to the cluster. This will typically require you to destroy and
recreate the cluster, and reload the data. Adding servers to a cluster will also increase the runtime cost.
If you are using Hadoop as a basic data warehouse this can be problematic as you will have to recreate all of the
Hive table definitions and any intermediate datasets used by the cluster. However, in HDInsight you can create a
new cluster that takes its data from an existing storage account, which contains the data used by the previous
cluster. This will reduce the work involved, although you will still need to recreate the table definitions.

For more information about recreating a cluster using the existing data, see the section Retaining Data and
Metadata for a Cluster in Chapter 7 of this guide.

Summary
This chapter described the core tasks when working with HDInsight: creating queries and transformations for
data. The number of tools and technologies associated with Hadoop is growing by the day, but only some of these are fully supported in HDInsight. This chapter focuses on the supported tools and technologies, which account for the vast majority of tasks you are likely to need to perform.
At the time of writing, Hive accounted for around 80% of all HDInsight jobs, and so this chapter concentrates first on explaining what Hive is and what you can do with it. The chapter includes only a rudimentary description of how you can use Hive, in just enough depth to introduce the concepts, make you aware of the issues to consider when using it, and help you recognize Hive queries as you learn more from the huge reservoir of knowledge and examples available on the web.
Next, the chapter covers the other two main query technologies, Pig and custom map/reduce, in enough depth for you to understand their usage and purpose. The chapter also covers subsidiary topics related to Big Data jobs, such as using the Hadoop Streaming interface to write map/reduce code in non-Java languages, using user-defined functions (UDFs) within your queries, and unifying your queries by using HCatalog.
Finally, the chapter covers the next two related stages of the typical Big Data query cycle: evaluating the results
to ensure they are valid and useful, and tuning the solution.
More Information
For more information about HDInsight, see the Windows Azure HDInsight web page.
A central point for TechNet articles about HDInsight is HDInsight Services For Windows.
For examples of how you can use HDInsight, see the following tutorials on the HDInsight website:
Using Hive with HDInsight
Using Pig with HDInsight
Using MapReduce with HDInsight

For more information about maximizing performance in HDInsight see the following Microsoft White Papers:
Hadoop Job Optimization
Performance of Hadoop on Windows in Hyper-V Environments



Chapter 5: Consuming and Visualizing the
Results
This chapter describes a variety of ways in which you can consume data from HDInsight for analysis and
reporting. It explores the typical tools that are part of the Microsoft data platform, and that can integrate
directly or indirectly with HDInsight. It also demonstrates how you can generate useful and attractive
visualizations of the information you generate using HDInsight.
Challenges
Unlike a traditional relational database system, HDInsight does not automatically expose its data in the standard
row and column format. For example, the results of a query may be just an output file stored in Windows Azure
blob storage. In a typical Big Data analysis solution the results of custom map/reduce or Pig processing are
generated as files in the HDFS file system (which, for HDInsight on Windows Azure, is hosted in Windows Azure
blob storage). One of the challenges is to be able to connect to and access this data in order to consume it in an
appropriate tool or application.
HDInsight can, however, generate the equivalent of a row and column dataset, which is typically a better match for most data analysis tools and for import into another database system. It is common to create Hive tables over
data stored in files in order to provide a query interface to that data. You can then use the ODBC driver for Hive
to consume the results of an HDInsight query, even when the data was originally generated by a Pig or
map/reduce operation during the earlier stages of the HDInsight solution.

When you generate results in HDInsight that you want to consume using other tools, you should consider
generating these results in a format that will best suit your target tools. The format should provide an efficient
import process that accurately copies the data without affecting the values or formatting in a way that could
introduce errors. Typically, you will layer a Hive query over the results of other queries so that the data is
available in the traditional row and column format.
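For example, the following HiveQL sketch (the folder, table, and column names are hypothetical) defines an external table over the tab-delimited output files produced by an earlier Pig or map/reduce job, so that ODBC clients can query the results as rows and columns:
HiveQL
-- Expose the delimited output files of an earlier job as a queryable table.
CREATE EXTERNAL TABLE sentiment_by_region (
  region STRING,
  sentiment_score FLOAT,
  observation_count INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/results/sentiment-by-region';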
A second challenge arises when you want to consume the results of a query process in an enterprise database or
business intelligence (BI) system, as discussed in Chapter 2 of this guide. In addition to performing the
appropriate data cleansing and pre-processing tasks, you must consider how to create a data export process
that can successfully push the data into the database or BI system, and how you can automate this process.
You may decide to automate some or all of the stages of consuming data that are described in this chapter.
Chapter 6 of this guide provides details of how you can create automated solutions using HDInsight and a range
of other tools and frameworks.
Options for Consuming Data from HDInsight
The previous chapter of this guide described two rudimentary ways to view the result of an HDInsight query to
validate that it produces meaningful and useful results. It demonstrated how you can use the interactive console
in the HDInsight management dashboard to view the contents of the data files, and to create simple graphical
data visualizations.
These techniques are useful during the development of an HDInsight solution, and in some cases might provide
a usable solution for simple analysis. However, in most enterprise environments the results of HDInsight data
processing are consumed in analytical or reporting tools that enable business users to explore and visualize the
insights gained from the Big Data process.
This chapter explores the options for consuming data processing results from HDInsight, and provides guidance
about using each option in the context of the architectural patterns for HDInsight solutions described earlier in
this guide.
The topics you will see discussed in this chapter are:
Consuming HDInsight data in Microsoft Excel.
Consuming HDInsight data in SQL Server Reporting Services (SSRS) and Windows Azure SQL Reporting.
Consuming HDInsight data in SQL Server Analysis Services (SSAS).
Consuming HDInsight data in a SQL Server database.
Custom and third party analysis and visualization tools.


There is a huge range of data analysis and BI tools available from Microsoft and from third
parties that you can use with your data warehouse and HDInsight systems. Microsoft
products include server-based solutions such as SQL Server Reporting Services and
SharePoint Server, cloud-based solutions such as Windows Azure SQL Reporting, and
desktop or Office applications such as Excel and Word.

A high level diagram of the options for consuming data from HDInsight that are discussed in this chapter is
shown in Figure 1.

Figure 1
Options for consuming data from HDInsight
Figure 1 represents a map that shows the various routes through which data can be delivered from an
HDInsight cluster to client applications. The client application destinations shown on the map include
Microsoft Excel, SQL Server Reporting Services (SSRS), and Windows Azure SQL Reporting; however, you can also
create custom applications or use common enterprise BI data visualization technologies such as
PerformancePoint Services in Microsoft SharePoint Server to consume data from HDInsight, directly or
indirectly, by using the same interfaces.
Consuming HDInsight Data in Microsoft Excel
Microsoft Excel is one of the most widely used software applications in the world, and is commonly used as a
tool for interactive data analysis and reporting. It supports a wide range of data import and connectivity options,
including:
Built-in data connectivity to a wide range of data sources.
The Data Explorer add-in.
PowerPivot.

Excel includes a range of analytical tools and visualizations that you can apply to tables of data in one or more
worksheets within a workbook. Additionally, all Excel 2013 workbooks encapsulate a data model in which you
can define tables of data and relationships between them. These data models make it easier to slice and dice
data in PivotTables and PivotCharts, and to create Power View visualizations.
Excel is especially useful when you want to add value and insight by augmenting the results of your data analysis
with external data. For example, you may perform an analysis of social media sentiment data by geographical
region in HDInsight, and consume the results in Excel. This geographically oriented data can be enhanced by
subscribing to a demographic dataset in the Windows Azure Data Market, from which socio-economic and
population data may provide an insight into why your organization is more popular in some locations than in
others.
As well as datasets that you can download and use to augment your results, the Windows
Azure Data Market includes a number of data services that you can use for data validation
(for example, verifying that telephone numbers and postal codes are valid) and for data
transformation (for example, looking up the country or region, state, and city for a
particular IP address or longitude/latitude value).
Built-In Data Connectivity
Excel includes native support for importing data from a wide range of data sources including web sites, OData
feeds, relational databases, and more. These data connectivity options are available on the Data tab of the
ribbon, and enable business users to specify data source connection information and select the data to be
imported through a wizard interface. The data is imported into a worksheet, and in Excel 2013 it can be added
to the workbook data model and used for data visualization in a PivotTable Report, PivotChart, Power View
Report, or GeoFlow map.
The built-in data connectivity capabilities of Excel include support for ODBC sources, and are an appropriate
option for consuming data from HDInsight when the data is accessible through Hive tables. To use this option
the Hive ODBC driver must be installed on the client computer where Excel will be used.
There is a 32-bit and a 64-bit version of the driver available for download from Microsoft
ODBC Driver For Hive. You should install the one that matches the version of Windows on the
target computer. When you install the 64-bit version it also installs the 32-bit version so you
will be able to use it to connect to Hive from both 64-bit and 32-bit applications.
You can simplify the process of connecting to HDInsight by using the Data Sources (ODBC) administrative tool to
create a data source name (DSN) that encapsulates the ODBC connection information, as shown in Figure 2.
Creating a DSN makes it easier for business users with limited experience of configuring data connections to
import data from Hive tables defined in HDInsight. If you set up both 32-bit and 64-bit DSNs using the same
name, clients will use the appropriate one automatically.

Figure 2
Creating an ODBC DSN for Hive
Almost all of the techniques for consuming data from HDInsight rely on the ODBC driver that you can
obtain and install from the Windows Azure HDInsight management portal. The ODBC driver provides a
standardized data exchange mechanism that works with most Microsoft tools and applications, and with tools
and applications from many third party vendors. It also makes it easy to consume the data in your own custom
applications.

After you create a DSN, users can specify it in the data connection wizard in Excel and then select a Hive table or
view to be imported into the workbook, as shown in Figure 3.

Figure 3
Importing a Hive table into Excel
The Hive tables shown in Figure 3 are used as an example throughout this chapter. These
tables contain UK meteorological observation data for 2012, obtained from the UK Met
Office Weather Open Data dataset in the Windows Azure Data Market at
https://datamarket.azure.com/dataset/datagovuk/metofficeweatheropendata.
Using the built-in data connectivity capabilities of Excel is a good solution when business users need to perform
simple analysis and reporting based on the results of an HDInsight query, as long as the data can be
encapsulated in Hive tables. Data import can be performed with any client computer on which the ODBC driver
for Hive is installed, and is supported in all editions of Excel.
In Excel 2013 users can add the imported data to the workbook data model and combine it with data from other
sources to create analytical mashups. However, in scenarios where complex data modeling is the primary goal,
using an edition of Excel that supports PowerPivot offers greater flexibility and modeling functionality and
makes it easier to create PivotTable Report, PivotChart, Power View, and GeoFlow visualizations.
The following table describes specific considerations for using Excel's built-in data connectivity in the HDInsight
solution architectures described earlier in this guide.
Architectural Model Considerations
Standalone
experimentation and
visualization
Built-in data connectivity in Excel is a suitable choice when the results of the data processing can
be encapsulated in a Hive table, or a query with simple joins can be encapsulated in a Hive view,
and the volume of data is sufficiently small to support interactive connectivity with tolerable
response times.
Data transfer, data
cleansing, and ETL
Most ETL scenarios are designed to transform Big Data into a suitable structure and volume for
storage in a relational data source for further analysis and querying. While Excel may be used to
consume the data from the relational data source after it has been transferred from HDInsight, it is
unlikely that an Excel workbook would be the direct target for the ETL process.
Basic data warehouse or
commodity storage
When HDInsight is used to create a basic data warehouse containing Hive tables, business users
can use the built-in data connectivity in Excel to consume data from those tables for analysis and
reporting. However, for complex data models that require multiple related tables and queries with
complex joins, PowerPivot may be a better choice.
Integration with
Enterprise BI
Importing data from a Hive table and combining it with data from a BI data source (such as a
relational data warehouse or corporate data model) is an effective way to accomplish report-level
integration with an enterprise BI solution. However, in self-service analysis scenarios advanced
users such as business analysts may require a more comprehensive data modeling solution such
as that offered by PowerPivot.

While Excel can, itself, provide a platform for wide-ranging exploration and visualization of data, the
powerful add-in tools such as the Data Explorer, PowerPivot, Power View, and GeoFlow can make this task easier
and help you to create more meaningful and immersive results.
The Data Explorer Add-In
The Data Explorer add-in enhances Excel by providing a comprehensive interface for querying a wide range of
data sources, and exploring the results. Data Explorer includes a data source provider for Windows Azure
HDInsight, which enables users to browse the HDFS folders in Windows Azure blob storage that are associated
with an HDInsight cluster. Users can then import data from files such as those generated by map/reduce jobs
and Pig scripts, and import the underlying data files associated with Hive tables.
After the Data Explorer add-in has been installed, the From Other Sources option on the Data Explorer tab in
the Excel ribbon can be used to import data from Windows Azure HDInsight. The user specifies the account
name and key of the Azure blob store, not the HDInsight cluster itself. This enables organizations to use this
technique to consume the results of HDInsight processing even if the cluster has been released, significantly
reducing costs if no further HDInsight processing is required.
Keeping an HDInsight cluster running when it is not executing queries just so that you can access the data
incurs charges to your Windows Azure account. If you are not using the cluster, you can close it down but still be
able to access the data at any time using the Data Explorer in Excel, or any other tool that can access Windows
Azure blob storage.
After connecting to the Azure blob store the user can select a data file, convert its contents to a table by
specifying the appropriate delimiter, and modify the data types of the columns before importing it into a
worksheet, as shown in Figure 4.

Figure 4
Importing data from HDInsight with the Data Explorer add-in
The imported data can be added to the workbook data model, or analyzed directly in the worksheet. The
following table describes specific considerations for using Data Explorer in the HDInsight solution architectures
described earlier in this guide.
Architectural Model Considerations
Standalone
experimentation and
visualization
Data Explorer is a good choice when HDInsight data processing techniques such as map/reduce
jobs or Pig scripts generate files that contain the results to be analyzed or reported. The HDInsight
cluster can be deleted after the processing is complete, leaving the results in the Azure blob
storage ready to be consumed by business users in Excel.
Data transfer, data
cleansing, and ETL
Most ETL scenarios are designed to transform Big Data into a suitable structure and volume for
storage in a relational data source for further analysis and querying. It is unlikely that Data
Explorer would be used to consume data files from the Windows Azure blob storage associated
with the HDInsight cluster, though it may be used to consume data from the relational data store
loaded by the HDInsight ETL process.
Basic data warehouse or
commodity storage
When HDInsight is used to implement a basic data warehouse it usually includes a schema of
Hive tables that are queried over ODBC connections. While it is technically possible to use Data
Explorer to import data from the files on which the Hive tables are based, more complex Hive
solutions that include partitioned tables would be difficult to consume this way.
Integration with
Enterprise BI
Importing data from a file in Windows Azure blob storage and combining it with data from a BI data
source (such as a relational data warehouse or corporate data model) is an effective way to
accomplish report-level integration with an enterprise BI solution. However, in self-service analysis
scenarios advanced users such as business analysts may require a more comprehensive data
modeling solution such as that offered by PowerPivot.

At the time this guide was written the Data Explorer add-in was a preview version. You can download it from the
Microsoft Download Center.

PowerPivot
The growing awareness of the value of decisions based on proven data, combined with advances in data analysis
tools and techniques, has resulted in an increased demand for versatile analytical data models that support ad-
hoc analysis (the self-service approach).
PowerPivot is an Excel-based technology in Microsoft Office 2013 Professional Plus that enables advanced users
to create complex data models that include hierarchies for drill-up/drill-down aggregations, custom Data Analysis Expressions (DAX) calculated measures, key performance indicators (KPIs), and other features not available in
basic data models. PowerPivot is also available as an add-in for previous releases of Microsoft Excel. PowerPivot
uses xVelocity compression technology to support in-memory data models that enable high-performance
analysis, even with extremely large volumes of data.

Users can create and edit PowerPivot data models by using the PowerPivot Manager interface, which is
accessed from the PowerPivot tab of the ribbon in Excel. Through this interface, users can enhance tables that
have been added to the workbook data model by other means, and also import multiple tables from one or
more data sources into the data model and define relationships between them. Figure 5 shows a data model in
PowerPivot Manager.


Figure 5
PowerPivot Manager

PowerPivot brings many of the capabilities of enterprise BI to Excel, enabling business analysts to create
personal data models for sophisticated self-service data analysis. Users can share PowerPivot workbooks
through SharePoint Server where they can be viewed interactively in a browser, enabling teams of analysts to
collaborate on data analysis and reporting.

The following table describes specific considerations for using PowerPivot in the HDInsight solution architectures
described earlier in this guide.

Architectural Model Considerations
Standalone
experimentation and
visualization
PowerPivot is a good choice when HDInsight data processing results can be
encapsulated in Hive tables and imported directly into the PowerPivot data
model over an ODBC connection, or when output files can be imported into the
Excel data model using Data Explorer. PowerPivot enables business analysts to
enhance the data model and share it across teams through SharePoint Server.
Data transfer, data
cleansing, and ETL
Most ETL scenarios are designed to transform Big Data into a suitable structure
and volume for storage in a relational data source for further analysis and
querying. In this scenario, PowerPivot may be used to consume data from the
relational data store loaded by the ETL process.
Basic data warehouse or
commodity storage
When HDInsight is used to implement a basic data warehouse it usually includes
a schema of Hive tables that are queried over ODBC connections. PowerPivot
makes it easy to import multiple Hive tables into a data model, define
relationships between them, and refresh the model with new data as the data
files in HDInsight are updated.
Integration with
Enterprise BI
PowerPivot is a core feature in Microsoft-based BI solutions, and enables self-
service analysis that combines enterprise BI data with Big Data processing
results from HDInsight at the report level.

Visualizing Data in Excel
After the data has been imported using any of the previously described techniques, business users can employ
the full range of Excel analytical and visualization functionality to explore the data and create reports that
include color-coded data values, charts, indicators, and interactive slicers and timelines.


PivotTables and PivotCharts
PivotTables and PivotCharts are a common way to aggregate data values across multiple dimensions. For
example, having imported a table of weather data, users could use a PivotTable or PivotChart to view
temperature aggregated by month and geographical region, as shown in Figure 6.


Figure 6
Analyzing data with a PivotTable and a PivotChart in Excel

PivotTables and PivotCharts are typically most useful for aggregating data across multiple dimensions. If
you want to generate interactive graphical reports that allow users to explore relationships, then Power View is a
better choice.

Power View
Power View is a data visualization technology that enables interactive, graphical exploration of data in a data
model. Power View is available as a component of SQL Server 2012 Reporting Services when integrated with
SharePoint Server, but is also available in Excel 2013. By using Power View, users can create interactive
visualizations that make it easy to explore relationships and trends in the data. Figure 7 shows how Power View
can be used to visualize the weather data in the PowerPivot data model described previously.


Figure 7
Visualizing data with Power View

GeoFlow
GeoFlow is an add-in for Microsoft Excel 2013 Professional Plus that enables you to visualize geospatial and
temporal analytical data on a map. With GeoFlow you can display geographically related values on an interactive
map, and create a virtual tour that shows the data in 3D. If the data has a temporal dimension you can
incorporate time-lapse animation into the tour to illustrate how the values change over time. Figure 8 shows a
GeoFlow visualization in Excel 2013.


Figure 8
Visualizing geospatial data with GeoFlow
GeoFlow was a pre-release add-in for Excel at the time this guide was written. You can download the GeoFlow
add-in from the Microsoft Office Download Center.
Consuming HDInsight Data in SQL Server Reporting Services
SQL Server 2012 Reporting Services (SSRS) provides a platform for delivering reports that can be viewed
interactively in a Web browser, printed, or exported and delivered automatically in a wide range of electronic
document formats, including Microsoft Excel, Microsoft Word, Portable Document Format (PDF), image, and
XML.
Reports consist of data regions such as tables of data values or charts, which are based on datasets that
encapsulate queries. The datasets used by a report are in turn based on data sources, which can include
relational databases, corporate data models in SQL Server Analysis Services, or other OLE DB or ODBC sources.

If you load the results of an HDInsight query into SQL Server or Windows Azure SQL Database you can
use the database reporting tools to produce reports in a variety of formats that are easily accessible to users. SQL
Server Reporting Services offers a wide range of options and fine control over the final reports. Windows Azure
SQL Reporting is less powerful, but can still meet most users' needs and requires no on-premises resources.
Reports can be created by professional report authors, who are usually specialist BI developers, using the
Report Designer tool in SQL Server Data Tools. This tool provides a Visual Studio-based interface for creating
data sources, datasets, and reports; and for publishing a project of related items to a report server. Figure 9
shows Report Designer being used to create a report based on data in HDInsight, which is accessed through an
ODBC data source and a dataset that queries a Hive table.


Figure 9
Report Designer
While many businesses rely on reports created by BI specialists, an increasing number of organizations are
empowering business users to create their own self-service reports. To support this scenario business users can
use Report Builder, shown in Figure 10. This is a simplified report authoring tool that is installed on demand
from a report server. To further simplify self-service reporting you can have BI professionals create and publish
shared data sources and datasets that can be easily referenced in Report Builder, reducing the need for business
users to configure connections or write queries.


Figure 10
Report Builder

Reports are published to a report server, where they can be accessed interactively through a web interface.
When Reporting Services is installed in native mode, the interface is provided in a web application named
Report Manager, as shown in Figure 11. You can also install Reporting Services in SharePoint-integrated mode,
in which case the reports are accessed in a SharePoint document library.


Figure 11
Viewing a report in Report Manager

Reporting Services provides a powerful platform for delivering reports in an enterprise, and can be used
effectively in a Big Data solution. The following table describes specific considerations for using Reporting
Services in the HDInsight solution architectures described earlier in this guide.

Architectural Model Considerations
Standalone
experimentation and
visualization
For one-time analysis and data exploration, Excel is generally a more appropriate tool than
Reporting Services because it requires less in the way of infrastructure configuration and provides a
more dynamic user interface for interactive data exploration.
Data transfer, data
cleansing, and ETL
Most ETL scenarios are designed to transform Big Data into a suitable structure and volume for
storage in a relational data source for further analysis and querying. In this scenario, Reporting
Services may be used to consume data from the relational data store loaded by the ETL process.
Basic data warehouse
or commodity storage
When HDInsight is used to implement a basic data warehouse it usually includes a schema of Hive
tables that are queried over ODBC connections. In this scenario you can use Reporting Services to
create ODBC-based data sources and datasets that query Hive tables.
Integration with
Enterprise BI
You can use Reporting Services to integrate data from HDInsight with enterprise BI data at the
report level by creating reports that display data from multiple data sources. For example, you could
use an ODBC data source to connect to HDInsight and query Hive tables, and an OLE DB data
source to connect to a SQL Server data warehouse. However, in an enterprise BI scenario that
combines corporate data and Big Data in formal reports, better integration can generally be
achieved by integrating at the data warehouse or corporate data model level, and by using a single
data source in Reporting Services to connect to the integrated data.
Windows Azure SQL Reporting
Windows Azure SQL Reporting is a cloud-based service that offers similar functionality to SSRS, but with some
limitations, the most significant of which is that Windows Azure SQL Reporting can only be used to consume
data from Windows Azure SQL Database. Therefore, to include the results of HDInsight data processing in a
report, you must transfer the results from HDInsight to Windows Azure SQL Database by using a technology
such as Sqoop (which is described later in this chapter).
Although Windows Azure SQL Reporting is a cloud-based service that natively generates browser-based
reports, you can also use it to generate reports that you embed in applications built with Windows Forms and
other technologies.
Some other features of SSRS are not supported in Windows Azure SQL Reporting; for example, you cannot
schedule report execution, create subscriptions that automatically distribute reports by email, or use
customized report extensions. However, you can create reports in multiple formats for rendering in a browser,
Report Viewer, and custom applications built with Windows Forms or other technologies.
For a comparison of SSRS and SQL Reporting see Guidelines and Limitations for Windows Azure SQL Reporting.
You can use Report Designer or Report Builder to author reports for Windows Azure SQL Reporting, and upload them to a Windows Azure-based report server. The following table describes specific considerations for using Windows Azure SQL Reporting in the HDInsight solution architectures described earlier in this guide.

Architectural Model Considerations
Standalone
experimentation and
visualization
The requirement to transfer the results from HDInsight into Windows Azure SQL Database adds
complexity to the process of visualizing the results; so this approach should only be used if you
intend to export the data to Windows Azure SQL Database or you specifically need cloud-based
reporting of the output from HDInsight.
Data transfer, data
cleansing, and ETL
If the target of the HDInsight-based ETL process is Windows Azure SQL Database, then Windows Azure SQL Reporting may be a suitable choice for reporting the results.
Basic data warehouse or
commodity storage
When HDInsight is used to implement a basic data warehouse it usually includes a schema of
Hive tables that are queried over ODBC connections. Since these cannot be accessed by Windows Azure SQL Reporting, it is generally unsuitable for this kind of solution.
Integration with
Enterprise BI
Most enterprise BI solutions are implemented in on-premises data warehouses and data models.
Unless a subset of the BI solution is replicated to Windows Azure SQL Database for cloud-based reporting, Windows Azure SQL Reporting is generally not suitable for enterprise BI integration.
Consuming HDInsight Data in SQL Server Analysis Services
In many organizations, enterprise BI reporting and analytics is a driver for business processes and decision
making that involves multiple users in different parts of the business. In these kinds of scenarios, analytical data
is often structured into a corporate data model (often referred to colloquially as a cube) and accessed from
multiple client applications and reporting tools. Corporate data models help simplify the creation of analytical
mashups and reports, and ensure consistent values are calculated for core business measures and key
performance indicators across the organization.
Appendix B of this guide provides an overview of enterprise BI systems that might be
useful if you are not familiar with the terminology and some special factors inherent in
such systems.
In the Microsoft data platform, SQL Server Analysis Services (SSAS) 2012 provides a mechanism for creating and
publishing corporate data models that can be used as a source for PivotTables in Excel, reports in SQL Server
Reporting Services, dashboards in SharePoint Server PerformancePoint Services, and other BI reporting and
analysis tools. You can install SSAS in one of two modes: tabular mode or multidimensional mode.
SSAS in Tabular Mode
When installed in tabular mode, SSAS 2012 can be used to host tabular data models that are based on the
xVelocity in-memory analytics engine. These models use the same technology and design as PowerPivot data
models in Excel, but can be scaled to handle much larger volumes of data and secured using enterprise-level role-based security. You create tabular data models in the Visual Studio-based SQL Server Data Tools development
environment, and you can choose to create the data model from scratch or import an existing PowerPivot for
Excel workbook.
Because tabular models support ODBC data sources you can easily include data from Hive tables in the data
model, and you can use HiveQL queries to pre-process the data as it is imported to create the schema that best
suits your analytical and reporting goals. You can then use the modeling capabilities of SQL Server Analysis
Services to create relationships, hierarchies, and other custom model elements to support the analysis that
users need to perform.
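For example, a HiveQL query such as the following sketch (based loosely on the weather data used in this chapter; the table and column names are hypothetical) could be used as the source query for a model table, pre-aggregating the observations to the granularity the model requires:
HiveQL
-- Aggregate raw observations by site and month before importing into the model.
SELECT site_name,
       MONTH(observation_date) AS observation_month,
       AVG(temperature) AS avg_temperature,
       SUM(rainfall) AS total_rainfall
FROM observations
GROUP BY site_name, MONTH(observation_date);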
Figure 12 shows a tabular model for the weather data used earlier in this chapter.

Figure 12
A tabular Analysis Services data model

The tabular data model mode of SQL Server Analysis Services supports loading data with the ODBC driver,
which means that you can connect to the results of a Hive query to analyze the data. However, the
multidimensional mode does not support ODBC, and so you must either import the data into SQL Server, or
configure a linked server to pass through the query and results.
The following table describes specific considerations for using tabular SSAS data models in the HDInsight
solution architectures described earlier in this guide.

Architectural Model Considerations
Standalone
experimentation and
visualization
If the results of HDInsight data processing can be encapsulated in Hive tables, and multiple users
must perform consistent analysis and reporting of the results, then an SSAS tabular model is an
easy way to create a corporate data model for analysis. However, if the analysis will only be
performed by a small group of specialist users, you can probably achieve this by using
PowerPivot in Excel.
Data transfer, data
cleansing, and ETL
If the target of the HDInsight-based ETL process is a relational database then you might build a
tabular data model based on the tables in the database to enable enterprise-level analysis and
reporting.
Basic data warehouse or
commodity storage
When HDInsight is used to implement a basic data warehouse it usually includes a schema of
Hive tables that are queried over ODBC connections. A tabular data model is a good way to
create a consistent analytical and reporting schema based on the Hive tables.
Integration with Enterprise
BI
If the enterprise BI solution already uses tabular SSAS models, you can add Hive tables to these
models and create any necessary relationships to support integrated analysis of corporate BI
data and Big Data results from HDInsight. However, if the corporate BI data warehouse is based
on a dimensional model that includes surrogate keys and slowly changing dimensions it can be
difficult to define relationships between tables in the two data sources, and integration may be
better accomplished at the data warehouse level.

SSAS in Multidimensional Mode
As an alternative to tabular mode, SSAS can be installed in multidimensional mode, which is the only mode
supported by releases of SSAS prior to SQL Server 2012. Multidimensional mode provides support for a more
established online analytical processing (OLAP) approach to cube creation, and includes some features that are
not supported, or are difficult to implement, in tabular data models (such as the ability to aggregate semi-
additive measures across accounting dimensions, or include international translations in the cube definition).
Additionally, if you plan to use SSAS data mining functionality you must install SSAS in multidimensional mode.
Multidimensional data models can be built on OLE DB data sources, but due to some restrictions in the way cube
elements are implemented you cannot use an ODBC data source for a multidimensional data model. This means
that there is no way to directly connect dimensions or measure groups in the data model to Hive tables in an
HDInsight cluster.
To use HDInsight data as a source for a multidimensional data model in SSAS you must either transfer the data
from HDInsight to a relational database system such as SQL Server, or define a linked server in a SQL Server
instance that can act as a proxy and pass queries through to Hive tables in HDInsight. The use of linked servers
to access Hive tables from SQL Server, and techniques for transferring data from HDInsight to SQL Server, are
described in the next section of this chapter.
Figure 13 shows a multidimensional data model in SQL Server Data Tools. This data model is based on views in a
SQL Server database, which are in turn based on queries against a linked server that references Hive tables in
HDInsight.

Figure 13
A multidimensional Analysis Services data model
The following table describes specific considerations for using multidimensional SSAS data models in the
HDInsight solution architectures described earlier in this guide.
Architectural Model Considerations
Standalone
experimentation and
visualization
For one-time analysis, or analysis by a small group of users, the requirement to use a relational
database such as SQL Server as a proxy or interim host for the HDInsight results means that this
approach involves more effort than using a tabular data model or just analyzing the data in Excel.
Data transfer, data
cleansing, and ETL
If the target of the HDInsight-based ETL process is a relational database that can be accessed
through an OLE DB connection, you might build a multidimensional data model based on the
tables in the database to enable enterprise-level analysis and reporting.
Basic data warehouse or
commodity storage
When HDInsight is used to implement a basic data warehouse it usually includes a schema of
Hive tables that are queried over ODBC connections. To include these tables in a
multidimensional SSAS data model you would need to use a linked server in a SQL Server
instance to access the Hive tables on behalf of the data model, or transfer the data from HDInsight
to a relational database. Unless you specifically require advanced OLAP capabilities that can only
be provided in a multidimensional data model, a tabular data model is probably a better choice.
Integration with
Enterprise BI
If the enterprise BI solution already uses multidimensional SSAS models that you want to extend
to include data from HDInsight then you should integrate the data at the data warehouse level and
base the data model on the relational data warehouse.
Consuming HDInsight Data in a SQL Server Database
Many organizations already use Microsoft data platform components to perform data analysis and reporting,
and to implement enterprise BI solutions. Additionally, many tools and reporting applications make it easier to
query and analyze data in a relational database than to consume the data directly from HDInsight. For example,
as discussed previously in this chapter, Windows Azure SQL Reporting Services can only consume data from
Windows Azure SQL Database, and multidimensional SSAS data models can only be based on relational data
sources.
After you have discovered, through experimentation, a query that provides the data you require, you will probably want to integrate HDInsight with your existing BI system. This allows you to easily repeat the query on a regular basis and use the results to update your knowledge base.
The wide-ranging support for working with data in a relational database means that, in many scenarios, the
optimal way to perform Big Data analysis is to process the data in HDInsight but consume the results through a
relational database system such as SQL Server. You can accomplish this by enabling the relational database to
act as a proxy or interface that transparently accesses the data in HDInsight on behalf of its client applications,
or by transferring the results of HDInsight data processing to tables in the relational database.
Linked Servers
Linked servers are server-level connection definitions in a SQL Server instance that enable queries in the local
SQL Server engine to reference tables in remote servers. You can use the ODBC driver for Hive to create a linked
server in a SQL Server instance that references an HDInsight cluster, enabling you to execute Transact-SQL
queries that reference Hive tables.
To create a linked server you can either use the graphical tools in SQL Server Management Studio or the
sp_addlinkedserver system stored procedure, as shown in the following code sample.
Transact-SQL
EXEC master.dbo.sp_addlinkedserver
@server = N'HDINSIGHT', @srvproduct=N'Hive',
@provider=N'MSDASQL', @datasrc=N'HiveDSN',
@provstr=N'Provider=MSDASQL.1;Persist Security Info=True;User ID=UserName;
Password=P@assw0rd;'
After you have defined the linked server you can use the Transact-SQL OpenQuery function to execute pass-
through queries against the Hive tables in the HDInsight data source, as shown in the following code sample.
Transact-SQL
SELECT * FROM OpenQuery(HDINSIGHT, 'SELECT * FROM Observations');

Using a four-part name in a distributed query, rather than a pass-through query with the OpenQuery function, is not always a good idea because the syntax of HiveQL differs from Transact-SQL in several ways.
By using this technique you can create views in a SQL Server database that act as pass-through queries against
Hive tables, as shown in Figure 14. These views can then be queried by analytical tools that connect to the SQL
Server database.

Figure 14
Using HDInsight as a linked server over ODBC
The following table describes specific considerations for using linked servers in the HDInsight solution
architectures described earlier in this guide.
Architectural Model Considerations
Standalone
experimentation and
visualization
For one-time analysis, or analysis by a small group of users, the requirement to use a relational
database such as SQL Server as a proxy or interim host for the HDInsight results means that this
approach involves more effort than using a tabular data model or just analyzing the data in Excel.
Data transfer, data
cleansing, and ETL
Generally, the target of an ETL process is a relational database, making a linked server to Hive tables unnecessary.
Basic data warehouse or
commodity storage
Depending on the volume of data in the data warehouse, and the frequency of queries against the
back-end Hive tables, creating a SQL Server façade for a Hive-based data warehouse might make
it easier to support a wide range of client applications. A linked server is a suitable solution for
populating data models on a regular basis when they are processed during out-of-hours periods,
or for periodically refreshing cached datasets for Reporting Services. However, the performance of
pass-through queries over an ODBC connection may not be sufficient to meet your users' expectations for interactive querying and reporting directly in client applications such as Excel.
Integration with
Enterprise BI
If the ratio of Hive tables to data warehouse tables is small, or they are relatively rarely queried, a
linked server might be a suitable way to integrate data at the data warehouse level. However, if
there are many Hive tables or if the data in the Hive tables must be tightly integrated into a
dimensional data warehouse schema, it may be more effective to transfer the data from HDInsight
to local tables in the data warehouse.
PolyBase
PolyBase is a data integration technology in SQL Server 2012 Parallel Data Warehouse (PDW) that enables data
in an HDInsight cluster to be queried as native tables in the SQL Server data warehouse. SQL Server PDW is an
edition of SQL Server that is only available pre-installed on a data warehouse hardware appliance, and it uses a
massively parallel processing (MPP) architecture to implement highly scalable relational data warehouse
solutions.
PolyBase enables parallel data movement between SQL Server and HDInsight, and supports standard Transact-
SQL semantics such as GROUP BY and JOIN clauses that reference large volumes of data in HDInsight. This
enables SQL Server PDW to provide an enterprise-scale data warehouse solution that combines relational data
in data warehouse tables and data in an HDInsight cluster.
For more information about SQL Server PDW and PolyBase, see the page PolyBase on the SQL Server website.
The following table describes specific considerations for using PolyBase in the HDInsight solution architectures
described earlier in this guide.
Architectural Model Considerations
Standalone
experimentation and
visualization
For one-time analysis, or analysis by a small group of users, the requirement to use a SQL Server
PDW appliance may be cost-prohibitive.
Data transfer, data
cleansing, and ETL
Generally, the target of the ETL process is a relational database, making PolyBase integration with
HDInsight unnecessary.
Basic data warehouse
or commodity storage
If the volume of data and number of query requests are extremely high, using a SQL Server PDW
appliance as a data warehouse that includes HDInsight data through PolyBase might be the most
cost-effective way to achieve the required levels of performance and scalability your data
warehousing solution requires.
Integration with
Enterprise BI
If your enterprise BI solution already uses SQL Server PDW, or the combined scalability and
performance requirements for enterprise BI and Big Data analysis is extremely high, using SQL
Server PDW with PolyBase might be a suitable solution.
However, you should note that PolyBase does not inherently integrate HDInsight data into a
dimensional data warehouse schema, so if you need to include Big Data in dimension members that
use surrogate keys or need to support slowly changing dimensions, you may need to do some
additional integration work.
Sqoop
Sqoop is a Hadoop technology that is included in HDInsight. It is designed to make it easy to transfer data
between Hadoop clusters and relational databases. You can use Sqoop to export data from HDInsight data files
to SQL Server database tables by specifying the location of the data files to be exported, and a JDBC connection
string for the target SQL Server instance. For example, you could run the following command on an HDInsight
server to copy the data in the /hive/warehouse/observations folder to the observations table in a Windows
Azure SQL Database named mydb located in a server named jkty65.
Sqoop command
sqoop export --connect "jdbc:sqlserver://jkty65.database.windows.net:1433;
database=mydb;user=username@jkty65;password=Pa$$w0rd;
logintimeout=30;"
--table observations
--export-dir /hive/warehouse/observations
A clustered index must be defined on the destination table in Windows Azure SQL Database.
A key requirement is that network connectivity can be successfully established between the HDInsight cluster
where the Sqoop command is executed and the SQL Server instance to which the data will be transferred. When
used with HDInsight for Windows Azure, this means that the SQL Server instance must be accessible from the
Windows Azure service where the cluster is running, which may not be permitted by security policies in organizations where the target SQL Server instance is hosted in an on-premises data center.
Sqoop is generally a good solution for transferring data from HDInsight to Windows Azure SQL Database servers, or to instances of SQL Server that are hosted in virtual machines running in Windows Azure, but it can present connectivity challenges when used with on-premises database servers.
Sqoop is ideal for transferring data between HDInsight and a database located in the cloud, but may be
more difficult to use with on-premises database servers due to network security restrictions.
The following table describes specific considerations for using Sqoop in the HDInsight solution architectures
described earlier in this guide.
Architectural Model Considerations
Standalone
experimentation and
visualization
For one-time analysis, or analysis by a small group of users, using Sqoop is a simple way to
transfer the results of data processing to Windows Azure SQL Database or a SQL Server instance
for reporting or analysis.
Data transfer, data
cleansing, and ETL
Generally, the target of an ETL process is a relational database, and Sqoop may be the mechanism used to load the transformed data into the target database.
Basic data warehouse or
commodity storage
When using HDInsight as a data warehouse for Big Data analysis the data is generally accessed
directly in Hive tables, making transfer to a database using Sqoop unnecessary.
Integration with
Enterprise BI
When you want to integrate the results of HDInsight processing with an enterprise BI solution at
the data warehouse level you can use Sqoop to transfer data from HDInsight to the data
warehouse tables, or more commonly to staging tables from where it will be loaded into the data
warehouse. However, if network connectivity between HDInsight and the target database is not
possible you may need to consider an alternative technique to transfer the data, such as SQL
Server Integration Services.
SQL Server Integration Services
SQL Server Integration Services (SSIS) provides a flexible platform for building ETL solutions that transfer data
between a wide range of data sources and destinations while applying transformation, validation, and data
cleansing operations to the data as it passes through a data flow pipeline. SSIS is a good choice for transferring
data from HDInsight to SQL Server when network security policies make it impossible to use Sqoop, or when you
must perform complex transformations on the data as part of the import process.
We use SQL Server Integration Services to ensure that the data we load into our BI system is valid. This
includes data cleansing operations such as changing multiple values to a standard leading value so that queries
against the data provide the correct results. Appendix B of this guide explains some of the areas where data
cleansing is useful, if not vital.
To transfer data from HDInsight to SQL Server using SSIS you can create an SSIS package that includes a data
flow task. The data flow task minimally consists of a source, which is used to extract the data from the HDInsight cluster, and a destination, which is used to load the data into a SQL Server database. The data flow might also
include one or more transformations, which apply specific changes to the data as it flows from the source to the
destination.
The source used to extract data from HDInsight can be an ODBC source that uses a HiveQL query to retrieve data
from Hive tables, or a custom source that programmatically downloads data from files in the Windows Azure
blob storage that hosts the HDFS folders used by the HDInsight cluster.
Figure 15 shows an SSIS data flow that uses an ODBC source to extract data from Hive tables and then applies a
transformation to convert the data types of the columns returned by the query, before loading the transformed
data into a table in a SQL Server database.


Figure 15
Using SSIS to transfer data from HDInsight to SQL Server
The following table describes specific considerations for using SSIS in the HDInsight solution architectures
described earlier in this guide.
Standalone experimentation and visualization: For one-time analysis, or analysis by a small group of users, using
SSIS can be a simple way to transfer the results of data processing to SQL Server for reporting or analysis.
Data transfer, data cleansing, and ETL: Generally, the target of the ETL processes is a relational database. In some
cases the HDInsight ETL process might be used to transform the data into a suitable structure, and then SSIS can
be used to complete the ETL process by transferring the transformed data to SQL Server.
Basic data warehouse or commodity storage: When using HDInsight as a data warehouse for Big Data analysis
the data is generally accessed directly in Hive tables, making transfer to a database via SSIS unnecessary.
Integration with Enterprise BI: SSIS is often used to implement ETL processes in an enterprise BI solution, so it's a
natural choice when you need to extend the BI solution to include big data processing results from HDInsight.
SSIS is particularly useful when you want to integrate data from HDInsight into an enterprise data warehouse,
where you will typically use SSIS to extract the data from HDInsight into a staging table, and then use a separate
SSIS package to load the staged data in synchronization with data from other corporate sources.
Custom and Third Party Analysis and Visualization Tools
There are many third party tools for consuming data and generating analytics and visualizations that are not
covered in this chapter. Useful lists of analysis and visualization tools can be found at Datavisualization.ch (see A
Carefully Selected List of Recommended Tools) and at ComputerWorld (see 30+ free tools for data visualization
and analysis).
Using the ODBC driver for Hive allows us to collect the data we want to visualize and then display it using
any of the myriad graphics libraries or charting frameworks that are available. We can also use Linq to query the
data and build powerful and interactive custom analytical UIs and tools.
You can also build your own custom UI. You might choose to use one of the many graphics libraries, such as
D3.js, that are available for use in data visualization tools and web applications. The .NET Framework also
contains many classes that make it easier to create attractive graphical visualizations such as charts in your
custom applications.
To access the data required to build your UI content you can connect to the output generated by HDInsight
using the ODBC driver directly. However, developers working with the .NET Framework often use the Language
Integrated Query (Linq) technology to work with data. The Microsoft .NET SDK for Hadoop includes the
HiveConnection class, which enables you to write code that uses a Linq query to extract data from Hive. You can
use this class to build custom applications that consume data from Hive tables in an HDInsight cluster. Figure 16
illustrates this feature of the SDK.

Figure 16
Using LinqToHive to build custom visualizations from HDInsight data
For information about using LinqToHive see Microsoft .NET SDK For Hadoop on Codeplex.
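As a simple illustration of the direct ODBC approach described above, the following sketch uses the .NET ODBC classes to run a HiveQL query against the cluster and read the results, ready to pass to a charting library or custom visualization component. It assumes that the Hive ODBC driver is installed and that a DSN has been configured for the cluster; the DSN name, credentials, and query are placeholders that you would replace with your own values.
C#
using System;
using System.Data.Odbc;

class HiveOdbcQuery
{
    static void Main()
    {
        // "HiveDSN" is a placeholder for a system DSN configured for the
        // Hive ODBC driver and pointing to your HDInsight cluster.
        const string connectionString =
            "DSN=HiveDSN;UID=your-user-name;PWD=your-password";

        using (var connection = new OdbcConnection(connectionString))
        using (var command = new OdbcCommand(
            "SELECT * FROM hivesampletable LIMIT 10", connection))
        {
            connection.Open();
            using (OdbcDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Pass each row to your charting or visualization code;
                    // here the values are simply written to the console.
                    for (int i = 0; i < reader.FieldCount; i++)
                    {
                        Console.Write(reader[i] + "\t");
                    }
                    Console.WriteLine();
                }
            }
        }
    }
}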
Summary
This chapter described a variety of technologies and techniques that you can use to consume data from
HDInsight and use it to perform analysis or reporting. It described the options available for using Microsoft
Excel, SQL Server Reporting Services and SQL Azure Reporting, and SQL Server Analysis Services, and for loading
the data into a SQL Server database to make it available to a range of other tools.
This chapter concludes the exploration of the main loading, querying, and analysis techniques for use with
HDInsight. In the next chapter you will see how you can combine many of these techniques, and explore other
tools and frameworks that help you to build complete end-to-end solutions around Windows Azure HDInsight.
More Information
For more information about HDInsight, see the Windows Azure HDInsight web page.
For more information about the tools and add-ins for Excel see the Microsoft Excel website.
For more information about the Microsoft .NET SDK for Hadoop, see the CodePlex project site at
http://hadoopsdk.codeplex.com/.

Chapter 6: Debugging, Testing, and
Automating HDInsight Solutions
Now that you've seen how HDInsight works and how you can use it in a range of different ways, this chapter
explores two topics that will help you to get the most from HDInsight. As with all development tasks you are
likely to need to debug and test your code, especially if you are writing custom map/reduce components. In
addition, after you find an HDInsight solution that provides useful information, you will want to integrate the
process with your existing business systems by automating some or all of the stages, or by incorporating a
workflow.
Challenges
HDInsight solutions are not usually code-heavy. Often you will use high-level tools such as Pig and Hive to create
scripts that are executed by Hadoop. Techniques for debugging and testing these scripts are generally quite
rudimentary. However, if you write complex map/reduce code and you are familiar with working with Java you
may want to apply more comprehensive testing to your solution. For example, you may need to resolve issues
with parallel execution of map and reduce components in a multi-node cluster, or view runtime information and
intermediate values.
Many scenarios for using a Big Data solution such as HDInsight will focus on exploring data, perhaps from newly
discovered sources, and then iteratively refining the queries and transformations used to find insights within
that data. After you discover questions that provide useful and valid information from the data, you will
probably want to explore how you can automate the process. For example, you may want to make it repeatable
without requiring manual interaction every time, and perhaps execute it automatically on a schedule.

If you are integrating HDInsight with your existing business processes you will probably want to automate
the solution to avoid the need for repeated manual interaction. This approach can also help to minimize errors,
and by setting permissions on the client applications you can also limit access so that only authorized users can
execute them.

HDInsight supports several technologies and techniques that you can use instead of the manual interaction
methods you have seen described in this guide. Several of these automation technologies are used in the
scenario examples later in this guide. In this chapter you will see the most common techniques and tools for job
automation, and advice on choosing and implementing the most appropriate ones for your own situation.

The chapter covers the following topics:
Debugging and testing HDInsight solutions.
Executing HDInsight jobs remotely using the .NET SDK for Hadoop.
Creating HDInsight automated workflows using Oozie, SQL Server Integration Services, third-party
frameworks, and custom tools.
Building end-to-end solutions that use HDInsight.

Debugging and Testing HDInsight Solutions
Debugging and testing an HDInsight solution is more difficult than debugging and testing a typical local application. Executable
applications that run on the local development machine, and web applications that run on a local development
web server such as the one built into Visual Studio, can easily be run in debug mode within the integrated
development environment. Developers use this technique to step through the code as it executes, view variable
values and call stacks, monitor procedure calls, and much more.
None of these functions are available when running code in a remote HDInsight cluster on Windows Azure.
However, there are some debugging techniques you can apply. This section contains information that will help
you to understand how to go about debugging and testing when developing HDInsight solutions. Subsequent
sections of this chapter discuss the following topics in more detail:
Debugging by writing out significant values or messages during execution. You can add extra statements
or instructions to your scripts or components to display the values of variables, export datasets, write
messages, or increment counters at significant points during the execution.
Obtaining debugging information from log files. You can monitor log files and standard error files for
evidence that will help you locate failures or problems.
Using a single-node cluster for testing and debugging. Running the solution in a local or remote single-
node cluster can help to isolate issues with parallel execution of mappers and reducers.

Performing runtime debugging and single-stepping through the code in map and reduce
components is not possible within the Hadoop environment. If you want to perform this type
of debugging, or run unit tests on components, you can create an application that executes
the components locally, outside of Hadoop, in a development environment that supports
debugging; and use a mocking framework and test runner to perform unit tests.
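If you take this approach, one option is to factor the core logic of each mapper or reducer into a plain method that can be executed, debugged, and unit tested outside of Hadoop. The following minimal sketch illustrates the idea; the class names, the expected input format, and the simple console-based check are illustrative only, and in practice you would invoke the test from a test runner such as MSTest or NUnit, using a mocking framework for any external dependencies.
C#
using System;

// The core map logic is factored into a plain method so that it can be
// executed and debugged locally, outside of Hadoop.
public static class LogLineMapper
{
    // Converts one input line (tab-separated fields) into a key/value pair
    // in the same form that the real mapper would emit.
    public static string MapLine(string inputLine)
    {
        var fields = inputLine.Split('\t');
        if (fields.Length < 2)
        {
            throw new FormatException("Expected at least two fields: " + inputLine);
        }
        return fields[0] + "\t1";
    }
}

public static class LogLineMapperTests
{
    public static void Main()
    {
        // A simple local check; in practice this would be a test method
        // executed by a unit test runner.
        string output = LogLineMapper.MapLine("2013-05-01\tGET /index.html");
        Console.WriteLine(output == "2013-05-01\t1" ? "Passed" : "Failed");
    }
}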

However, Hadoop jobs may fail for reasons other than an error in the scripts or code. The two primary reasons
are timeouts and bad input data. By default, Hadoop will abandon a job if it does not report its status or perform
I/O activity every ten minutes. Typically most jobs will do this automatically, but some processor-intensive tasks
may take longer.
You can ensure that a map/reduce job written in Java is kept alive by calling methods of the TaskReporter
instance such as setStatus and progress (see TaskReporter class on the Apache website), or by incrementing a
counter. Another technique that you can apply to all types of jobs is to increase the value of the
mapred.task.timeout property.
If a job fails due to bad input data that the map and reduce components cannot handle, you can instruct Hadoop
to skip bad records. While this may affect the validity of the output, skipping small volumes of the input data
may be acceptable. For more information, see the section Skipping Bad Records in the MapReduce Tutorial on
the Apache website.
Writing Out Significant Values or Messages during Execution
A traditional debugging technique is to simply output values as a program executes to indicate progress. It
allows developers to check that the program is executing as expected, and helps to isolate errors in the code.
You can use this technique in HDInsight in several ways:
If you are using Hive you might be able to split a complex script into separate simpler scripts, and
display the intermediate datasets to help locate the source of the error.
If you are using a Pig script you can:
Dump messages and/or intermediate datasets generated by the script to disk before executing the
next command. This can indicate where the error occurs, and provide a sample of the data for you
to examine.
Call methods of the EvalFunc class that is the base class for most evaluation functions in Pig. You
can use this approach to generate heartbeat and progress messages that prevent timeouts during
execution, and to write to the standard log file. See Class EvalFunc<T> on the Pig website for more
information.
If you are using custom map and reduce components you can write debugging messages to the
standard output file from the mapper and then aggregate them in the reducer, or generate status
messages from within the components (a simple sketch of this approach appears after this list). For example:
You can write messages to the standard error (stderr) log file, including the values of significant
variables or execution information that can assist debugging. If you include error handling code in
your components, use this approach to write messages to the standard error output. A standard
error file is created on each node, so you may need to check all of the nodes to find the problem.
You can use the setStatus method of the TaskReporter class to update the task status using a string
message; the progress method to update the percentage complete for a job; or the incrCounter
method to increment a custom or built-in counter value. When you use a counter, the result is
aggregated from all of the nodes.
If the problem arises only occasionally, or on only one node, it may be due to bad input data. If skipping
bad records is not appropriate, add code to your mapper class that validates the input data and reports
any errors while attempting to parse or manipulate it. Write the details and an extract of the data that
caused the problem to the standard error file.
You can instruct Hadoop to store a representative sample of data from executing a job in the log file
folder by setting the mapred.task.profile, mapred.task.profile.maps, and mapred.task.profile.reduces
configuration properties for a job. See Profiling on the Apache Hadoop website for more information.
You can kill a task while it is operating and view the call stack and other useful information such as
deadlocks. To kill a job, execute the command kill -QUIT [job_id]. The job ID can be found in the Job
Tracker portal (accessed on port 50030 of your cluster's head node). The debugging information is
written to the standard output (stdout) file.
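As a simple illustration of the approach mentioned above for custom components, the following sketch shows a .NET console application that could be used as a mapper through the Hadoop streaming interface (described in Chapter 4). It writes debugging messages to the standard error output, and uses the standard streaming reporter protocol (also written to stderr) to update the task status and increment a custom counter. The expected input format and the counter names are illustrative only.
C#
using System;

// A minimal console-application mapper for use with the Hadoop streaming
// interface. It reads lines from stdin and emits tab-separated key/value
// pairs on stdout. The standard error stream is used both for debugging
// messages and for the streaming "reporter:" protocol that updates status
// and counters.
public class DebugAwareMapper
{
    public static void Main()
    {
        string line;
        while ((line = Console.In.ReadLine()) != null)
        {
            var fields = line.Split('\t');
            if (fields.Length < 2)
            {
                // Write a debugging message and an extract of the bad input
                // to the standard error log for this task.
                Console.Error.WriteLine("Malformed input: " + line);

                // Increment a custom counter; counter values are aggregated
                // across all of the nodes.
                Console.Error.WriteLine("reporter:counter:Debug,MalformedRows,1");
                continue;
            }

            // Update the task status to show progress and prevent timeouts
            // during long-running work.
            Console.Error.WriteLine("reporter:status:Processing " + fields[0]);

            // Emit the key/value pair for the reducer.
            Console.WriteLine(fields[0] + "\t" + fields[1]);
        }
    }
}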

The Hadoop engine generates a wealth of runtime information as it processes jobs. Before you start
modifying scripts or code to find errors, take advantage of this information using the Job Tracker and the Health
and Status Monitor portals on the Hadoop cluster.
Obtaining Debugging Information from Log Files
The core Hadoop engine in HDInsight generates a range of information in log files, counters, and status
messages that is useful for debugging and testing the performance of your solutions. Much of this information is
accessible through the Job Tracker portal (accessed on port 50030 of your cluster's head node in a Remote
Desktop session) and the Health and Status Monitor (accessed on port 50070).
For more details of using the Job Tracker portal and the Health and Status Monitor, see the section Accessing
Performance Information in Chapter 4 of this guide.
The following list contains some suggestions to help you obtain debugging information from the log files
generated by HDInsight:
Use the Task page of the Job Tracker portal to view the number of errors. The Task Details page shows the
errors, and provides links to the log files and the values of custom and built-in counters.
View the history, job configuration, syslog, and other log files directly. For more information, see the
section Accessing Hadoop Log Files Directly in Chapter 7.
Change the default settings of the Apache log4j mechanism that generates log files to show additional
information, or to change the logging behavior as required. The configuration file named
log4j.properties is in the conf\ folder of the HDInsight installation, and you can access it using a Remote
Desktop connection.
Run a debug information script automatically to analyze the contents of the standard error, standard
output, job configuration, and syslog files. Set the properties mapred.map.task.debug.script and
mapred.reduce.task.debug.script to specify the scripts to run over the output from the map and reduce
tasks.

For more information about running debug scripts, see How to Debug Map/Reduce Programs and the
Debugging section of the MapReduce Tutorial on the Apache website.
Using a Single-node Local Cluster for Testing and Debugging
Sometimes errors can occur due to the parallel execution of multiple map and reduce components on a multi-
node cluster. One way to isolate this problem is to execute the solution on a single-node cluster. You can create
a single-node HDInsight cluster in Windows Azure by specifying the advanced option when creating the cluster.
Alternatively you can install the single-node development environment on your local computer and execute the
solution there.
You can download the single-node development version of HDInsight from the Microsoft Download Center.
By using a single-node local cluster you can rerun failed jobs and adjust the input data, or use smaller datasets,
to help you isolate the problem. Set the keep.failed.task.files configuration property to true to instruct the data
node to retain failed task files when a job fails. These files may assist you in locating errors in the code.
To rerun the job go to the \taskTracker\task-id\work folder and execute the command % hadoop
org.apache.hadoop.mapred.IsolationRunner ../job.xml. This runs the failed task in a single Java virtual machine
over the same input data.
You can also turn on debugging in the Java virtual machine to monitor execution. More details of configuring the
parameters of a Java virtual machine can be found on several websites, including Java virtual machine settings
on the IBM website.
Executing HDInsight Jobs Remotely
Previous chapters in this guide have described how to execute map/reduce jobs on an HDInsight cluster using
custom map/reduce code, Pig, and Hive. While you can execute these jobs by using the interactive console in
the HDInsight management portal or in a remote desktop session with the cluster, a common requirement is to
initiate and manage jobs remotely.
Instantiating jobs remotely can be achieved in a range of ways. For example, you can:
Use the WebHCatalog API in the .NET SDK for Hadoop to interact with HCatalog (which manages
metadata and can execute jobs). This is a good option when you want to:
Execute jobs in the cluster using code running on a client machine.
Take advantage of the abstractions provided in the SDK to simplify the code.
Use the built-in connection-retry mechanism in the SDK classes to minimize the chance of
connectivity issues to the cluster.
Build end-to-end applications that load, process, and extract data and generate reports or
visualizations automatically.
Use scripts or batch files that run on the name node of the cluster to execute tasks at the command
line. This is a good option when you want to:
Quickly create tasks that automate jobs within the cluster.
Take advantage of the wide range of tools and features installed with HDInsight.
Run jobs on a schedule, perhaps by using Windows Scheduled Tasks.

You will need to use a Remote Desktop connection to communicate with the cluster and administer the
processes when you use scripts and batch files that run on the cluster.
Using WebHCatalog to Execute Jobs Remotely
HCatalog provides an abstraction layer between jobs and file storage, and can be used to share common tabular
metadata across map/reduce code, and Pig and Hive commands. HCatalog provides a REST-based API named
WebHCatalog for executing commands remotely, and this API is the primary means for initiating data processing
jobs from a client application.
WebHCatalog is often referred to by its project name, Templeton. For more information about using
WebHCatalog see Templeton on the HortonWorks website.
The WebHCatalog API supports GET, PUT, and POST requests to execute commands that create or manipulate
data resources on the cluster. You can make these requests directly in a browser by entering the appropriate
URL, but more commonly you will create client applications that use the API by making HTTP requests to the
cluster.
Each request is addressed to the appropriate templeton/v1 endpoint in the HDInsight cluster. When you are
using the default security settings for HDInsight the request must include a user.name parameter for the user
name to be used to authenticate the request. If the request is made interactively over a connection that has no
authentication credentials currently associated with it, you are prompted to provide the password.
When using the REST API programmatically you can associate credentials with the connection before making the
first request.
In addition to the user.name parameter, specific commands require their own parameters based on the
operation being performed. Some commands also require a POST request, which you can generate using a tool
such as cURL. For example, the following POST request can be used to store the results of a HiveQL query in a
file.
POST Request
curl -d user.name=your-user-name
-d execute="select * from hivesampletable;"
-d statusdir="hivejob.output"
-u your-user-name:your-password
-k "https://your-cluster.azurehdinsight.net:563/templeton/v1/hive"
Requests that start a map/reduce job return a response containing a job ID. You can use this job ID to track the
status of long running jobs, as shown in the following GET request.
GET URL
https://your-cluster.azurehdinsight.net:563/templeton/v1/queue
/job_id?user.name=your-user-name
You'll need to enter these commands on a single line, and replace your-user-name, your-password, and your-
cluster with the values for your HDInsight cluster.
Using WebHCat from the .NET SDK for Hadoop
In previous chapters of this guide you have seen how the .NET SDK for Hadoop provides many features that
make working with HDInsight much easier than developing your own custom code. The SDK contains features to
help you interact with storage, Oozie workflows, Ambari monitoring, and HCatalog; and for working with
map/reduce code and scripts.
The SDK also includes the WebHCat client (the WebHCatHttpClient class), which provides a wrapper around the
WebHCatalog REST API. By using the WebHCat client you can implement managed code that executes
commands on a remote HDInsight cluster much more easily than by writing your own code to construct and
execute explicit HTTP requests.
For .NET developers, the SDK for Hadoop is the ultimate toolbox for accomplishing a wide range of tasks.
It makes most interactions with HDInsight much easier than developing your own library of functions.
Figure 1 shows an overview of the WebHCat client within the context of the .NET SDK for Hadoop.

Figure 1
The WebHCat client in the .NET SDK for Hadoop
To use the WebHCat client you must use NuGet to import the Microsoft.Hadoop.WebClient package. You can
do this by executing the following command in the Package Manager PowerShell console.
PowerShell
install-package Microsoft.Hadoop.WebClient
With the package and all its dependencies imported, you can create a custom application that automates
WebHCatalog tasks in the cluster. For example, the following code shows how you can use the
WebHCatHttpClient class to load data that you have already uploaded into Windows Azure blob storage for the
cluster (perhaps by using the WebHDFSClient class) into a table.
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Hadoop.WebHCat.Protocol;

namespace MyWebHCatApp
{
using System;
using System.Threading.Tasks;
using Microsoft.Hadoop.WebHCat.Protocol;

public class Program
{
private const string HadoopUser = "your-user-name";
private const string HadoopUserPassword = "your-password";
private const string Uri
= "https://your-cluster.azurehdinsight.net:563";

public static void Main(string[] args)
{
// call async function to load data
LoadTable().Wait();

Console.WriteLine("\nPress any key to continue.");
Console.ReadKey();
}

private static async Task LoadTable()
{
// Create WebHCatHttpClient connection
var webHCat = new WebHCatHttpClient(new System.Uri(Uri),
HadoopUser, HadoopUserPassword);
webHCat.Timeout = TimeSpan.FromMinutes(5);

// Load data into an existing table
Console.WriteLine("Load data into the 'MyData' table...");
string command
= @"LOAD DATA INPATH '/mydata/data.txt' INTO TABLE MyData;";
var job = await webHCat.CreateHiveJob(command, null, null,
"LoadMyData", null);

// get the Job ID and return, leaving the job running
var result = await job.Content.ReadAsStringAsync();
Console.WriteLine("Job ID: " + result);
}
}
}
For an example of using the WebHDFSClient class to upload a file to Windows Azure blob storage for a cluster,
see the section Using the WebHDFSClient in the Microsoft .NET SDK for Hadoop in Chapter 3 of this guide.
Creating and Using Automated Workflows
Many data processing solutions require the coordinated processing of multiple jobs, often with a conditional
work flow. For example, consider a solution in which data files containing web server log entries are uploaded
each day, and must be parsed to load the data they contain into a Hive table. A workflow to accomplish this
might consist of the following steps:
1. Insert data from the files located in the /uploads folder into the Hive table.
2. Delete the source files, which are no longer required.

This workflow is relatively simple but could be made more complex by adding some conditional tasks; for
example:
1. If there are no files in the /uploads folder, go to step 5.
2. Insert data from the files into the Hive table.
3. Delete the source files, which are no longer required.
4. Send an email message to an operator indicating success, and stop.
5. Send an email message to an operator indicating failure.

Implementing these kinds of workflows is possible in a range of ways. For example, you could:
Use the Oozie framework that is installed with HDInsight. This is a good option when:
You are familiar with the syntax and usage of Oozie.
You are prepared to use a Remote Desktop connection to communicate with the cluster to
administer the processes.
Use the Oozie client in the .NET SDK for Hadoop to execute Oozie workflows. This is a good option
when:
You are familiar with the syntax and usage of Oozie.
You want to execute workflows from within a program running on a client computer.
You are familiar with .NET and prepared to write programs that use the .NET Framework.
Use SQL Server Integration Services (SSIS) or a similar integration framework. This is a good option
when:
You have SQL Server installed and are experienced with writing SSIS workflows.
You want to take advantage of the powerful capabilities of SSIS workflows.
The process for creating SSIS workflows is described in more detail in Scenario 4, BI Integration, and
in Appendix A of this guide.
Use the Cascading abstraction layer software. This is a good choice when:
You want to execute complex data processing workflows written in any language that runs on the
Java virtual machine.
You have complex multi-level workflows that you need to combine into a single task.
You want to control the execution of the map and reduce phases of jobs directly in code.
Use a third party framework such as Hamake or Azkaban. This is a good option when:
You are familiar with these tools.
They offer a capability that is not available in other tools.
Create a custom application or script that executes the tasks as a workflow. This is a good option when:
You need a fairly simple workflow that can be expressed using your chosen programming or
scripting language.
You want to run scripts on a schedule, perhaps driven by Windows Scheduled Tasks.
You are prepared to use a Remote Desktop connection to communicate with the cluster to
administer the processes.

Creating Workflows with Oozie
Oozie is the most commonly used mechanism for workflow development in Hadoop. It is a tool that enables you
to create repeatable, dynamic workflows for tasks to be performed in an HDInsight cluster. The tasks themselves
are specified in a control dependency directed acyclic graph (DAG), which is stored in a Hadoop Process
Language (hPDL) file named workflow.xml in a folder in the HDFS file system on the cluster. For example, the
following hPDL document contains a DAG in which a single step (or action) is defined.
hPDL
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
<start to="hive-node"/>
<action name="hive-node">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
</configuration>
<script>script.q</script>
<param>INPUT_TABLE=HiveSampleTable</param>
<param>OUTPUT=/results/sampledata</param>
</hive>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Hive failed, error message[${wf:errorMessage(
wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
The action itself is a Hive job that is defined in a HiveQL script file named script.q, and has two parameters
named INPUT_TABLE and OUTPUT. The code in script.q is shown in the following example.
HiveQL
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM ${INPUT_TABLE}
The script file is stored in the same folder as the workflow.xml hPDL file, along with a standard file for Hive jobs
named hive-default.xml.
A configuration file named job.properties is then stored on the local file system of the computer on which the
Oozie client tools are installed. This file contains the settings that will be used to execute the job, as shown in
the following example.
job.properties
nameNode=asv://my_container@my_asv_account.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/workflowfiles/
To initiate the workflow the following command is executed on the computer where the Oozie client tools are
installed.
Command Line
oozie job -oozie http://localhost:11000/oozie/ -config c:\scripts\job.properties run
When Oozie starts the workflow it returns a job ID in the format 0000001-123456789123456-oozie-hdp-W. You
can check the status of a job by opening a Remote Desktop connection to the cluster and using a web browser
to navigate to http://localhost:11000/oozie/v0/job/job-id?show=log.
If you are familiar with .NET programming techniques, the easiest way to use Oozie is through a client
application that uses the .NET SDK for Hadoop. This approach also makes it easy to schedule jobs using Windows
utilities or Scheduled Tasks.
Using Oozie from the .NET SDK for Hadoop
The .NET SDK for Hadoop includes an Oozie client (the OozieHttpClient class) that you can use to implement managed code that
initiates and manages Oozie workflows. Figure 2 shows an overview of this class within the context of the .NET
SDK for Hadoop.

Figure 2
The Oozie client in the .NET SDK for Hadoop
To use the Oozie client you must use NuGet to import the Microsoft.Hadoop.WebClient package. You can do
this by executing the following command in the Package Manager PowerShell console.
PowerShell
install-package Microsoft.Hadoop.WebClient
After the package and all its dependencies are installed you can use the OozieClient class to initiate workflow
jobs and check job status, as shown in the following code sample.
C#
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json;
using Microsoft.Hadoop.WebClient.OozieClient;
using Microsoft.Hadoop.WebClient.OozieClient.Contracts;

namespace OozieApp
{
using System;
using System.Threading.Tasks;
using Microsoft.Hadoop.WebClient.OozieClient;
using Microsoft.Hadoop.WebClient.OozieClient.Contracts;

public class Program
{
private const string HadoopUser = "your-user-name";
private const string HadoopUserPassword = "your-password";
private const string Uri =
"https://your-cluster.azurehdinsight.net:563";
private const string Asv =
"asv://your-container@your-account.blob.core.windows.net";

public static void Main(string[] args)
{
// call async function to start workflow
StartWorkflow().Wait();
Console.WriteLine("\nPress any key to continue.");
Console.ReadKey();
}

private static async Task StartWorkflow()
{
OozieHttpClient client = new OozieHttpClient(
new System.Uri(Uri),
HadoopUser,
HadoopUserPassword);

// Submit a new Job
var prop = new OozieJobProperties(
HadoopUser,
Asv,
"jobtrackerhost:9010",
"/workflowfiles/",
string.Empty,
string.Empty);

var newJob = await client.SubmitJob(prop.ToDictionary());
var content = await newJob.Content.ReadAsStringAsync();
Console.WriteLine(content);
Console.WriteLine(newJob.StatusCode);

// Get the job information
var serializer = new JsonSerializer();
dynamic json = serializer.Deserialize(
new JsonTextReader(new StringReader(content)));
string id = json.id;
var jobInfo = await client.GetJobInfo(id);
var jobStatus = await jobInfo.Content.ReadAsStringAsync();

Console.WriteLine("Job Status:");
Console.WriteLine(jobStatus);
}
}
}
You will see a demonstration of using an Oozie workflow in Scenario 2, Extract, Transform, Load, of this guide
and in the associated Hands-on Lab.
Building End-to-End Solutions on HDInsight
Automation and workflows allow you to replace all of the manual interaction for HDInsight and Hadoop with
completely automated end-to-end solutions. For example, you can use the .NET SDK for Hadoop and other tools
described in this chapter to create a client program (a skeleton of which is sketched after this list) that:
Uses an SSIS workflow to connect to the URL of a service that publishes updated data you want to
analyze.
Uses the storage client in the Windows Azure SDK to push the new data into Windows Azure blob
storage.
Creates and provisions a new HDInsight cluster based on this blob storage container using the
PowerShell Cmdlets for HDInsight.
Executes Hive scripts to create the tables that will store the results.
Runs an Oozie workflow that executes Hive, Pig, or custom map/reduce jobs to analyze the data and
summarize the results into the Hive tables.
Uses an SSIS workflow to extract the data into a database and execute a report generator or update a
dashboard.
Deletes the cluster using the PowerShell Cmdlets for HDInsight to minimize ongoing costs.
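The following skeleton suggests how such a client program might be structured. Every method in it is an empty placeholder for one of the techniques described earlier in this guide, and the names are illustrative only; you would replace each placeholder with the corresponding storage, provisioning, job submission, workflow, or data extraction code.
C#
using System;
using System.Threading.Tasks;

// A skeleton only: each method is an empty placeholder for one of the
// techniques described in this guide (storage upload, cluster provisioning,
// Hive table creation, Oozie workflows, and data extraction).
public class EndToEndPipeline
{
    public static void Main()
    {
        RunAsync().Wait();
        Console.WriteLine("Pipeline complete.");
    }

    private static async Task RunAsync()
    {
        await UploadSourceDataAsync();   // for example, the storage client or BlobManager library
        await ProvisionClusterAsync();   // for example, the PowerShell cmdlets for HDInsight
        await CreateHiveTablesAsync();   // for example, a Hive job submitted through WebHCat
        await RunOozieWorkflowAsync();   // for example, the Oozie client in the .NET SDK
        await ExtractResultsAsync();     // for example, an SSIS package or Sqoop
        await DeleteClusterAsync();      // remove the cluster to minimize ongoing costs
    }

    // Placeholder implementations; replace these with the code shown in the
    // relevant sections of this guide.
    private static Task UploadSourceDataAsync() { return Task.FromResult(0); }
    private static Task ProvisionClusterAsync() { return Task.FromResult(0); }
    private static Task CreateHiveTablesAsync() { return Task.FromResult(0); }
    private static Task RunOozieWorkflowAsync() { return Task.FromResult(0); }
    private static Task ExtractResultsAsync() { return Task.FromResult(0); }
    private static Task DeleteClusterAsync() { return Task.FromResult(0); }
}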

The chapters later in this guide describe some real-world scenarios for using HDInsight. The appendices at the
end of this guide demonstrate several of the automation tasks described in this chapter.
Summary
This chapter described techniques that you can use to debug, test, and automate HDInsight jobs; and define
workflows for complex data processing scenarios.
All developers will, at some point, discover errors in their code and need to perform debugging and testing to
resolve these errors. If you are familiar with working in Java you can complement testing by using mock objects
and a test runner framework. This is typically most common when writing custom map/reduce code, though as
you saw in this chapter there are also ways that you can apply simple debugging techniques to solutions that
use map/reduce code, Pig, and Hive.
In addition, by using the tools and techniques described in this chapter, you can implement repeatable complex
data processing solutions that can easily be integrated with business applications or custom management tools.
If you have discovered a question that you can reliably and repeatedly answer using HDInsight, you will need to
find an automated solution that can relieve you of the need to manually perform the processes you probably
used during development.
More Information
For more information about HDInsight, see the Windows Azure HDInsight web page.
For more information about WebHCatalog, see the documentation at
http://people.apache.org/~thejas/templeton_doc_latest/index.html.
For more information about Oozie, see the Apache project site for Oozie at http://oozie.apache.org/.
For more information about the Microsoft .NET SDK for Hadoop, see the CodePlex project site at
http://hadoopsdk.codeplex.com/.


Chapter 7: Managing and Monitoring
HDInsight
This chapter ties up some loose ends by looking at topics related to managing and monitoring HDInsight clusters.
Windows Azure HDInsight makes it very easy to create and set up a cluster. However, behind this simplicity
there are many operations and processes that HDInsight performs just to create the cluster, and many more
involved in executing and monitoring jobs in the cluster.
Challenges
As with any remote service or application, managing and monitoring its operation may appear to be more
difficult than for a locally installed equivalent. However, remote management and monitoring technologies are
widely available, and are an accepted part of almost all administration tasks these days, even with services and
applications running in a remote datacenter. In many cases the extension of these technologies to cloud-hosted
services and applications is almost seamless.
As an inherently remote environment, Windows Azure already has many remote administration features built
in, incorporated into development tools such as Visual Studio, and available as an add-on from third parties; and
there are plenty of tools that allow you to create your own custom administration utilities.
The deceptively simple management portal that you see when you navigate to HDInsight in your web browser
hides the fact that creating a cluster automatically provisions a number of virtual machines that run the Hadoop
core code, provisions a Windows Azure SQL Database to store the metadata for the cluster, and links the cluster
with the Windows Azure blob storage container that holds the data. Many parts of this infrastructure are
accessible remotely and allow you to administer the cluster configuration and operation, perform specific
management tasks, and monitor the cluster behavior.

The following sections of this chapter explore the common management tasks and options in HDInsight.
Administering HDInsight clusters and jobs.
Configuring HDInsight clusters and jobs.
Retaining data and metadata when recreating a cluster.
Monitoring HDInsight clusters and jobs.


Administering HDInsight Clusters and Jobs
In addition to the Remote Desktop connection and the management portal, Hadoop includes portal pages for
specific task areas of a cluster, and you can access these for your HDInsight cluster. For example, you can open a
Remote Desktop connection and use the Map/Reduce Administration page of the job tracker host to view
details of the allocation of tasks in your cluster, and you can use the Map/Reduce History Viewer page of the
head node host to see information about individual jobs that the cluster executed.
Administering HDInsight involves using a range of different tools and approaches. Some are part of the
underlying Hadoop ecosystem while many others are third party tools, SDKs, and (of course) custom scripts.
In terms of managing HDInsight, the most common operations you are likely to carry out are:
Working with the HDInsight management portal to perform basic administration tasks interactively.
Interacting with the HDInsight service APIs to perform tasks such as creating and deleting clusters,
working with the source and output files, and managing jobs. You can use a range of tools, scripts, and
custom utilities to access the HDInsight service management API.
Interacting with Windows Azure using scripts, the Windows Azure PowerShell Cmdlets, or custom
utilities to perform tasks not specific to HDInsight, such as uploading and downloading data.

Using the HDInsight Cluster Management Portal
The primary point of interaction with HDInsight is the web-based cluster management portal. This provides a
range of one-click options for accessing different features of HDInsight.
The main Windows Azure Management Portal contains a section for HDInsight that lists all of your HDInsight
clusters. Selecting a cluster displays a summary of the details of the cluster, and a rudimentary monitoring
section that shows the activity of the cluster over the previous one or four hours. To open the management
portal for a cluster you can either use the Manage link at the bottom of the dashboard window, or navigate
directly to the cluster using the URL
https://your-cluster-name.azurehdinsight.net/. You must sign in using the credentials you provided when you
created the cluster.
In terms of administration tasks, at the time of writing the management portal provides links to:
The Interactive Console where you can execute queries and commands. The interactive console
contains the JavaScript console where you can execute administration tasks such as managing files in
the cluster (listing, deleting, and viewing the files), and executing Hadoop management commands.
A Remote Desktop connection to the name node of the cluster that allows you to view the built-in
Hadoop management websites, interact with the file system, and manage the operating system.
The Monitor Cluster and Job History pages. These are described in more detail in the section
Monitoring HDInsight Clusters and Jobs later in this chapter.
Figure 1 shows the HDInsight cluster management portal main page.


Figure 1
The HDInsight cluster management portal main page

Using the Management Portal Interactive Console
The Interactive Console link on the main HDInsight management portal opens a window where you can select
either Hive or JavaScript. The JavaScript console can be used to execute a range of commands related to
administering HDInsight. For example, you can list current jobs, evaluate expressions, show the contents of
objects, show global variables, and get the results of previous jobs.
To see a list of all of the available JavaScript commands, type help() into the JavaScript
console. To execute an HDFS or Hadoop command in the JavaScript console, precede it
with the hash (#) character.

You can also type commands that are sent directly to Hadoop for execution within the core Hadoop
environment. For example, you can list and copy files, interact with jobs, view information about queues,
manage logging settings, and run other utilities.
Using a Remote Desktop Connection
The Remote Desktop link in the main HDInsight management portal downloads a Remote Desktop Protocol
(RDP) link that you can open to connect to the name node of the cluster. You must provide the administrator
user name and password to connect, and after successful authentication you can work with the server in the
same way as you would with a local computer.
On the desktop of the name node computer is a link named Hadoop Command Line that opens a command
window on the main folder where Hadoop is installed. You can then run the tools and utilities included with
HDInsight, send commands to Hadoop, and submit jobs to the cluster.
Figure 2 shows the Hadoop command line in a Remote Desktop session.

Figure 2
The Hadoop command line in a Remote Desktop session
For a full list and description of the Hadoop management tools and utilities, see the Commands Manual on the
Apache Hadoop website.
Using the HDInsight Service Management API
Windows Azure exposes a series of management APIs that you can use to interact with most parts of the remote
infrastructure. The service management API for Windows Azure HDInsight allows you to remotely execute many
of the tasks that you can carry out through the management portal web interface, and several other tasks that
are not available in the web interface.
These tasks include the following:
Listing, deleting, and creating HDInsight clusters.
Submitting jobs to HDInsight.
Executing Hadoop management tools and utilities to create archives, copy files, set logging levels,
rebalance the cluster, and more.
Managing workflows that are executed by Oozie.
Managing catalog data used by HCatalog.
Accessing monitoring information from Ambari.

See the section Executing HDInsight Jobs Remotely in Chapter 6 for information on how to submit jobs
remotely to an HDInsight cluster. See the section Monitoring HDInsight Clusters and Jobs later in this chapter
for more details of how you can monitor an HDInsight cluster.
HDInsight includes WebHCat, which is sometimes referred to by the original project name Templeton.
WebHCat provides access to the HDInsight service management API on port 563 of the cluster's name node, and
exposes it as a REST interface. Therefore, you must submit appropriately formatted HTTPS requests to WebHCat
to carry out management tasks. For example, to access Oozie on the HDInsight cluster you could send the
following GET request:
https://your-cluster.azurehdinsight.net:563/oozie/v1/admin/
To view the Hadoop job queue you could send the following GET request:
https://your-cluster.azurehdinsight.net:563/templeton/v1/queue/
Creating REST URLs and Commands Directly
It's possible to execute both GET and POST commands by using a suitable utility or web browser plugin. For
example, you might decide to use:
The Simple Windows Azure REST API Sample Tool available from Microsoft.
The Postman add-in for the Google Chrome browser.
The RESTClient add-in for the Mozilla Firefox browser.
You can also use a utility such as cURL, which is available for a wide range of platforms including Windows, to help
you construct REST calls directly.
Using Code to Access the REST APIs
You can make calls to WebHCat and the service management API from within your own programs and scripts.
Using programs or scripts, perhaps with optional or mandatory parameters that are supplied at runtime, can
simplify management tasks and make them more repeatable with less chance of error. Programs or scripts can
be written using the .NET Framework with any .NET language, in PowerShell, or using any of the REST client
libraries that are available.
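For example, the following sketch uses the HttpClient class in the .NET Framework to query the WebHCat job queue endpoint directly, without using the SDK. The cluster name and credentials are placeholders that you would replace with your own values.
C#
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Queries the WebHCat job queue on an HDInsight cluster by making an HTTPS
// request directly, without using the .NET SDK for Hadoop.
public class WebHCatQueueQuery
{
    public static void Main()
    {
        QueryJobQueueAsync().Wait();
    }

    private static async Task QueryJobQueueAsync()
    {
        var handler = new HttpClientHandler
        {
            // The cluster credentials are placeholders.
            Credentials = new NetworkCredential("your-user-name", "your-password")
        };

        using (var client = new HttpClient(handler))
        {
            var uri = "https://your-cluster.azurehdinsight.net:563/"
                    + "templeton/v1/queue?user.name=your-user-name";

            var response = await client.GetAsync(uri);
            string content = await response.Content.ReadAsStringAsync();

            Console.WriteLine(response.StatusCode);
            Console.WriteLine(content);
        }
    }
}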
The .NET SDK for Hadoop provides several tools and utilities that make it much easier to work with
HDInsight; not only for administration tasks but for developing applications as well.
As discussed in previous chapters, Microsoft provides the .NET SDK For Hadoop, which (amongst other things) is
a library of tools that make it easy to interact with the service management API. Figure 3 shows an overview of
the integration of the .NET SDK for Hadoop with HDInsight and the underlying Hadoop environment. At the time
of writing the programmatic cluster management feature was not complete.

Figure 3
Overview of the .NET SDK for Hadoop and its integration with HDInsight
The SDK includes the WebClient library that is designed to make it easy to interact with HDInsight. This library
contains client libraries for use with the HDInsight administration interfaces for Oozie, HDFS, Ambari, and
WebHCat. The WebHDFS client library works with files in HDFS and in Windows Azure blob storage. The
WebHCat client library manages the scheduling and execution of jobs in an HDInsight cluster, as well as providing
access to the HCatalog metadata. It also includes tools to make it easier to manipulate the results of a query
using the familiar Linq programming style, and a set of helpers that simplify writing map/reduce components in
.NET languages.

See the section Using the Hadoop Streaming Interface in Chapter 4 for more information about writing
map/reduce components in .NET languages.
You can download and install the .NET SDK for Hadoop using the NuGet Package Manager in Visual Studio. You
simply execute the following commands to add the libraries and references to the current Visual Studio project:
Visual Studio Command Line
install-package Microsoft.Hadoop.MapReduce
install-package Microsoft.Hadoop.Hive
install-package Microsoft.Hadoop.WebClient
install-package Microsoft.WindowsAzure.Management.HDInsight

After installation you can use the classes in the SDK in your code to administer an HDInsight cluster; for example,
you can:
Use the methods of the WebHDFSClient class to set properties of HDFS; create, delete, open, write to,
and get the status of files; and create, delete, rename and access directories.
Use the methods of the WebHCatHttpClient class to create, delete, and get the status of Hive, Pig, and
streaming map/reduce jobs.
Use the methods of the OozieHttpClient class to submit, start, kill, suspend, resume, and monitor Oozie
jobs.
Use the methods of the AmbariClient class to retrieve basic information about the cluster, statistics for
map/reduce jobs, and storage bandwidth usage.
Use the methods of the ClusterProvisioningClient class to create, list, and delete clusters.


The SDK also includes a set of PowerShell cmdlets for provisioning, listing, deleting, and managing HDInsight
clusters. You can use these in your scripts to perform tasks through the HDInsight service management API.

For more information, see Using the Hadoop .NET SDK with the HDInsight Service on the Windows Azure
website and the Documentation page of the .NET SDK for Hadoop on Codeplex.

Interacting with Windows Azure Blob Storage
Many administrative tasks may require access to the Azure Storage Vault (ASV) in Windows Azure blob storage
that HDInsight uses. For example, you may need to upload or download files, or manage the files stored there.
You can use the JavaScript console, the Hadoop command line, the Windows Azure PowerShell Cmdlets, any of
the many standalone utilities available for accessing Windows Azure storage; or you can use a client library that
makes it easier to access storage programmatically if you want to automate the operations.
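For example, the following sketch uses the Windows Azure storage client library (available through NuGet) to upload a local file into the blob container that the cluster uses as its ASV storage. The account details, container name, and file paths are placeholders that you would replace with your own values.
C#
using System;
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Uploads a local file into the blob container that the HDInsight cluster
// uses as its ASV storage, so that it appears as a file in the cluster's
// HDFS-compatible file system.
public class AsvUploader
{
    public static void Main()
    {
        // The account name, key, container, and paths are placeholders.
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=your-account;AccountKey=your-key");

        var blobClient = account.CreateCloudBlobClient();
        var container = blobClient.GetContainerReference("your-container");
        var blob = container.GetBlockBlobReference("uploads/data.txt");

        using (var stream = File.OpenRead(@"C:\data\data.txt"))
        {
            blob.UploadFromStream(stream);
        }

        Console.WriteLine("Upload complete.");
    }
}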
You can also use Visual Studio's Server Explorer to access a Windows Azure storage account, though other third
party tools such as Windows Azure Storage Explorer provide additional functionality and may be a better choice
for interactive management of the HDInsight ASV.
If you need to include access to the ASV in a script or program you can use the features of the .NET SDK For
Hadoop, which is described in the previous section of this chapter. You can also use the sample BlobManager
library package provided with this guide as a starting point to build a custom upload utility. This package allows
you to use up to 16 threads to upload the data in parallel, break down the files into 4 MB blocks for optimization
in HDFS, and merge blocks of less than 4 MB into a single file. You can also integrate the package with tools and
frameworks such as SQL Server Integration Services.
To obtain the BlobManager library, download the Hands-on Labs associated with this guide from
http://wag.codeplex.com/releases/view/103405. The source code for the utility is in the Lab 4 - Enterprise BI
Integration\Labfiles\BlobUploader subfolder of the examples.
For more information about working with Windows Azure blob storage and HDInsight ASV, see How to Upload
Data to HDInsight on the Windows Azure website, and Chapter 3 of this guide.
Configuring HDInsight Clusters and Jobs
Setting up and using a Windows Azure HDInsight cluster is easy using the Windows Azure Management Portal.
There are just a few options to set, such as the cluster name and the storage account, and the cluster can be
used without requiring any other setup or configuration.
However, the Hadoop configuration is extremely flexible and wide ranging, and can be used to ensure that your
solution is reliable, secure, and efficient. The configuration mechanism allows configuration settings (or
configurable properties) to be specified at the cluster level, at the individual node (site) level, and for individual
jobs that are submitted to the cluster for processing. There are also configuration properties for each of the
tools (or daemons) such as HDFS, the map/reduce engine, Hive, Oozie, and more.
You can adjust the configuration settings after you have created a cluster in a number of ways:
You can use the SET statement to set the value of a property in a script, which changes the value just for
that job at runtime.
You can add parameters that set the values of properties at runtime when you execute a job at the
command line, which changes the value just for that job.
You can edit the service configuration files through a Remote Desktop session to the name node of the
cluster to permanently change the values of properties for the cluster.

In most cases you should consider setting configuration property values at runtime using the SET
statement or a command parameter. This allows you to specify the most appropriate settings for each job, such
as compression options or other query optimizations. However, sometimes you may need to edit the
configuration files to set the properties for all jobs.

The following sections of this chapter describe these techniques, and provide information on where the settings
are located in the configuration files. The important points to note are:
If you use the SET statement in a script or a command parameter to change settings at runtime, the
setting will apply only for the duration of that job. This is typically appropriate for settings such as
compression or query optimization because each job may require different settings.
If you change the values in the configuration files, the settings are applied to all of the jobs you execute,
and so you must ensure that these settings are appropriate for all subsequent jobs, or edit them as
required for each job. Where jobs require different configuration settings you should avoid editing the
configuration files if possible.
Changes you make to configuration files are lost if you delete and recreate the cluster. The new cluster
will always have the default HDInsight configuration settings. You cannot simply upload the old
configuration files to the new cluster because details of the cluster, such as the IP addresses of the
nodes and the database configuration, will differ each time.
In some cases you can maintain sets of edited configuration files in the cluster and choose which is
applied to each job at runtime. This may prove a useful way to set properties that cannot be set
dynamically at runtime using the SET statement or a command parameter.

Setting Properties at Runtime
In general you should consider setting the values of configuration properties at runtime when you submit a job
to the cluster. In a Hive query you can use the SET statement to set the value of a property. For example, the
following statements at the start of a query set the compression options for that job.
HiveQL
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=
org.apache.hadoop.io.compress.GzipCodec;
...

Alternatively, when executing a Hadoop command, you can use the -D parameter to set property values, as
shown here.
Command Line
hadoop [COMMAND] -D property=value

The space between the -D and the property name can be omitted.

You can also specify that the cluster should use an alternative set of configuration files for this job, which allows
you to maintain different groups of property settings and avoids the need to add them to every command line.
In addition, you can execute a range of user or administrative commands such as setting the logging level.

The general syntax of a command is as follows (all of the parameters are optional):
Command Line
hadoop [--config confdir] [COMMAND] [GENERIC_OPTS] [COMMAND_OPTS]
confdir specifies the path of the folder containing alternative configuration settings.
COMMAND is the job command to execute, such as a map/reduce query.
GENERIC_OPTS are option settings specific to this job, prefixed with a single hyphen. In addition to the -D
parameter to set a property you can specify the files, archives, JAR files, or other parameters required to
execute the job.
COMMAND_OPTS are user or administrative commands to execute.

For more details of the options you can use in a job command, see the Apache Hadoop Commands Manual. It's
also possible to access and manipulate configuration information using code in conjunction with the
Configuration class. For more information see the Apache Hadoop reference guide to the Configuration class.

Setting Properties by Editing the Configuration Files
The management portal provides a link that connects to the name node of your cluster over Remote Desktop
Protocol (RDP). You can use this approach to access and edit the Hadoop configuration files within your
HDInsight cluster. Figure 4 shows a schematic view of the configuration files for the cluster, a site within the
cluster, and the individual daemons.


Figure 4
Schematic view of the configuration options for HDInsight
For a full list of all of the configuration settings, and descriptions of their usage, go to the Current
Documentation page of the Apache Hadoop site and use the links in the Configuration section of the navigation
bar to select the appropriate configuration file.
HDInsight Cluster and Security Configuration
HDInsight contains three main configuration files that determine the behavior of the cluster as a whole, as
shown in the following table. With the exception of the first item in the table, all of the configuration files are in
the folder C:\apps\dist\hadoop-[version]-SNAPSHOT\conf.

File Description
C:\Config\[your-cluster-id]
.IsotopeHeadNode_IN_0.1.xml
This file contains the network topology information and runtime details of the
HDInsight cluster. You should not edit this file.
hadoop-policy.xml This file contains the authorization settings for the features of the cluster such as
connecting to the nodes of the cluster, submitting jobs, and performing
administrative tasks. By default each operation allows any user (*). You can
replace this with a comma-delimited list of permitted user names or user group
names.
mapred-queue-acls.xml This file contains the authorization settings for the map/reduce queues. If you
add the mapred.acls.enabled property with the value true to the configuration
file conf\mapred-site.xml you can use the queue settings to specify a comma-
delimited list of user names or user group names that can access the job
submission queue and the administration queue.
capacity-scheduler.xml This file contains the scheduling settings for tasks that Hadoop will execute. You can change settings that specify a wide range of factors that control how users' jobs are queued, prioritized, and managed.
The IsotopeHeadNode file defines the configuration of the cluster, and you should avoid changing any values in
it. The next two files shown in the table define the authorization rules for the cluster. You can specify the list of
user account names or user account group names for a wide range of operations by editing the hadoop-
policy.xml file. For example, you can control which users or groups can submit jobs by setting the following
property:
XML
<property>
<name>security.client.protocol.acl</name>
<value>username1,username2 groupname1,groupname2</value>
</property>
The ACL is a comma-separated list of user names and a comma-separated list of group names, with the two lists
separated by a space. A special value of "*" means all users are allowed.
To control which users and groups can access the default and the administration queues, set the
mapred.queue.default.acl-submit-job and mapred.queue.default.acl-administer-jobs properties in the
mapred-queue-acls.xml configuration file. For example, you can control which users or groups can access the
administration queue to view job details, and modify a job's priority, by setting the following property:
XML
<property>
<name>mapred.queue.default.acl-administer-jobs</name>
<value>username1,username2 groupname1,groupname2</value>
</property>
As with the hadoop-policy.xml file, the ACL is a comma-separated list of user names and a comma-separated list
of group names, with the two lists separated by a space. A special value of "*" means all users are allowed, and a
single space means no users are allowed except the job owner.
Note that you must add the mapred.acls.enabled property to the conf\mapred-site.xml configuration file and set it to true in order to enable the settings in the mapred-queue-acls.xml configuration file. You can also override the setting for administrators by adding the mapreduce.cluster.administrators property to the conf\mapred-site.xml configuration file and specifying the ACLs.
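For example, a minimal addition to the conf\mapred-site.xml file might look like the following sketch, in which the user and group names are placeholders for your own accounts.
XML
<property>
<name>mapred.acls.enabled</name>
<value>true</value>
</property>
<property>
<name>mapreduce.cluster.administrators</name>
<value>adminuser1,adminuser2 admingroup1</value>
</property>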
The last file listed in the previous table, capacity-scheduler.xml, can be used to manage the allocation of jobs submitted by users, and to allow higher-priority jobs to run as soon as possible. As a simple example, the following settings limit the maximum number of active tasks per queue to 100, and the maximum number of active tasks per user to 10.
XML
<property>
<name>mapred.capacity-scheduler.default-maximum-active-tasks-per-queue</name>
<value>100</value>
</property>
<property>
<name>mapred.capacity-scheduler.default-maximum-active-tasks-per-user</name>
<value>10</value>
</property>

HDInsight Site Configuration
Hadoop includes a set of default XML configuration files for the site (each node in the cluster) that are read-only,
and act as the templates for the site-specific configuration files. The default files contain a description for each
of the property settings. The main default files are shown in the following table.
File Description
C:\apps\dist\hadoop-[version]-SNAPSHOT\src\core\core-default.xml
This file contains the global property settings for Hadoop features including logging, the file system I/O settings, inter-site and client connectivity, the web interfaces, the network topology, and authentication for the web consoles.
C:\apps\dist\hadoop-[version]-SNAPSHOT\src\hdfs\hdfs-default.xml
This file contains the property settings for HDFS such as the IP addresses of the cluster members, the replication settings, authentication information, and directory folder locations.
C:\apps\dist\hadoop-[version]-SNAPSHOT\src\mapred\mapred-default.xml
This file contains property settings that control the behavior of the map/reduce engine. This includes the job tracker and task tracker settings, the memory configuration, data compression, processing behavior and limits, authentication settings, and notification details.
C:\apps\dist\oozie-[version]\oozie-[version]\conf\oozie-default.xml
This file contains property settings that allow you to specify the behavior of Oozie such as the purging of old workflow jobs, the queue size and thread count, concurrency, throttling, timeouts, the database connection details for storing metadata, and much more.

In most cases you should set the same configuration property values on every server in the cluster. This is
particularly true in HDInsight where all of the cluster members are equivalent in virtual machine size, power,
resources, and processing speed. Hadoop includes a rudimentary replication mechanism that can synchronize the
configuration across all the servers in a cluster. However, if you set up an on-premises Hadoop cluster using different types of servers for the data nodes, you may need to edit the configuration on individual servers.

You should not edit these default files. Instead, you edit the equivalent files located in the C:\apps\dist\hadoop-
[version]-SNAPSHOT\conf\ folder of each site: core-site.xml, hdfs-site.xml, and mapred-site.xml. These site-
specific files do not contain all of the property settings that are in the default files; they merely override the default settings for that site. If you want to override a property setting inherited from the default configuration that is not included in the site-specific configuration file, you must copy the element into the site-specific file.
For example, to change the setting for the amount of memory allocated to the buffer that is used for sorting items during a map/reduce operation, you would need to copy the following element from the mapred-default.xml file (in C:\apps\dist\hadoop-[version]-SNAPSHOT\src\mapred\) into the mapred-site.xml file (in C:\apps\dist\hadoop-[version]-SNAPSHOT\conf\):
XML
<property>
<name>io.sort.mb</name>
<value>150</value>
<description>The total amount of buffer memory to use
while sorting files, in megabytes.</description>
</property>
Some of the settings that may be useful to edit are described in the section Tuning Query Performance in Chapter 4 of this guide.

You can see the values of all the properties used by a job in the Job Tracker portal. Open a Remote Desktop
connection to the cluster from the main Windows Azure HDInsight portal and choose the desktop link named
Hadoop MapReduce Status. Scroll to the Retired Jobs section of the Map/Reduce Administration page and
select the job ID. In the History Viewer page for the job select the JobConf link near the top of the page.

HDInsight Tools Configuration
Most of the tools and daemons that are included in HDInsight have configuration files that determine their
behavior. The following table lists some of these files. All of these are in subfolders of the C:\apps\dist\ folder.
File Description
hive-[version]\conf\hive-site.xml
This file contains the settings for the Hive query environment. You can use it to specify most features of Hive such as the location of temporary test and data files, concurrency support, the maximum number of reducer instances, task progress tracking, and more.
templeton-[version]\conf\templeton-site.xml
This file contains the settings for WebHCat (also referred to as Templeton, hence the file name), which is a REST API for metadata management and remote job submission. Most of the settings are the location of the services and code libraries used by WebHCat; however, you can use this file to set the port number that WebHCat will listen on.
sqoop-[version]\conf\sqoop-site.xml
This file contains the settings for the Sqoop data transfer utility. You can use it to specify connection manager factories and plugins that Sqoop will use to connect to data stores, manage features of these connections, and specify the metastore that Sqoop will use to store its metadata.
oozie-[version]\conf\oozie-site.xml
This file contains settings for the Oozie workflow scheduler. You can use it to specify authorization details, configure the queues, and provide information about the metadata store Oozie will use.
oozie-[version]\oozie-server\conf\web.xml
This file contains settings that specify the servlets Oozie will use to respond to web requests. It also contains the settings for other features such as session behavior, MIME types, default documents, and more.
For more information about these configuration files go to the projects list page of the Apache website and use
the index to find the tool or project you are interested in, then follow the Project Website link in the main
window. For example, the project website for Hive is http://hive.apache.org. Most of the project sites have a
Documentation link in the navigation bar.
Retaining Data and Metadata for a Cluster
Windows Azure HDInsight is a pay-for-what-you-use service. When you just want to use a cluster for
experimentation, or to investigate interesting questions against some data, you will usually tear down the
cluster when you have the answer (or when you discover that there is no interesting question for that particular
data) in order to minimize cost.
However, there may be scenarios where you want to be able to release a cluster and restore it at a later time.
For example, you may want to run a specific set of processes on a regular basis, such as once a month, and just
upload new data for that month while reusing all the Hive tables, indexes, UDFs, and other objects you define in
the cluster. Alternatively, you might want to scale out by adding more cluster nodes. The only way to do this is
to delete the cluster and recreate it with the number of nodes you now require.
The data you use in an HDInsight cluster is stored in Windows Azure blob storage, and so it can be persisted
when you tear down a cluster. When you create a new cluster you can specify an existing storage account that
holds your cluster data as the source for the new cluster. This means that you immediately have access to the
original data, and the data remains available after the cluster has been deleted.
However, the metadata for the cluster, which includes the Hive table and index definitions, is stored in a SQL
Server database that is automatically provisioned when you create a new cluster, and destroyed when the
cluster is removed from HDInsight. If you want to be able to delete and then recreate a cluster you have two
choices:
You can allow HDInsight to destroy the metadata and remove the SQL Database when you delete the
cluster, then run scripts that recreate the metadata in a new cluster. These scripts are the same as you
used to create the Hive tables, indexes, and other metadata in the original cluster.
You can configure the cluster to use a separate SQL Database instance that you can then retain when
the cluster is deleted. After you recreate the cluster based on the same blob storage container in an
existing storage account, you configure the new cluster to use the existing SQL Database containing the
metadata.

The first of these options is simpler, and means that you do not need to preconfigure the cluster or change the
configuration of the existing cluster, although you must create the original cluster over an existing blob
container in your Windows Azure storage account by choosing the Advanced Create option. However, you must
have access to all of the scripts (you can combine them into one script) to recreate all of the metadata. The only
real issue may be the time it takes to execute the script if the metadata is very complex.
The second of these options, described in more detail in the next section of this chapter, is more complicated, but it may be a good choice if you have very large volumes of metadata.
Windows Azure storage automatically maintains replicas of the data in your blobs, and
SQL Database automatically maintains copies of the data it contains, so your data will not
be lost should a disaster occur at a datacenter. However, this does not protect against
accidental or malicious damage to the data because the changes will be replicated
automatically. If you need to keep a backup for disaster recovery you can copy the blob
data to another datacenter storage account or to an on-premises location, and export a
backup from the SQL Database that stores the metadata as a file you can store in another
storage account or an on-premises location.
Configuring a Cluster to Support Redeployment
To maintain the metadata in a cluster, and to be able to access it in a new cluster, you can change the location of
the metadata storage by following these steps:
Open a Remote Desktop connection to your HDInsight cluster and obtain its public IP address from the
following section of the configuration file named your-cluster-id.IsotopeHeadNode_IN_0.1.xml (located
in the c:\apps\dist folder):
XML
...
<Instances>
<Instance id="IsotopeHeadNode_IN_0" address="cluster-ip-address">
Create a new Windows Azure SQL Database instance, and add the cluster's IP address to the list of IP addresses that are permitted to access the server.
Open the query window in SQL Database and execute the script HiveMetaStore.sql that is provided in
the Hive distribution folder of HDInsight (C:\apps\dist\hive-version) to create the required tables in the
database.
In the Remote Desktop connection to your HDInsight cluster locate the file named hive-site.xml in the
C:\apps\dist\hive-version\conf folder, and edit the following properties to specify the connection details
for the SQL Database instance.
XML
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:sqlserver://
your-server-name.database.windows.net: 1433;
database=your-db-name; encrypt=true;
trustServerCertificate=true; create=false
</value>
</property>
...
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>your-username@your-server-name</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>your-server-password</value>
</property>
The value of each property must be all on one line with no spaces. We had to break them up here due
to page width limitations.
In the Remote Desktop window open the Services application from the Administrative Tools menu and
restart the Apache Hadoop metastore service.

After you restart the service HDInsight will store all of the metadata in your database, and it will not be deleted
or recreated each time (the connection string in the javax.jdo.option.ConnectionURL property includes the
instruction create=false).
This means that Hive table, partition, and index definitions are available in a new cluster that uses this database
as the metastore. If an existing cluster fails, or if you close it down and want to recreate it later, you do not need
to rebuild all of the metadata.
If you want to preserve the results of Hive queries you must ensure that they use the
EXTERNAL and LOCATION keywords when you create the tables so that the data is not
removed when the tables are deleted. See the section Managing Hive Table Data Location
and Lifetime in Chapter 4 for more information.
Automating Cluster Deletion and Provisioning
As you saw earlier in this chapter, you can use scripts, utilities, and custom .NET applications to interact with the
HDInsight service management API. The API includes operations that list, delete, and create clusters, and these
can be used to automate cluster deletion and provisioning operations.
For example, you could create a script that removes all of the clusters by iterating over the list exposed by the
API and deleting each one in turn. As long as metadata storage has been redirected to your own separate
Windows Azure SQL Database you will be able to recreate any clusters when you need to use them, but reduce
your costs in the meantime.
Alternatively, to provide disaster recovery, you could write a program that runs on a schedule to copy all of the
data from the Windows Azure blob storage used by the cluster to another location (including an on-premises
location if that is appropriate), and then export the SQL Database containing the metadata and save this backup
file in another location. Just be aware of the data transfer costs that this will incur, especially if you have a great deal of data in your cluster's blob storage.
Scripts can be copied to the cluster through a Remote Desktop connection. However, keep in mind that
these will not be available after you recreate the cluster.
Monitoring HDInsight Clusters and Jobs
There are several ways to monitor an HDInsight cluster and its operation:
Use the HDInsight cluster management portal.
Access the monitoring system directly.
Access the Hadoop monitoring pages.
Access the Hadoop log files directly.

The following sections of this chapter provide more information about each of these options.
Using the HDInsight Cluster Management Portal
After you connect to an HDInsight cluster management portal you can open the Monitor Cluster page. This page
displays a selection of charts that show the metrics for the cluster including the number of mappers and
reducers, the queue size, and the number of jobs. It also shows whether the name node and the job tracker are
working correctly, and the health state of each node in the cluster.
You can drill down into each of the metrics and change the timescale for the chart from the default of the previous
30 minutes up to the previous two weeks, as shown in Figure 5.


Figure 5
Drilling down into the job history in the Monitor Cluster page

Viewing Job History
The main HDInsight management portal page also allows you to access the Job History page. This page displays
a list of previous and current jobs. It shows the command line that started the job, the start and end times, and
the status for each job. You can select a job to see more details, or download the entire job history for
examination in an application such as Microsoft Excel.
For more information about using the Windows Azure HDInsight management portal, see How to Monitor
HDInsight Service.
Accessing the Monitoring System Directly
The Monitor Cluster page in the Windows Azure HDInsight management portal uses the Ambari framework to
obtain the data from the cluster. You can access Ambari directly using a client tool that can construct and send HTTPS requests to the cluster's REST endpoint on port 563 to perform your own queries and obtain more details. For example, you can connect to the Ambari monitoring interface using the following URL:
https://your-cluster.azurehdinsight.net:563/ambari/api/monitoring
Some of the client tools suitable for making these kinds of requests are listed in the section Creating REST URLs
and Commands Directly earlier in this chapter. The .NET SDK for Hadoop also includes a web client that you can use to access Ambari; it is described in the section Using Code to Access the REST APIs earlier in this chapter.
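As a rough sketch of this technique (it assumes that the cluster accepts HTTP Basic authentication with the credentials you specified when provisioning it, and the cluster name and password shown are placeholders), the following Java code issues a GET request to the monitoring URL shown above and prints the response.
Java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.PasswordAuthentication;
import java.net.URL;

public class AmbariMonitoringClient {
  public static void main(String[] args) throws Exception {
    // Placeholder credentials - replace with your own cluster user name and password.
    final String userName = "admin";
    final String password = "your-password";

    // Supply the cluster credentials for HTTP Basic authentication.
    Authenticator.setDefault(new Authenticator() {
      protected PasswordAuthentication getPasswordAuthentication() {
        return new PasswordAuthentication(userName, password.toCharArray());
      }
    });

    // Placeholder cluster name - replace with your own cluster.
    URL url = new URL("https://your-cluster.azurehdinsight.net:563/ambari/api/monitoring");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setRequestMethod("GET");

    // Read and display the response returned by Ambari.
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(connection.getInputStream()));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}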
For information about using Ambari, see the Ambari API Reference on GitHub.
Accessing the Hadoop Monitoring Pages
The section Accessing Performance Information in Chapter 4 described how to access and use the Hadoop
built-in websites to determine the behavior of a cluster in order to maximize performance. Several sections of
these websites can be used to obtain more general monitoring information such as physical operations carried
out by the job scheduler and the individual nodes in the cluster, the configuration settings applied to each job,
and the time taken for each stage of the jobs.
The easiest way to access the Hadoop status and monitoring websites is to open a Remote Desktop connection to the cluster from the main Windows Azure HDInsight portal and use the links named Hadoop MapReduce Status and Hadoop Name Node Status located on the desktop of the remote virtual server. The Job Tracker portal is accessed on port 50030 of your cluster's head node. The Health and Status Monitor is accessed on port 50070 of your cluster's head node.
Accessing Hadoop Log Files Directly
With a Remote Desktop connection open to the name node of your cluster, you can browse the file system and
examine the log files that Hadoop generates. By default the job history files (including the stdout and stderr files) are written to the central _logs\history folder of the Hadoop installation, and to the user's job output folder.
These log files can be accessed from the Hadoop command line using the commands shown here.
Command Line
$ bin/hadoop job -history <output-dir>
$ bin/hadoop job -history all <output-dir>
where <output-dir> is the folder containing the log files.
If required, you can change the folder where log files are created by setting the hadoop.job.history.location and
hadoop.job.history.user.location properties in the mapred-site.xml configuration file, or from the command
line by using the -D parameter.

You can prevent any log files from being written to the user folder by setting the
hadoop.job.history.user.location property value to none.
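For example, the following sketch of an addition to the mapred-site.xml file redirects the central job history files to a custom folder (the path shown is a placeholder) and suppresses the per-user copies.
XML
<property>
<name>hadoop.job.history.location</name>
<value>/history/jobhistory</value>
</property>
<property>
<name>hadoop.job.history.user.location</name>
<value>none</value>
</property>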

Changing the Logging Parameters
Hadoop uses the Apache log4j mechanism in conjunction with the Apache Commons Logging framework to
generate log files. The configuration for this mechanism can be found in the file log4j.properties in the conf\
folder. You can edit this file to specify non-default log formats and other logging settings if required.
Using Hadoop Metrics
Various parts of the Hadoop code expose metrics information that you can collect and use to analyze the
performance and health of a cluster. There are four contexts for the information: the Java virtual machine, the distributed file system, the map/reduce core system, and the remote procedure call (rpc) context. Each can be configured
separately to save metrics into files that you can examine.
For more information, see Hadoop Metrics in the Cloudera blog and Package org.apache.hadoop.metrics2 on
the Apache website.
Summary
This chapter described the common tasks and operations you are likely to carry out when administering a
Windows Azure HDInsight cluster. It covered a range of management and monitoring tasks including interacting with the cluster using a variety of tools and techniques, setting configuration properties, recreating a cluster and its data, and monitoring the cluster and the jobs it executes.
HDInsight supports a range of techniques for remote administration and monitoring. These include using the
HDInsight management portal pages, accessing the servers directly using a Remote Desktop connection, and
interacting with the service management APIs that HDInsight exposes.
This chapter does not attempt to cover every management option that is available, every configuration setting,
or every feature of the APIs and the tools. However, it does provide an overview of the possibilities and
techniques that you can use, and will help you to familiarize yourself with the tasks that you might need to carry
out when using HDInsight.
More Information
For more information about HDInsight, see the Windows Azure HDInsight web page.
For more information about the Microsoft .NET SDK for Hadoop, see the Codeplex project site.



Scenario 1: Iterative Data Exploration
This chapter marks a change in style for the guide. To supplement the general information and guidance in the
previous chapters, this and the following three chapters describe a series of practical scenarios for using a Big
Data solution such as HDInsight. Each scenario focuses on a fictitious organization with different requirements, which can be fulfilled by using HDInsight. The scenarios map to the four architectural approaches described in Chapter 2 of this guide:
As a standalone data analysis experimentation and visualization tool.
As a data transfer, data cleansing, or ETL mechanism.
As a basic data warehouse or commodity storage mechanism.
Integration with an enterprise data warehouse and BI system.

In this first scenario, which maps to the standalone data analysis experimentation and visualization approach,
HDInsight is used to perform some basic analysis of data from Twitter. Social media analysis is a common Big
Data use case, and this scenario demonstrates how to extract information from semi-structured data in the form
of tweets. However, the goal of this scenario is not to provide a comprehensive guide to analyzing data from
Twitter, but rather to show an example of iterative data exploration with HDInsight and demonstrate some of
the techniques discussed in this guide. Specifically, this scenario demonstrates how to:
Identify a source of data and define the analytical goals.
Process data iteratively by using custom map/reduce code, Pig scripts, and Hive.
Review data processing results to see if they are both valid and useful.
Refine the solution to make it repeatable, and integrate it with the company's existing BI systems.


This chapter describes the scenario in general, and the solutions applied to meet the goals
of the scenario. If you would like to work through the solution to this scenario yourself you
can download the files and a step-by-step guide. These are included in the associated
Hands-on Labs available from http://wag.codeplex.com/releases/view/103405.
Finding Insights in Data
In Chapter 1 of this guide you saw that one of the typical uses of a Big Data solution such as HDInsight is to
explore data that you already have, or data you collect speculatively, to see if it can provide insights into
information that you can use within your organization. Chapter 1 used the question flow shown in Figure 1 as an
example of how you might start with a guess based on intuition, and progress towards a repeatable solution
that you can incorporate into your existing BI systems. Or, perhaps, to discover that there is no interesting
information in the data, but the cost of discovering this has been minimized by using a pay-for-what-you-use
mechanism that you can set up and then tear down again very quickly and easily.

Figure 1
The iterative exploration cycle for finding insights in data
Introduction to Blue Yonder Airlines
This scenario is based on a fictitious company named Blue Yonder Airlines. The company is an airline serving
passengers in the USA, and operates flights from its home hub at JFK airport in New York to Sea-Tac airport in
Seattle and LAX in Los Angeles. The company has a customer loyalty program named Blue Yonder Points.
Some months ago, the CEO of Blue Yonder Airlines was talking to a colleague who mentioned that he had seen
tweets that were critical of the company's seating policy. The CEO decided that the company should investigate
this possible source of valuable customer feedback, and instructed her customer service department to start
using Twitter as a means of communicating with its customers.
Customers send tweets to @BlueYonderAirlines and use the standard Twitter convention of including hashtags
to denote key terms in their messages. In order to provide a basis for analyzing these tweets, the CEO also asked
the BI team to start collecting any that mention @BlueYonderAirlines.
Analytical Goals and Data Sources
Initially the plan is simply to collect enough data to begin exploring the information it contains. To determine if
the results are both useful and valid, the team must collect enough data to generate a statistically valid result.
However, they do not want to invest significant resources and time at this point, and so use a manual interactive
process for collecting the data from Twitter using the public API.
Even though we don't know what we are looking for at this stage, we still have an analytical goal: to explore the data and discover if, as we are led to believe, it really does contain useful information that could provide a benefit for our business.
Although there is no specific business decision under consideration, the customer services managers believe
that some analysis of the tweets sent by customers may reveal important information about how they perceive
the airline and the issues that matter to customers. The kinds of questions the team expects to answer are:
Are people talking about Blue Yonder Airlines on Twitter?
If so, are there any specific topics that regularly arise?
Of these topics, if any, is it possible to get a realistic view of which are the most important?
Does the process provide valid and useful information? If not, can it be refined to produce more
accurate and useful results?
If the results are valid and useful, can the process be incorporated into the company's BI systems to
make it repeatable?

Collecting and Uploading the Source Data
The BI team at Blue Yonder Airlines has been collecting data from Twitter and uploading it to Windows Azure blob storage, ready to begin the investigation. They have gathered sufficient data over a period of weeks to ensure
a suitable sample for analysis. The source data and results will be retained in Windows Azure blob storage, for
visualization and further exploration in Excel, after the investigation is complete.
A full description of the process the team uses to collect the data from Twitter is not included here. However,
Appendix A of this guide describes the ways that the team at Blue Yonder Airlines developed and refined the
data collection process over time, as they became aware that the analysis would be able to provide useful and
valid information for the business.
If you are just experimenting with data to see if it is useful, you probably won't want to spend inordinate amounts of time and resources building a complex or automated data ingestion mechanism. Sometimes it's just easier and quicker to use the interactive approach!
The HDInsight Infrastructure
The IT department at Blue Yonder Airlines does not have the capital expenditure budget to purchase new
servers for the project, especially as the hardware may only be required for the duration of the project.
Therefore, the most appropriate infrastructure approach is to provision an HDInsight cluster on Windows Azure
that can be released when no longer required, minimizing the overall cost of the project.
To further minimize cost the team began with a single-node HDInsight cluster, rather than accepting the default
of four nodes when initially provisioning the cluster. However, if necessary the team can provision a larger
cluster should the processing load indicate that more nodes are required. At the end of the investigation, if the
results indicate that valuable information can be extracted from the data, the team can fine-tune the solution to
use a cluster with the appropriate number of nodes by using the techniques described in the section Tuning the
Solution in Chapter 4 of this guide.

For details of the cost of running an HDInsight cluster, see the pricing details on the Windows
Azure HDInsight website. In addition to the cost of the cluster you must pay for a storage
account to hold the data for the cluster.
Iterative Data Processing
After capturing a suitable volume of data and uploading it to Windows Azure blob storage the analysts in the BI
team can configure an HDInsight cluster associated with the blob container holding the data, and then turn their
attention to processing it. In the absence of a specific analytical goal, the data processing follows an iterative
pattern in which the analysts explore the data to see if they find anything that indicates specific topics of
interest for customers, and then build on what they find to refine the analysis.
At a high level, the iterative process breaks down into three phases:
Explore: the analysts explore the data to determine what potentially useful information it contains.
Refine: when some potentially useful data is found, the data processing steps used to query the data are refined to maximize the analytical value of the results.
Stabilize: when a data processing solution that produces useful analytical results has been identified, it is stabilized to make it robust and repeatable.

The following sections of this chapter describe these stages in more detail.

Although many Big Data solutions will be developed using the stages described here, it's not mandatory. You may know exactly what information you want from the data, and how to extract it. Alternatively, if you don't intend to repeat the process there's no point in refining or stabilizing it.

Phase 1 Initial Data Exploration
In the first phase of the analysis, the team at Blue Yonder Airlines decided to explore the data speculatively by
coming up with a hypothesis about what information the data might be able to reveal, and using HDInsight to
process the data and generate results that validate the hypothesis. The goal for this phase is open-ended; the
exploration might result in a specific avenue of investigation that merits further refinement of a particular data
processing solution, or it might simply prove (or disprove) an assumption about the data.
Using Hive to Explore the Volume of Tweets
In most cases, the simplest way to start exploring data with HDInsight is to create a Hive table, and then query it
with HiveQL statements. The analysts at Blue Yonder Airlines created a Hive table based on the source data
obtained from Twitter. The following HiveQL code was used to define the table and load the source data into it.
HiveQL
CREATE EXTERNAL TABLE Tweets
(PubDate STRING,
TweetID STRING,
Author STRING,
Tweet STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/tweets';

LOAD DATA INPATH '/data/tweets.txt'
INTO TABLE Tweets;
The analysts hypothesized that the use of Twitter to communicate with the company is significant, and that the
volume of tweets that mention the company is growing. They therefore used the following query to determine
the daily volume and trend of tweets.
HiveQL
SELECT PubDate, COUNT(*) TweetCount
FROM Tweets
GROUP BY PubDate
SORT BY PubDate;
The results of this query are shown in the following table.
PubDate TweetCount
4/16/2013 1964
4/17/2013 2009
4/18/2013 2058
4/19/2013 2107
4/20/2013 2160
4/21/2013 2215
4/22/2013 2274
These results seem to validate the hypothesis that the volume of tweets is growing, and it may be worth refining this query to include a larger set of source data that spans a longer time period, and potentially including other aggregations in the results, such as the number of distinct authors that tweeted each day (a sketch of such a query is shown below). However, while this analytical approach might reveal some information about the importance of Twitter as a channel for customer communication, it doesn't provide any information about the specific topics that concern customers. To determine what's important to the airline's customers, the analysts must look more closely at the actual contents of the tweets.
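As noted above, one possible refinement of the daily-volume query would be to add a count of the distinct authors who tweeted each day. A sketch of such a query, based on the Tweets table defined earlier (and not tuned or validated against the full dataset), is shown here.
HiveQL
SELECT PubDate, COUNT(*) TweetCount,
COUNT(DISTINCT Author) AuthorCount
FROM Tweets
GROUP BY PubDate
SORT BY PubDate;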
Using Map/Reduce Code to Identify Common Hashtags
Twitter users often use hashtags to associate topics with their tweets. Hashtags are words prefixed with a #
character, and are typically used to filter searches and to find tweets about a specific subject. The analysts at
Blue Yonder Airlines hypothesized that analysis of the hashtags used in tweets addressed to the airline's account might reveal topics that are important to customers. The team decided to find the hashtags in the data, count the number of occurrences of each hashtag, and from this determine the main topics of interest.
Parsing the unstructured tweet text and identifying hashtags can be accomplished by implementing a custom
map/reduce solution. HDInsight includes a sample map/reduce solution called WordCount that counts the
number of words in a text source, and this can be easily adapted to count hashtags. The sample is provided in
various forms including Java, JavaScript, and C#. The developers at Blue Yonder Airlines are familiar with
JavaScript, and so they decided to implement the map/reduce code by modifying the JavaScript version of the
sample.
There are many ways to create and execute map/reduce functions, though often you can use Hive or Pig
directly instead of resorting to writing map/reduce code. However, there is a map/reduce code example available
that we can use as a starting point, which saves us time and effort. For simplicity we execute the code within Pig,
instead of executing it as a map/reduce job directly.
The script consists of a map function that runs on each cluster node in parallel and parses the text input to
create a key/value pair with the value of 1 for every word found. It was relatively simple to modify the code to
filter the words so that only those prefixed by # are included. These key/value pairs are passed to the reduce
function, which counts the total number of instances of each key. Therefore, the result is the number of times each hashtag was mentioned in the source data. The modified JavaScript code is shown below.
JavaScript
// Map function - runs on all nodes
var map = function (key, value, context) {
// Split the data into an array of words
var hashtags = value.split(/[^0-9a-zA-Z#_]/);

// Loop through array creating a name/value pair with
// the value 1 for each word that starts with '#'
for (var i = 0; i < hashtags.length; i++) {
if (hashtags[i].substring(0, 1) == "#") {
context.write(hashtags[i].toLowerCase(), 1);
}
}
};

// Reduce function - runs on reducer node(s)
var reduce = function (key, values, context) {
var sum = 0;
// Sum the counts of each tag found in the map function
while (values.hasNext()) {
sum += parseInt(values.next());
}
context.write(key, sum);
};
The script was uploaded to the cluster as tagcount.js using the fs.put command, and executed by running the
following Pig command from the interactive console in an HDInsight server.
Interactive JavaScript Console
pig.from("/twitterdata/tweets")
.mapReduce("/twitterdata/tagcount.js",
"tag, count:long").orderBy("count DESC")
.take(5).to("/twitterdata/tagcounts");

Note that the statement in this listing and statements in listings elsewhere in this chapter
have been broken down over several lines due to limitations of the width of the printed
page.

This command executes the map/reduce job on the files in the /twitterdata/tweets folder (used by the tweets
table that was created previously), sorts the results by the number of instances of each hashtag found in the
data, filters the results to include only the top five hashtags, and stores the results in the
/twitterdata/tagcounts folder.
The job generates a file named part-r-00000 containing the top five most common hashtags and the total
number of instances of each one. To visualize the results, the analysts used the following commands in the
Interactive JavaScript console.
Interactive JavaScript
file = fs.read("/twitterdata/tagcounts");
data = parse(file.data, "tag, count:long");
graph.pie(data);

The pie chart generated by these commands is shown in Figure 2.

Figure 2
Results of a map/reduce job to find the top five hashtags

The pie chart reveals that the top five hashtags are:
#seatac (2004 occurrences)
#jfk (2003 occurrences)
#blueyonderairlines (1604 occurrences)
#blueyonder (1604 occurrences)
#blueyonderpoints (802 occurrences)

However, these results are not particularly useful in trying to identify the most common topics discussed in the
tweets, and so the analysts decided to widen the analysis beyond hashtags and try to identify the most common
words used in the text content of the tweets.
Using Pig to Group and Summarize Word Counts
It would be possible to use the original WordCount sample to count the number of occurrences per word in the
tweets, but the analysts decided to use Pig in order to simplify any customization of the code that may be
required in the future. Pig provides a workflow-based approach to data processing that is ideal for restructuring
and summarizing data, and a Pig Latin script that performs the aggregation is syntactically much easier to create
than implementing the equivalent custom map/reduce code.
Pig makes it easy to write procedural workflow-style code that builds a result by repeated operations on
a dataset. It also makes it easier to debug queries and transformations because you can dump the intermediate
results of each operation in the script to a file. However, before you commit a Pig Latin script to final production
where it will be executed repeatedly, make sure you apply performance optimizations to it as described in
Chapter 4 of this guide.
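For example, while developing the script shown below, a temporary statement such as the following (which writes to a hypothetical debug folder) could be added after any step to store the contents of an intermediate relation for inspection.
Pig Latin
-- temporarily store an intermediate relation so it can be examined
STORE FilteredWords INTO '/twitterdata/debug/filteredwords';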
The Pig Latin code created by the analysts is shown below. It uses the source data files in the
/twitterdata/tweets folder to group the number of occurrences of each word, and then sorts the results in
descending order of occurrences and stores the first 100 results in the /twitterdata/wordcounts folder.
Pig Latin
-- load tweets
Tweets = LOAD '/twitterdata/tweets' AS (date, id, author, tweet);

-- split tweet into words
TweetWords = FOREACH Tweets GENERATE FLATTEN(TOKENIZE(tweet)) AS word;

-- clean words by removing punctuation
CleanWords = FOREACH TweetWords GENERATE
LOWER(REGEX_EXTRACT(word, '[a-zA-Z]*', 0)) as word;

-- filter text to eliminate empty strings.
FilteredWords = FILTER CleanWords BY word != '';

-- group by word
GroupedWords = GROUP FilteredWords BY (word);

-- count mentions per group
CountedWords = FOREACH GroupedWords GENERATE group,
COUNT(FilteredWords) as count;

-- sort by count
SortedWords = ORDER CountedWords BY count DESC;

-- limit results to the top 100
Top100Words = LIMIT SortedWords 100;

-- store the results as a file
STORE Top100Words INTO '/twitterdata/wordcounts';
This script is saved as WordCount.pig and executed from the Pig installation folder on the cluster by using the
following command in the Hadoop command console:
Command Line
pig C:\Scripts\WordCount.pig
When the script has completed successfully, the results are stored in a file named part-r-00000 in the
/twitterdata/wordcounts folder. These can be viewed using the #cat command. Partial results are shown
below.
Extract from /twitterdata/wordcounts/part-r-00000
my 3437
delayed 2749
flight 2749
to 2407
entertainment 2064
the 2063
a 2061
delay 1720
of 1719
bags 1718
These results show that the word count approach has the potential to reveal some insights. For example, the high number of occurrences of "delayed" and "delay" is likely to be relevant in determining common customer concerns. However, the solution needs to be modified to restrict the output to include only significant words,
which will improve its usefulness. To accomplish this the analysts decided to refine it to produce accurate and
meaningful insights into the most common words used by customers when communicating with the airline by
Twitter.
Phase 2 Refining the Data Processing Solution
After you find a question that does provide some useful insight into the data, it's usual to attempt to refine it to
maximize the relevance of the results, minimize the processing overhead, and therefore improve its value. This
is typically the second phase of an analysis process.

Typically you will only refine solutions that you intend to repeat on a reasonably regular
basis. Be sure that they are still correct and produce valid results afterwards. For example,
compare the outputs of the original and the refined solutions.
Having determined that counting the number of occurrences of each word has the potential to reveal common
topics of customer tweets, the analysts at Blue Yonder Airlines decided to refine the data processing solution to
improve its accuracy and usefulness. The starting point for this refinement is the WordCount.pig Pig Latin script
described previously.
Enhancing the Pig Latin Code to Exclude Noise Words
The results generated by the WordCount.pig script contain a large volume of everyday words such as "a", "the", "of", and so on. These occur frequently in every tweet but provide little or no semantic value in trying to identify the most common topics discussed in the tweets. Additionally, there are some domain-specific words such as "flight" and "airport" that, within the context of discussions around airlines, are so common that their usefulness in identifying topics of conversation is minimal. These noise words (sometimes called stop words) can
be disregarded when analyzing the tweet content.
The analysts decided to create a file containing the stop list of words to be excluded from the analysis, and then
use it to filter the results generated by the Pig Latin script. This is a common way to filter words out of searches
and queries. An extract from the noise word file created by the analysts at Blue Yonder Airlines is shown below.
This file was saved as noisewords.txt in the /twitterdata folder in the HDInsight cluster.
Extract from /twitterdata/noisewords.txt
a
i
as
do
go
in
is
it
my
the
that
what
flight
airline
terrible
terrific

You can obtain lists of noise words from various sources such as TextFixer and Armand Brahaj's blog. If you have
installed SQL Server you can start with the list of noise words that are included in the Resource database. For
more information, see Configure and Manage Stopwords and Stoplists for Full-Text Search.
With the noise words file in place, the analysts modified the WordCount.pig script to use a LEFT OUTER JOIN
matching the words in the tweets with the words in the noise list. Only words with no matching entry in the
noise words file are then included in the aggregated results. The modified script is shown below.
Pig Latin
-- load tweets
Tweets = LOAD '/twitterdata/tweets' AS (date, id, author, tweet);

-- split tweet into words
TweetWords = FOREACH Tweets GENERATE FLATTEN(TOKENIZE(tweet)) AS word;

-- clean words by removing punctuation
CleanWords = FOREACH TweetWords GENERATE
LOWER(REGEX_EXTRACT(word, '[a-zA-Z]*', 0)) as word;

-- filter text to eliminate empty strings.
FilteredWords = FILTER CleanWords BY word != '';

-- load the noise words file
NoiseWords = LOAD '/twitterdata/noisewords.txt' AS noiseword:chararray;

-- join the noise words file using a left outer join
JoinedWords = JOIN FilteredWords BY word LEFT OUTER,
NoiseWords BY noiseword USING 'replicated';

-- filter the joined words so that only words with
-- no matching noise word remain
UsefulWords = FILTER JoinedWords BY noiseword IS NULL;

-- group by word
GroupedWords = GROUP UsefulWords BY (word);

-- count mentions per group
CountedWords = FOREACH GroupedWords GENERATE group,
COUNT(UsefulWords) as count;
-- sort by count
SortedWords = ORDER CountedWords BY count DESC;

-- limit results to the top 100
Top100Words = LIMIT SortedWords 100;

-- store the results as a file
STORE Top100Words INTO '/twitterdata/noisewordcounts';
Partial results from this version of the script are shown below.
Extract from /twitterdata/noisewordcounts/part-r-00000
delayed 2749
entertainment 2064
delay 1720
bags 1718
service 1718
time 1375
vacation 1031
food 1030
wifi 1030
connection 688
seattle 688
bag 687
These results are more useful than the previous output, which included the noise words. However, the analysts
have noticed that semantically equivalent words are counted separately. For example, in the results shown above, "delayed" and "delay" both indicate that customers are concerned about delays, while "bags" and "bag" both indicate concerns about baggage.
Using a User-Defined Function to Find Similar Words
After some consideration, the analysts realized that multiple words with the same stem (for example "delay" and "delayed"), different words with the same meaning (for example "baggage" and "luggage"), and mistyped or misspelled words (for example "bagage") will skew the results, potentially causing topics that are important to customers to be obscured because many differing terms are used as synonyms. To address this problem, the analysts initially decided to try to find similar words by calculating the Jaro distance between every word combination in the source data.
The Jaro distance is a value between 0 and 1 used in statistical analysis to indicate the level
of similarity between two words. For more information about the algorithm used to calculate
Jaro distance, see Jaro-Winkler distance on Wikipedia.
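For example, for the pair "bags" (four characters) and "bag" (three characters) there are three matching characters and no transpositions, so the Jaro distance is (1/3) x (3/4 + 3/3 + 3/3), which is approximately 0.917 and matches the value shown for this pair in the results later in this section.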
There is no built-in Pig function to calculate Jaro distance, so the analysts recruited the help of a Java developer
to create a user-defined function. The code for the user-defined function is shown below:
Java
package WordDistanceUDF;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class WordDistance extends EvalFunc<Double>
{
public Double exec(final Tuple input) throws IOException{
if (input == null || input.size() == 0)
return null;
try
{
final String firstWord = (String)input.get(0);
final String secondWord = (String)input.get(1);
return getJaroSimilarity(firstWord, secondWord);
}
catch(Exception e)
{
throw new IOException("Caught exception processing input row ", e);
}
}

private Double getJaroSimilarity(final String firstWord, final String secondWord)
{
double defMismatchScore = 0.0;
if (firstWord != null && secondWord != null)
{
// get half the length of the string rounded up
//(this is the distance used for acceptable transpositions)

final int halflen = Math.min(firstWord.length(),
secondWord.length()) / 2 + 1;
//get common characters
final StringBuilder comChars =
getCommonCharacters(firstWord, secondWord, halflen);

final int commonMatches = comChars.length();

//check for zero in common
if (commonMatches == 0)
{
return defMismatchScore;
}

final StringBuilder comCharsRev =
getCommonCharacters(secondWord, firstWord, halflen);
//check for same length common strings, returning 0.0 if they are not the same length
if (commonMatches != comCharsRev.length())
{
return defMismatchScore;
}
//get the number of transpositions
int transpositions = 0;
for (int i = 0; i < commonMatches; i++)
{
if (comChars.charAt(i) != comCharsRev.charAt(i))
{
transpositions++;
}
}
//calculate jaro metric
transpositions = transpositions / 2;
defMismatchScore = commonMatches / (3.0 * firstWord.length()) +
commonMatches / (3.0 * secondWord.length()) +
(commonMatches - transpositions) / (3.0 * commonMatches);
}
return defMismatchScore;
}

private static StringBuilder getCommonCharacters(final String firstWord,
final String secondWord,
final int distanceSep)
{
if (firstWord != null && secondWord != null)
{
final StringBuilder returnCommons = new StringBuilder();
final StringBuilder copy = new StringBuilder(secondWord);
for (int i = 0; i < firstWord.length(); i++)
{
final char character = firstWord.charAt(i);
boolean foundIt = false;
for (int j = Math.max(0, i - distanceSep);
!foundIt && j < Math.min(i + distanceSep, secondWord.length());
j++)
{
if (copy.charAt(j) == character)
{
foundIt = true;
returnCommons.append(character);
copy.setCharAt(j, '#');
}
}
}
return returnCommons;
}
return null;
}
}
This function was compiled, packaged as WordDistanceUDF.jar, and saved in the Pig installation folder on the
HDInsight cluster. Next, the analysts created a Pig Latin script that generates a list of all non-noise word
combinations in the tweet source data and uses the function to calculate the Jaro distance between each
combination. This script is shown below:
Pig Latin
-- register custom jar
REGISTER WordDistanceUDF.jar;

-- load tweets
Tweets = LOAD '/twitterdata/tweets' AS (date, id, author, tweet);

-- split tweet into words
TweetWords = FOREACH Tweets GENERATE
FLATTEN(TOKENIZE(tweet)) AS word;

-- clean words by removing punctuation
CleanWords = FOREACH TweetWords GENERATE
LOWER(REGEX_EXTRACT(word, '[a-zA-Z]*', 0)) as word;

-- filter text to eliminate empty strings.
FilteredWords = FILTER CleanWords BY word != '';

-- load the noise words file
NoiseWords = LOAD '/twitterdata/noisewords.txt' AS noiseword:chararray;

-- join the noise words file using a left outer join
JoinedWords = JOIN FilteredWords BY word LEFT OUTER,
NoiseWords BY noiseword USING 'replicated';

-- filter the joined words so that only words
-- with no matching noise word remain
UsefulWords = FILTER JoinedWords BY noiseword IS NULL;

-- group by word
GroupedWords = GROUP UsefulWords BY (word);

-- count mentions per group
WordList = FOREACH GroupedWords GENERATE group as word:chararray;

-- sort by count
SortedWords = ORDER WordList BY word;

-- create a duplicate set
SortedWords2 = FOREACH SortedWords GENERATE word AS word:chararray;

-- cross join to create every combination of pairs
CrossWords = CROSS SortedWords, SortedWords2;

-- find the Jaro distance
WordDistances = FOREACH CrossWords GENERATE
SortedWords::word as word1:chararray,
SortedWords2::word as word2:chararray,
WordDistanceUDF.WordDistance(SortedWords::word,
SortedWords2::word) AS jarodistance:double;

-- filter out word pairs with jaro distance less than 0.9
MatchedWords = FILTER WordDistances BY jarodistance >= 0.9;

-- store the results as a file
STORE MatchedWords INTO '/twitterdata/matchedwords';

Notice that the script filters the results to include only word combinations with a Jaro
distance value of 0.9 or higher.

To view the results of running the script, the analysts used the Data Explorer add-in in Excel to connect to the
HDInsight cluster and create a table from the part-r-00000 file that was generated in the
/twitterdata/matchedwords folder, as shown in Figure 3.

Figure 3
Using the Data Explorer add-in to explore Pig output
Partial results in the output file are shown below.
Extract from /twitterdata/matchedwords/part-r-00000
bag bag 1.0
bag bags 0.9166666666666665
baggage baggage 1.0
bagage bagage 1.0
bags bag 0.9166666666666665
bags bags 1.0
delay delay 1.0
delay delayed 0.9047619047619047
delay delays 0.9444444444444444
delayed delay 0.9047619047619047
delayed delayed 1.0
delays delay 0.9444444444444444
delays delays 1.0
seat seat 1.0
seat seats 0.9333333333333333
seated seated 1.0
seats seat 0.9333333333333333
seats seats 1.0
The results include a row for every word matched to itself with a Jaro score of 1.0, and two rows for each
combination of words with a score of 0.9 or above (one row for each possible word order). Close examination of
these results reveals that, while the code has successfully matched some words appropriately (for example, "bag" / "bags", "delay" / "delays", "delay" / "delayed", and "seat" / "seats"), it has failed to match some others (for example, "delays" / "delayed", "bag" / "baggage", "bagage" / "baggage", and "seat" / "seated").
The analysts experimented with the Jaro value used to filter the results, lowering it to achieve more matches.
However, in doing so they found that the number of false positives increased. For example, lowering the filter score to 0.85 matched "delays" to "delayed" and "seats" to "seated", but also matched "seated" to "seattle".
The results we obtained, and the attempts to improve them by adjusting the matching algorithm, reveal
just how difficult it is to infer semantics and sentiment from free-form text. In the end we realized that it would
require some type of human intervention, in the form of a manually maintained synonyms list.
Using a Lookup File to Combine Matched Words
The results generated by the user-defined function were not sufficiently accurate to rely on in a fully automated
process for matching words, but they did provide a useful starting point for creating a custom lookup file for
words that should be considered synonyms when analyzing the tweet contents. The analysts reviewed and
extended the results produced by the function in Excel and created the following tab-delimited text file, which
was saved as synonyms.txt in the /twitterdata folder on the HDInsight cluster.
Extract from /twitterdata/synonyms.txt
bag bag
bag bags
bag bagage
bag baggage
bag luggage
bag lugage
delay delay
delay delayed
delay delays
drink drink
drink drinks
drink drinking
drink beverage
drink beverages
food food
food feed
food eat
food eating
movie movie
movie movies
movie film
movie films
seat seat
seat seats
seat seated
seat seating
seat sit
seat sitting
seat sat
The first column in this file contains the list of leading values that should be used to aggregate the results. The
second column contains synonyms that should be converted to the leading values for aggregation.
With this synonyms file in place, the Pig Latin script used to count the words in the tweet contents was modified
to use a LEFT OUTER JOIN between the words in the source tweets (after filtering out the noise words) and the
words in the synonyms file to find the leading values for each matched word. A UNION clause is then used to
combine the matched words with words that are not present in the synonyms file. The revised Pig Latin script is
shown here.
Pig Latin
-- load tweets
Tweets = LOAD '/twitterdata/tweets' AS (date, id, author, tweet);

-- split tweet into words
TweetWords = FOREACH Tweets GENERATE FLATTEN(TOKENIZE(tweet)) AS word;

-- clean words by removing trailing periods
CleanWords = FOREACH TweetWords GENERATE
LOWER(REGEX_EXTRACT(word, '[a-zA-Z]*', 0)) as word;

-- filter text to eliminate empty strings.
FilteredWords = FILTER CleanWords BY word != '';

-- load the noise words file
NoiseWords = LOAD '/twitterdata/noisewords.txt' AS noiseword:chararray;

-- join the noise words file using a left outer join
JoinedWords = JOIN FilteredWords BY word LEFT OUTER,
NoiseWords BY noiseword USING 'replicated';

-- filter the joined words so that only words with
-- no matching noise word remain
UsefulWords = FILTER JoinedWords BY noiseword IS NULL;

-- Match synonyms
Synonyms = LOAD '/twitterdata/synonyms.txt'
AS (leadingvalue:chararray, synonym:chararray);
WordsAndSynonyms = JOIN UsefulWords BY word LEFT OUTER,
Synonyms BY synonym USING 'replicated';
UnmatchedWords = FILTER WordsAndSynonyms BY synonym IS NULL;
UnmatchedWordList = FOREACH UnmatchedWords GENERATE word;
MatchedWords = FILTER WordsAndSynonyms BY synonym IS NOT NULL;
MatchedWordList = FOREACH MatchedWords GENERATE leadingvalue as word;
AllWords = UNION MatchedWordList, UnmatchedWordList;

-- group by word
GroupedWords = GROUP AllWords BY (word);

-- count mentions per group
CountedWords = FOREACH GroupedWords GENERATE group as word,
COUNT(AllWords) as count;

-- sort by count
SortedWords = ORDER CountedWords BY count DESC;

-- limit results to the top 100
Top100Words = LIMIT SortedWords 100;

-- store the results as a file
STORE Top100Words INTO '/twitterdata/synonymcounts';

Partial results from this script are shown below.
Extract from /twitterdata/synonymcounts/part-r-00000
delay 4812
bag 3777
seat 2404
service 1718
movie 1376
vacation 1375
time 1375
entertainment 1032
food 1030
wifi 1030
connection 688
seattle 688
drink 687

In these results, semantically equivalent words are combined into a single leading value for aggregation. For
example, the counts for seat, seats, and seated are combined into a single count for seat. This makes the
results more useful in terms of identifying topics that are important to customers. For example, it is apparent
from these results that the top three subjects that customers have tweeted about are delays, bags, and seats.
Moving from Topic Lists to Sentiment Analysis
At this point the team has a valuable and seemingly valid list of the most important topics that customers are
discussing. However, the results do not include any indication of sentiment. While it's reasonably safe to assume
that the majority of people discussing delays (the top term) will not be happy customers, it's more difficult to
detect the sentiment for a top item such as bag or seat. It's not possible to tell from this data whether most
customers are happy or are complaining.
If you are considering implementing a sentiment analysis mechanism, stop and think about how you
might categorize text extracts such as "Still in Chicago airport due to flight delays - what a great trip this is
turning out to be!" It's not as easy as it looks, and the results are only ever likely to be approximations of reality.
Sentiment analysis is difficult. As an extreme example, a tweet containing text such as "no wifi, thanks for that
Blue Yonder" could express either positive or negative sentiment. There are solutions available that are specifically
designed to accomplish sentiment analysis, and the team might also investigate using the rudimentary
sentiment analysis capabilities of the Twitter search API (see Appendix A for more details).
However, they decided that the list of topics provided by the analysis they've done so far would probably be
better employed in driving more focused customer research, such as targeted questionnaires and online
surveys, perhaps through external agencies that specialize in this field.
Phase 3 - Stabilizing the Solution
With the data processing solution refined to produce useful insights, the analysts at Blue Yonder Airlines decided
to stabilize the solution to make it more robust and repeatable. Key concerns about the solution include the
dependencies on hard-coded file paths, the data schemas in the Pig Latin scripts, and the data ingestion process.
With the current solution, any changes to the location or format of the tweets.txt, noisewords.txt, or
synonyms.txt files would break the current scripts. Such changes are particularly likely to occur if the solution
gains acceptance among users and Hive tables are created on top of the files to provide a more convenient
query interface.
Using HCatalog to Create Hive Tables
HCatalog provides an abstraction layer between data processing tools such as Hive, Pig, and custom map/reduce
code and the file system storage of the data. By using HCatalog to manage data storage the analysts hoped to
remove location and format dependencies in Pig Latin scripts, while making it easy for users to access the
analytical data directly through Hive.
The analysts had already created a table named Tweets for the source data, and now decided to create tables
for the noise words file and the synonyms file. Additionally, they decided to create a table for the results
generated by the WordCount.pig script to make it easier to perform further queries against the aggregated
word counts. To accomplish this the analysts created the following script file containing HiveQL statements,
which they saved as CreateTables.hcatalog on the HDInsight server.
HiveQL
CREATE EXTERNAL TABLE NoiseWords (noiseword STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/noisewords';
CREATE EXTERNAL TABLE Synonyms (leadingvalue STRING, synonym STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/synonyms';
CREATE EXTERNAL TABLE TopWords (word STRING, count BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/topwords';
To execute this script, the following command was executed in the HCatalog installation directory.
Command Line
hcat.py -f C:\Scripts\CreateTables.hcatalog
After the script had successfully created the new tables, the noisewords.txt and synonyms.txt files were moved
into the folders used by the corresponding tables by executing the following commands.
Command Line
hadoop dfs -mv /twitterdata/noisewords.txt /twitterdata/noisewords/noisewords.txt
hadoop dfs -mv /twitterdata/synonyms.txt /twitterdata/synonyms/synonyms.txt
Using HCatalog in a Pig Latin Script
Moving the noisewords.txt and synonyms.txt files made the data available for querying through the corresponding
Hive tables, but it broke the WordCount.pig script, which includes hard-coded paths to these files. The script also uses
hard-coded paths to load the tweets.txt file from the folder used by the Tweets table, and to save the results.
This dependency on hard-coded paths makes the data processing solution vulnerable to changes in the way that
data is stored. For example, if at a later date an administrator modifies the Tweets table to partition the data,
the code to load the source data would no longer work. Additionally, if the source data was modified to use a
different format or schema, the script would need to be modified accordingly.
One of the major advantages of using HCatalog is that I can relocate data as required without breaking
all the scripts and code that accesses the files.
To eliminate the dependency, the analysts modified the WordCount.pig script to use HCatalog classes to load
and save data in Hive tables instead of accessing the source files directly. The modified script is shown below.
Pig Latin
-- load tweets using HCatalog
Tweets = LOAD 'Tweets' USING org.apache.hcatalog.pig.HCatLoader();

-- split tweet into words
TweetWords = FOREACH Tweets GENERATE FLATTEN(TOKENIZE(tweet)) AS word;

-- clean words by removing trailing periods
CleanWords = FOREACH TweetWords GENERATE
LOWER(REGEX_EXTRACT(word, '[a-zA-Z]*', 0)) as word;

-- filter text to eliminate empty strings.
FilteredWords = FILTER CleanWords BY word != '';

-- load the noise words file using HCatalog
NoiseWords = LOAD 'NoiseWords'
USING org.apache.hcatalog.pig.HCatLoader();

-- join the noise words file using a left outer join
JoinedWords = JOIN FilteredWords BY word LEFT OUTER,
NoiseWords BY noiseword USING 'replicated';

-- filter the joined words so that only words with
-- no matching noise word remain
UsefulWords = FILTER JoinedWords BY noiseword IS NULL;

-- Match synonyms using data loaded through HCatalog
Synonyms = LOAD 'Synonyms'
USING org.apache.hcatalog.pig.HCatLoader();
WordsAndSynonyms = JOIN UsefulWords BY word LEFT OUTER,
Synonyms BY synonym USING 'replicated';
UnmatchedWords = FILTER WordsAndSynonyms BY synonym IS NULL;
UnmatchedWordList = FOREACH UnmatchedWords GENERATE word;
MatchedWords = FILTER WordsAndSynonyms BY synonym IS NOT NULL;
MatchedWordList = FOREACH MatchedWords GENERATE leadingvalue as word;
AllWords = UNION MatchedWordList, UnmatchedWordList;

-- group by word
GroupedWords = GROUP AllWords BY (word);

-- count mentions per group
CountedWords = FOREACH GroupedWords GENERATE group as word,
COUNT(AllWords) as count;

-- sort by count
SortedWords = ORDER CountedWords BY count DESC;

-- limit results to the top 100
Top100Words = LIMIT SortedWords 100;

-- store the results as a file using HCatalog
STORE Top100Words INTO 'TopWords'
USING org.apache.hcatalog.pig.HCatStorer();
The script no longer includes any hard-coded paths to data files or schemas for the data as it is loaded or stored,
and instead uses HCatalog to reference the Hive tables created previously. To execute this script the following
commands are used to set the HCAT_HOME environment variable and use HCatalog from Pig.
Command Line
SET HCAT_HOME=C:\apps\dist\hcatalog-0.4.1
Pig -useHCatalog C:\Scripts\WordCount.pig
The results of the script are stored in the TopWords table, and can be viewed by executing a HiveQL query such
as the following example.
HiveQL
SELECT * FROM TopWords;
Finalizing the Data Ingestion Process
Having stabilized the solution, the team finally examined ways to improve and automate the way that data is
extracted from Twitter and loaded into Windows Azure blob storage. The solution they adopted uses SQL Server
Integration Services with a package that connects to the Twitter streaming API every day to download new
tweets, and uploads these as a file to blob storage.
For more information about the three approaches that the Blue Yonder Airlines team experimented with for
ingesting the source data, see Appendix A of this guide.
Fine Tuning the Solution
The Blue Yonder Airlines team also plans to further refine the process over time by experimenting with some
common optimization techniques. Areas that the team will investigate include:
Using the SEQUENCEFILE format for the data instead of text. For more information about using
Sequence Files, see the section Resolving Performance Limitations by Optimizing the Query in Chapter
4 of this guide.
Applying compression to the data in the intermediate stages of the query, and compressing the source
data. For more information about using compression, see the section Resolving Performance
Limitations by Compressing the Data in Chapter 4 of this guide.
Loading the data directly into Windows Azure blob storage instead of using the LOAD DATA command.
For more information about loading Hive tables, see the section Loading Data into Hive Tables in
Chapter 4 of this guide.
Reviewing the columns used in the query. For example, using only the Tweet column rather than storing
other columns that are not used in the current process. This might be possible by adding a Hive staging
table and preprocessing the data. However, the team needs to consider that the data in other columns
may be useful in the future.

Consuming the Results
Now that the solution has generated some useful information, the team can decide how to consume the results
for analysis or reporting. Storing the results of the Pig data processing jobs generated by the WordCount.pig
script in a Hive table makes it easy for analysts to consume the data from Excel.
In time we might create our own custom tools to analyze the results and generate business information
by combining it with other data we have available in our BI systems. We are also considering using a custom
application to generate charts and graphics that can be embedded in other applications. Because we can access
the results in Hive using the ODBC driver, there is almost no limit to what we can achieve!
To enable access to Hive from Excel, the ODBC driver for Hive has been installed on all of the analysts'
workstations and a data source name (DSN) has been created for Hive on the HDInsight cluster. The analysts can
use the Data Connection Wizard in Excel to connect to HDInsight using the DSN, and import data from the
TopWords table, as shown in Figure 4.

Figure 4
Using the Data Connection Wizard to access a Hive table from Excel
After the data has been imported into a worksheet, the analysts can use the full capabilities of Excel to explore
and visualize it, as shown in Figure 5.


Figure 5
Visualizing results in Excel
Future integration of the data with the company's existing BI system is planned. To see examples of how you can
integrate data from HDInsight with an existing BI system, see the scenario Enterprise BI Integration in this
guide and in the related Hands-on Labs available for this guide.
Summary
In this scenario you have seen an example of a Big Data analysis process that includes obtaining source data and
uploading it to Windows Azure blob storage; processing the data by using custom map/reduce code, Pig, and
Hive; stabilizing the solution using HCatalog; and analyzing the results of the data processing in the HDInsight
interactive console and in Microsoft Excel.
The scenario demonstrates how Big Data analysis is often an iterative process. It starts with open-ended
exploration of the data, which is then refined and stabilized to create an analytical solution. In the scenario
you've seen in this chapter, the exploration led to information about the topics that are important to customers.
From here, further iterations could aggregate the word counts by date to identify trends over periods of time, or
more complex semantic analysis could be performed on the tweet content to determine customer sentiment.
A key benefit of using HDInsight for this kind of data exploration is that the Windows Azure-based cluster can
be created, expanded, and deleted as required, enabling you to manage costs by paying only for the processing
and storage resources that you use, and eliminating the need to configure and manage your own hardware and
software resources or to modify your existing internal BI infrastructure.
If you would like to work through the solution to this scenario yourself, you can download the files and a step-
by-step guide. These are included in the associated Hands-on Labs available from
http://wag.codeplex.com/releases/view/103405.



Scenario 2: Extract, Transform, Load
In this scenario HDInsight is used to perform an Extract, Transform, and Load (ETL) process on data to validate it
for correctness, detect possible fraudulent transactions, and then populate a database table. Specifically, this
scenario demonstrates how to:
Use custom map/reduce components to extract and transform XML formatted data.
Use Hive queries to validate data and detect anomalies.
Use Sqoop to load the final data into a Windows Azure SQL Database table.
Encapsulate ETL tasks in an Oozie workflow.
Use the Microsoft .NET SDK for Hadoop to automate an ETL workflow in HDInsight.


This chapter describes the scenario in general, and the solutions applied to meet the goals
of the scenario. If you would like to work through the solution to this scenario yourself, you
can download the files and a step-by-step guide. These are included in the related Hands-
on Labs available from http://wag.codeplex.com/releases/view/103405.
Introduction to Northwind Traders
This scenario is based on a fictitious company named Northwind Traders that has stores located in several cities
in North and South America. These stores sell a wide range of goods to trade customers that have accounts with
the company, and the stores also accept payment from retail customers in cash and as credit/debit card
payments.
Northwind is a franchise operation, and so does not directly own or control the stores. Each store has its own
internal accounting mechanisms, but must report transactions daily to Northwind so that it can monitor the
operations of the business centrally, understand trading patterns, and plan for future growth.
The administrative team at the Northwind Traders head office is responsible for processing the lists of
transactions and storing them in the company's central database. As part of this processing, the team will
ensure that the data is valid and also attempt to identify any fraudulent activity.
ETL Process Goals and Data Sources
Northwind is primarily a management and marketing organization, and it has a considerable presence on the
web. To minimize costs and to accommodate future business expansion the management team at Northwind
decided to take advantage of the elastic scalability of Windows Azure, and uses a range of Windows Azure
services to host all of its IT operations. This includes using Windows Azure SQL Database to store the corporate
data.
Each night, all of the Northwind stores submit a summary of the days transactions electronically. This is an XML
document that is captured by the head office and stored in Windows Azure blob storage in an HDInsight cluster
managed by the administrators at the head office.
The head office maintains a list of stores, together with the opening hours of each one, and uses this
information to generate a unique store ID for each transaction before loading the data into the head office SQL
Database. As part of the processing of each transaction file, the head office also wants to verify that the sales
occurred during store opening hours in order to detect any fraudulent activity. In addition, any sale over $10,000
in total value is to be flagged in the database so that audit reports can identify these transactions for possible
investigation.
After the data has been transformed, validated, and loaded into the database the head office team will be able
to use Windows Azure SQL Reporting and other reporting tools to examine the data, generate reports, create
visualizations, power business dashboards, and provide other management information. These tasks are not
explored in this scenario, which concentrates only on the ETL process.

For more information about how you can generate reports and visualizations from data after
loading it into a database, see Chapter 5 of this guide.
The ETL Workflow
To understand the ETL workflow requirements the administrative team at Northwind Traders started by
examining the source format of the store transaction data, and the schema of the target Windows Azure SQL
Database table, to identify the tasks that must be performed to transfer the data from source to target.
The Source Schema
The source files contain XML representations of transactions for a specific store on a specific day, as shown in
the following example.
XML
<?xml version="1.0" encoding="utf-8"?>
<transactions>
<transaction>
<date>1/1/2013 10:25:56 AM</date>
<location>1AREPE</location>
<cashsale>
<saleAmount>8849.38</saleAmount>
</cashsale>
</transaction>
<transaction>
<date>1/1/2013 10:26:08 AM</date>
<location>1AREPE</location>
<storecardfirstissue>
<cardNumber>xxxx-xxxx-xxxx-5809</cardNumber>
</storecardfirstissue>
</transaction>
<transaction>
<date>1/1/2013 10:26:41 AM</date>
<location>1AREPE</location>
<creditaccountsale>
<account>891116</account>
<accountLoc>1LIMPE</accountLoc>
<saleAmount>7295.49</saleAmount>
</creditaccountsale>
</transaction>
</transactions>
The XML source includes a <transaction> element for each transaction, which includes the date and time in a
<date> element and the store location in a coded form that specifies the region-specific store number, the city,
and the country or region. There are multiple possible transaction types including cash sales, credit sales, and
store card issues, each with their own set of properties.

Notice that the different types of transactions have different schemas. For example, store card sales
include the card number, while credit account sales include the account number. Cash sales include neither of
these values. The processing mechanism we implement must be able to cope with this variability and generate
output that conforms to our target schema.

The Target Schema
The target table for the ETL process is a Windows Azure SQL Database table created with the following Transact-
SQL code.
Transact-SQL
CREATE TABLE [dbo].[StoreTransaction]
(
[TransactionLocationId] [int] NULL,
[TransactionDate] [date] NULL,
[TransactionTime] [time](7) NULL,
[TransactionType] [nvarchar](256) NULL,
[Account] [nvarchar](256) NULL,
[AccountLocationId] [int] NULL,
[CardNumber] [nvarchar](256) NULL,
[Amount] [float] NULL,
[Description] [nvarchar](256) NULL,
[Audit] [tinyint] NULL
);
GO

CREATE CLUSTERED INDEX [IX_LocationId_Date_Time]
ON [dbo].[StoreTransaction]
(
[TransactionLocationId] ASC,
[TransactionDate] ASC,
[TransactionTime] ASC
);
GO
Notice that the target table includes a numeric TransactionLocationId column (of type integer), so the
alphanumeric store code in the source data must be converted to a unique numeric identifier during the ETL
process. The process can look up the store identifiers based on the country or region, city, and local store
number in local reference tables. Additionally, the target table includes separate TransactionDate and
TransactionTime columns for the date and time of the transaction, which must be derived from the <date>
value in the source XML.
Transaction records that have missing values for the store location or the date cannot be loaded into the target
table, and must be redirected to an alternative location where they can be reviewed by a data steward. Finally,
notice that the target table includes an Audit column that indicates rows that should be audited by a manager,
based on business rules about transactions that occur outside standard store opening hours or that involve
amounts above a threshold of $10,000. The ETL process must identify these rows and set the Audit flag value
accordingly.

A clustered index has been created on the table. In addition to improving query performance, a clustered
index is required for tables that will be loaded using Sqoop.

The ETL Tasks
The transformations that must be applied to the data as it flows through the ETL process can be performed by
loading the data into a sequence of Hive tables, and using HiveQL queries to filter and transform the data from
one table as it is loaded into the next. To use this approach the XML data must be parsed and loaded into the
first Hive table in the sequence, and then a Sqoop command can be used at the end of the process to transfer
the data in the final table into the Windows Azure SQL Database table. The sequence of tasks for the ETL
workflow is shown in Figure 1.

Figure 1
An overview of the Northwind ETL process
Parsing the XML Source Data
The first task is to parse the XML source data so that it can be exposed for processing as a Hive table. The
developers at Northwind Traders considered how this could be accomplished. The source data files contain a
series of <transaction> elements within a <transactions> root element. However, the files are typically too large
for a standard XML parser to read into memory.
This problem could be overcome by using a streaming XML parser, or by reading the document in chunks as
partial XML segments, and attempting to parse each chunk. However, this means that a single error in the file,
or a chunk boundary that does not fall between <transaction> elements, may cause loss of data and
unpredictable results. It also adds complexity to the solution.
The developers decided that a better approach would be to split the incoming data into separate valid XML
documents, one for each transaction, where each document contains a single <transaction> element as the root
element. Each of these documents can then be parsed using a standard XML parser, and the code can handle
any that are invalid without affecting the processing of subsequent transactions.
To split the source files into individual transaction documents the developers created a custom input formatter,
written using Java. This input formatter can then be used in a range of ways, depending on future requirements.
For example, it could be called from a user-defined function that is used in a Hive or a Pig query, or it could be
used in custom map components. It could also be used directly in a Hive query that uses XPath to parse the
contents of the XML documents.
After considering the options the developers decided that, in order to provide greater flexibility and
extensibility, they would create custom map and reduce components written in Java. The mapper component
uses the custom input formatter to split the file into individual transaction records (each a valid XML document),
and then uses an XML parser to extract the values of the child elements in each separate transaction document.
The extracted values are written out to a file in tab-delimited format.
Figure 2 shows an overview of the stages of parsing the XML document. The subsequent sections of this chapter
describe each stage of this process.


Figure 2
Parsing the XML into a Hive table using a custom input formatter and mapper

When you are designing custom components for map/reduce jobs its worth considering whether they
might be reusable in similar scenarios, and as you refine your data processes. Of course, you should balance
the cost and effort required to implement reusability with the possible gains it might provide.
The final stage for the current scenario, as shown in Figure 2, is to define a Hive table over the collection of tab-
delimited files, and execute queries against this table to process the data as required. This approach means that
the developers can use the tab-delimited file for other purposes if required, including defining different Hive
schemas over it or processing it using Pig or another query mechanism.
The Custom Input Formatter
The custom input formatter splits the transaction data into records based on the <transaction> elements in the
XML source files. It outputs a series of records, one for each <transaction> element. With the exception of the
initial record and records where there is an error in the source data, each of these records is a valid XML
document with the <transaction> root element, and contains the XML child elements and values in the source.

Input formatters do not create new documents or files, but simply instruct the mapper on
how to read the input file. The custom input formatter used in this example tells a
map/reduce job how to read an XML document, in the same way that the default text
input formatter tells any map/reduce job how to read a text file.
The custom input formatter is implemented as a class named XMLInputFormat that extends
FileInputFormat<LongWritable, Text> and overrides the createRecordReader method to return an instance of
the custom XmlRecordReader class.
Java
package northwindETL;
...
public class XMLInputFormat extends FileInputFormat<LongWritable, Text>
{

@Override
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit inputSplit, TaskAttemptContext context)
{
return new XmlRecordReader();
}
The custom XmlRecordReader class extends RecordReader<LongWritable, Text>, and overrides its members to
generate the required output from the source data. The XmlRecordReader class reads the source data and splits
it into sections, then outputs each section as a separate record.
Java
public static class XmlRecordReader extends RecordReader<LongWritable, Text>
{
FileSplit fSplit;
byte[] delimiter;
long start;
long length;
long counter = 0;
FSDataInputStream inputStream;
DataOutputBuffer buffer = new DataOutputBuffer();
LongWritable key = new LongWritable();
Text value = new Text();
...
The override of the initialize method obtains a FileSplit reference for the input split, gets the file system from the
configuration, opens the file, and positions the file pointer at the start of the split.
Java
@Override
public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException
{
String endTag = "</transaction>";
delimiter = endTag.getBytes("UTF8");
fSplit = (FileSplit) split;

// Get position of the first byte in the file to process
start = fSplit.getStart();

// Get length
length = fSplit.getLength();

//Get the file containing split's data
Path file = fSplit.getPath();

FileSystem fs = file.getFileSystem(context.getConfiguration());
inputStream = fs.open(file);
inputStream.seek(start);
}
The override of the nextKeyValue method, which is executed when the RecordReader must generate the next
record, reads the source data and outputs the section of the source data up to and including the next
</transaction> tag.
Java
@Override
public boolean nextKeyValue() throws IOException, InterruptedException
{
int delimiterLength = delimiter.length;

while(counter < length)
{
int temp = inputStream.read();
counter++;
buffer.write(temp);
if (temp == delimiter[0])
{
for(int i=1; i < delimiterLength; i++)
{
temp = inputStream.read();
counter++;
buffer.write(temp);
if(temp != delimiter[i])
{
break;
}
if (i == delimiterLength -1)
{
key.set(inputStream.getPos());
value.set(buffer.getData(), 0, buffer.getLength());
buffer.reset();
return true;
}
}
}
}
return false;
}

Notice that, for simplicity in this example, the input formatter actually just divides the
input into records that end with the closing </transaction> tag. This means that the first
record will start with the initial <transactions> root element tag. You will see later how
this is handled in the mapper component.
The code also overrides several other methods including getCurrentKey and getCurrentValue to get a key and
value from the data, getProgress to indicate how much of the data has been processed so far, and close to close
the input file.
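These remaining overrides are largely boilerplate. A minimal sketch of how they might be implemented, using the fields declared earlier in the XmlRecordReader class, is shown below (the exact code is included in the Hands-on Labs).
Java
@Override
public LongWritable getCurrentKey() throws IOException, InterruptedException
{
    // Return the key set by nextKeyValue (the current position in the file)
    return key;
}

@Override
public Text getCurrentValue() throws IOException, InterruptedException
{
    // Return the XML fragment for the current transaction record
    return value;
}

@Override
public float getProgress() throws IOException, InterruptedException
{
    // Report progress as the proportion of the split read so far
    return length == 0 ? 1.0f : (float) counter / (float) length;
}

@Override
public void close() throws IOException
{
    inputStream.close();
}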

The complete source code for the custom input formatter, and for the other Java components described in this
chapter, is included in the Hands-on Labs associated with this guide. To obtain the labs go to
http://wag.codeplex.com/releases/view/103405. The files are in the folder named Lab 2 - Extract Transform
Load\Labfiles\CustomCode\src\northwindETL.

The Custom Mapper Component
The custom input formatter converts the source file into a series of records, each of which should be a valid XML
document with a <transaction> root element. This allows the mapper component to use an XML parser to read
each document in turn and generate the output required. In this scenario, the output is tab-delimited text
records, each of which contains the values from one <transaction> document.
The custom ParseTransactionMapper class extends the Hadoop Mapper class and implements the map method
that is called by Hadoop to generate the output from the map stage. The map method creates a DocumentBuilder
and an InputSource (backed by a StringReader over the input XML), parses the document into a new Document
instance, and instantiates a StringBuilder to hold the output that it will generate.
Notice how the component strips any extraneous leading text from the record before parsing it. This includes
the initial <transactions> root element tag, and any text that may remain from the previous record if there was
an error in the data.
Java
package northwindETL;
...
public class ParseTransactionMapper extends Mapper<Object, Text, Text, Text>
{
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException
{
// Remove any extraneous leading text
String data = value.toString();
int i = data.indexOf("<transaction>");
String xml = data.substring(i, data.length());

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);

try
{
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource();
is.setCharacterStream(new StringReader(xml));

try
{
Document doc = db.parse(is);
StringBuilder sbk = new StringBuilder();
...
The map method then walks through the document extracting the values based on the known schema, and adds
each value to the StringBuilder, separated by a tab character. The source data includes records that may contain only
some of the values; for example, a cash sale does not include an account number or a credit card number.
Where these optional values are not present the code simply outputs a tab character. This means that all the
output rows have the same number of columns, with empty values for the missing values in the input records.

Our custom input formatter and mapper components are specific to the format of the source data
because the differing schemas of each transaction type make it much more difficult and costly to create
configurable, general purpose components.
The code also performs some translations on the data. For example, after extracting the location code for the
transaction, the code splits the date and time into separate columns, as shown here.
Java
...
sbk.append(extractLocation(doc));
sbk.append('\t');
Date date;
try
{
date = new SimpleDateFormat("M/d/yyyy h:mm:ss a")
.parse(extractDate(doc));

sbk.append(new SimpleDateFormat("yyyy-MM-dd").format(date));
sbk.append('\t');
sbk.append(new SimpleDateFormat("kk:mm:ss").format(date));
}
catch (ParseException e)
{
sbk.append("");
sbk.append('\t');
sbk.append("");
}
...

The code uses methods named extractLocation and extractDate to extract the specific items from the XML
document. These methods, together with similar methods to extract other values, are located elsewhere in the
source code file.
To build the remaining part of each output record the code uses another StringBuilder, to which it adds the
remaining columns for the record. Finally, it outputs the contents of the two StringBuilder instances to the
context.
Java
...
StringBuilder sbv = new StringBuilder();
sbv.append(extractTransactionType(doc));
sbv.append('\t');
sbv.append(extractAccount(doc));
sbv.append('\t');
sbv.append(extractAccountLocation(doc));
sbv.append('\t');
sbv.append(extractCardNumber(doc));
sbv.append('\t');
sbv.append(extractAmount(doc));
sbv.append('\t');
sbv.append(extractDescription(doc));

context.write(new Text(sbk.toString()), new Text(sbv.toString()));
The remainder of the code is concerned with error handling, and the definition of the methods that extract
specific values from the source document. You can examine the entire source file in the Hands-on Labs
associated with this guide.
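As an illustration, a simple value-extraction helper such as extractLocation might look similar to the following sketch, which returns the text content of a named element or an empty string when the element is not present. The helper name extractElementValue is hypothetical, the sketch assumes the org.w3c.dom.NodeList import, and methods such as extractTransactionType contain additional logic to determine which transaction element is present; the actual implementations are in the lab files.
Java
// Hypothetical helper: return the text of the named element, or an empty
// string if the element does not exist in this transaction document.
private String extractElementValue(Document doc, String elementName)
{
    NodeList nodes = doc.getElementsByTagName(elementName);
    if (nodes.getLength() == 0)
    {
        return "";
    }
    return nodes.item(0).getTextContent();
}

private String extractLocation(Document doc)
{
    return extractElementValue(doc, "location");
}

private String extractDate(Document doc)
{
    return extractElementValue(doc, "date");
}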
Executing the Map/Reduce Job
The following code is used to execute the map/reduce job and generate the delimited text version of the
transaction data. Notice that it specifies the custom ParseTransactionMapper as the mapper class for the job,
and the custom XMLInputFormat class as the input formatter for the job.
Java
package northwindETL;
...
import northwindETL.XMLInputFormat;

public class XMLProcessing
{
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

if (otherArgs.length != 2)
{
System.err.println("Usage: xmlprocessing <in> <out>");
System.exit(2);
}

Job job = new Job(conf, "XML Processing");
job.setJarByClass(XMLProcessing.class);
job.setMapperClass(ParseTransactionMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(XMLInputFormat.class);
XMLInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
The map/reduce code is packaged as a .jar file and is executed on the cluster.
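For example, assuming the compiled classes are packaged into a .jar file named northwindETL.jar (a hypothetical name and location), the job could be started with a command such as the following, using the source and output folders that appear later in the job.properties file.
Command Line
hadoop jar C:\Scripts\northwindETL.jar northwindETL.XMLProcessing /etl/source /etl/parseddata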
The Hive Statement
After the tab-delimited text file has been created by the map/reduce job, the following HiveQL statement can be
used to define a Hive table over the results. Where there are empty values in the records in the text file, the
associated columns in the row will contain a null value.
HiveQL
CREATE EXTERNAL TABLE TransactionRaw
(
TransactionLocation STRING,
TransactionDate STRING,
TransactionTime STRING,
TransactionType STRING,
Account STRING,
AccountLocation STRING,
CardNumber STRING,
Amount DOUBLE,
Description STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED by '9'
STORED AS TEXTFILE
LOCATION '/etl/parseddata';
Normalizing the Store Codes
The location IDs in the source data are specific to the geographical region where the transaction occurred. The
location ID consists of an encoded value that indicates the city-specific store number, the city, and the country
or region where the store is located. For example, the join logic shown later in this section reads the location
code 1AREPE from the sample data as local store number 1, city code ARE, and country or region code PE.
Globally unique IDs for each store are maintained in a Hive table, with the city and the country or region details
stored in two more tables. These lookup tables are based on the following HiveQL schema definitions.
HiveQL
CREATE EXTERNAL TABLE Store
(Id INT,
CityId INT,
LocalId INT,
LocalCode STRING,
Description STRING,
OpeningTime STRING,
ClosingTime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED by '9'
STORED AS TEXTFILE
LOCATION '/etl/lookup/store/';

CREATE EXTERNAL TABLE City
(Id INT,
CountryRegionId INT,
LocalId INT,
LocalCode STRING,
Description STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED by '9'
STORED AS TEXTFILE
LOCATION '/etl/lookup/city/';

CREATE EXTERNAL TABLE CountryRegion
(Id INT,
Code STRING,
Description STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED by '9'
STORED AS TEXTFILE
LOCATION '/etl/lookup/countryregion/';

A requirement to perform some transformation on values in the source data before storing it in your BI
system is typical of most ETL processes, especially where the source data comes from a legacy system or a system
that was not designed to support the data interchange process you require. In our case, the store identifiers in
the source data have a completely different format to those in our BI system, and must be transformed using
lookup tables.
The ETL workflow must look up the correct unique store ID for each transaction and load the resulting data into
a staging table that is based on the following HiveQL schema definition.

HiveQL
CREATE TABLE TransactionStaging
(
TransactionLocation STRING,
TransactionLocationId INT,
TransactionDate STRING,
TransactionTime STRING,
TransactionType STRING,
Account STRING,
AccountLocation STRING,
AccountLocationId INT,
CardNumber STRING,
Amount DOUBLE,
Description STRING
)
LOCATION '/etl/hive/transactionstaging/';

To accomplish this the following HiveQL code is used to query the TransactionRaw table and insert the results
into the TransactionStaging table.

HiveQL
SET hive.auto.convert.join = true;

INSERT INTO TABLE TransactionStaging
SELECT TR.TransactionLocation,
ST.Id,
TR.TransactionDate,
TR.TransactionTime,
TR.TransactionType,
TR.Account,
TR.AccountLocation,
ST2.Id,
TR.CardNumber,
TR.Amount,
TR.Description
FROM TransactionRaw TR
LEFT OUTER JOIN CountryRegion CR
ON substr(TR.TransactionLocation,length(TR.TransactionLocation)-1,2) = CR.Code
LEFT OUTER JOIN city CT
ON substr(TR.TransactionLocation,length(TR.TransactionLocation)-4,3) = CT.LocalCode
AND CR.Id = CT.CountryRegionId
LEFT OUTER JOIN store ST
ON substr(TR.TransactionLocation,1,length(TR.TransactionLocation)-5) = ST.LocalCode
AND CT.Id = ST.CityId
LEFT OUTER JOIN CountryRegion CR2
ON substr(TR.AccountLocation,length(TR.AccountLocation)-1,2) = CR2.Code
LEFT OUTER JOIN city CT2
ON substr(TR.AccountLocation,length(TR.AccountLocation)-4,3) = CT2.LocalCode
AND CR2.Id = CT2.CountryRegionId
LEFT OUTER JOIN store ST2
ON substr(TR.AccountLocation,1,length(TR.AccountLocation)-5) = ST2.LocalCode
AND CT2.Id = ST2.CityId;

Redirecting Invalid Rows
A transaction is considered invalid if any of the following is true:
The transaction location ID is null.
The transaction date is empty.
The transaction time is empty.
The account location is not empty, but the account location ID is null.

Rows in which any of these validation rules fails must be redirected to a separate table for investigation later.
The table for invalid rows is based on the following HiveQL schema definition.
HiveQL
CREATE TABLE TransactionInvalid
(
TransactionLocation STRING,
TransactionDate STRING,
TransactionTime STRING,
TransactionType STRING,
Account STRING,
AccountLocation STRING,
CardNumber STRING,
Amount DOUBLE,
Description STRING
)
LOCATION '/etl/hive/transactioninvalid/';
To identify invalid rows and load them into the TransactionInvalid table, the ETL process uses the following
HiveQL query.
HiveQL
INSERT INTO TABLE TransactionInvalid
SELECT TS.TransactionLocation,
TS.TransactionDate,
TS.TransactionTime,
TS.TransactionType,
TS.Account,
TS.AccountLocation,
TS.CardNumber,
TS.Amount,
TS.Description
FROM TransactionStaging TS
WHERE TS.TransactionLocationId IS NULL
OR TS.TransactionDate = ''
OR TS.TransactionTime = ''
OR (TS.AccountLocation <> '' AND TS.AccountLocationId IS NULL);
Setting the Audit Flag
The final transformation that the ETL process must perform before loading the data into Windows Azure SQL
Database is to set the Audit flag. To accomplish this, a Hive table containing an Audit column was created using
the following HiveQL code.
HiveQL
CREATE TABLE TransactionAudit
(
TransactionLocationId INT,
TransactionDate STRING,
TransactionTime STRING,
TransactionType STRING,
Account STRING,
AccountLocationId INT,
CardNumber STRING,
Amount DOUBLE,
Description STRING,
Audit TINYINT
)
LOCATION '/etl/hive/transactionaudit/';
Valid rows are then loaded into this table from the TransactionStaging table by executing the following HiveQL
query.
HiveQL
SET hive.auto.convert.join = true;

INSERT INTO TABLE TransactionAudit
SELECT TS.TransactionLocationId,
TS.TransactionDate,
TS.TransactionTime,
TS.TransactionType,
TS.Account,
TS.AccountLocationId,
TS.CardNumber,
TS.Amount,
TS.Description,
IF(Amount > 10000
OR TS.TransactionTime < ST.OpeningTime
OR TS.TransactionTime > ST.ClosingTime,
1, 0)
FROM TransactionStaging TS
INNER JOIN Store ST
ON TS.TransactionLocationId = ST.Id
WHERE TS.TransactionLocationId IS NOT NULL
AND TS.TransactionDate <> ''
AND TS.TransactionTime <> ''
AND (TS.AccountLocation = '' OR TS.AccountLocationId IS NOT NULL);
We need to identify several types of transactions for further attention. Transactions that take place
outside store hours are obviously suspicious, and we also want to monitor those with a total value over a
specified amount. These transactions can be marked with an audit flag in the database, allowing managers to
easily select them for investigation. However, transactions that are invalid because they have an incorrect store
code, or are missing data such as the time and date, cannot be stored in the database because this would violate
the schema and table constraints. Therefore we redirect these to a separate Hive table instead.

Loading the Data into Windows Azure SQL Database
After the data has been validated and transformed, the ETL process loads it from the folder on which the
TransactionAudit table is based into the StoreTransaction table in Windows Azure SQL Database by using the
following Sqoop command.
Command Line
Sqoop export --connect "jdbc:sqlserver://abcd1234.database.windows.net:1433;
database=NWStoreTransactions;username=NWLogin@abcd1234;
password=Pa$$w0rd;logintimeout=30;"
--table StoreTransaction
--export-dir /etl/hive/transactionaudit
--input-fields-terminated-by \0001
--input-null-non-string \\N
Encapsulating the ETL Tasks in an Oozie Workflow
Having defined the individual tasks that make up the ETL process, the administrators at Northwind Traders
decided to encapsulate the ETL process in an Oozie workflow. By using Oozie, the process can be parameterized
and automated; making it easier to reuse on a regular basis.
Figure 3 (below) shows an overview of the Oozie workflow that Northwind will implement. Notice that a core
requirement, in addition to each of the processing and query tasks, is to be able to define a workflow that will
execute these tasks in the correct order and, where appropriate, wait until each one completes before starting
the next one.
You can see that the workflow contains a series of tasks that are processed in turn. The majority of the tasks use
data generated by the previous task, and so must complete before the next one starts. However, one of the
tasks, HiveCreateTablesFork, contains four individual tasks to create the Hive tables required by the ETL
process. These four tasks can execute in parallel because they have no interdependencies. The only requirement
is that all of them must complete before the next task, HiveAuditStoreHoursAndAmount, can begin. You'll see
how the workflow for these tasks is defined later in this chapter.
If any of the tasks should fail, the workflow executes the kill task. This generates a message containing details
of the error, abandons any subsequent tasks, and halts the workflow. As long as there are no errors, the
workflow ends when the final task that loads the data into SQL Database has completed.


Figure 3
The Oozie workflow for the Northwind ETL process
Defining the Workflow
The Oozie workflow, shown below, is defined in a Hadoop Process Definition Language (hPDL) file named
workflow.xml. This file contains a <start> and an <end> element that define where the workflow process
starts and ends, and a <kill> element that defines the action to take when prematurely halting the workflow if
an error occurs. The remainder of the workflow definition consists of a series of actions that Oozie will execute
when processing the workflow.
hPDL
<workflow-app xmlns="uri:oozie:workflow:0.2" name="ETLWorkflow">

<start to="ParseAction"/>

<action name='ParseAction'>
...
</action>

<fork name="HiveCreateTablesFork">
<path start="HiveCreateRawTable" />
<path start="HiveCreateStagingTable" />
<path start="HiveCreateInvalidTable" />
<path start="HiveCreateAuditTable" />
</fork>

<action name="HiveCreateRawTable">
...
</action>

<action name="HiveCreateStagingTable">
...
</action>

<action name="HiveCreateInvalidTable">
...
</action>

<action name="HiveCreateAuditTable">
...
</action>

<join name="HiveCreateTablesJoin" to="HiveNormalizeStoreCodes" />

<action name="HiveNormalizeStoreCodes">
...
</action>

<action name="HiveFilterInvalid">
...
</action>

<action name="HiveAuditStoreHoursAndAmount">
...
</action>

<action name="SqoopTransferData">
...
</action>

<kill name="fail">
<message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]
</message>
</kill>

<end name="end"/>

</workflow-app>
Notice the <fork> element, which specifies four actions that can be performed in parallel instead of waiting for
each one to finish before starting the next. The <join> element a little further down the file indicates the point
where all previous tasks must be finished before continuing with the next one. The <join> element specifies that
the next task after the <fork> has completed execution is the HiveNormalizeStoreCodes action. It, and the
subsequent three tasks, will execute in turn with each one starting only when the previous one has finished.
The previous listing omitted the details of each <action> to make it easier to see the structure of the workflow
definition file. We'll look in more detail at some of the individual actions next.
The complete workflow.xml file and the other files described in this chapter, such as the Hive scripts, are
included in the Hands-on Labs associated with this guide. To obtain the labs go to
http://wag.codeplex.com/releases/view/103405. The files are in the folder named Lab 2 - Extract Transform
Load\Labfiles\OozieWorkflow.
Executing the Map/Reduce Code to Parse the Source Data
The <start> element in the workflow.xml file specifies that the first action to execute is the one named
ParseAction. This action defines the map/reduce job described earlier in this chapter. The definition of the
action is shown below. Notice that it specifies all of the parameters required to run the job, including details of
the job tracker and the name node, the Java classes to use as the input formatter and the mapper, the name of
the root node in the XML source data (required by the input formatter), and the type and location of the input
and output files.
hPDL
<action name='ParseAction'>
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${outputAsvFullPath}"/>
</prepare>
<configuration>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>northwindETL.ParseTransactionMapper</value>
</property>
<property>
<name>mapreduce.inputformat.class</name>
<value>northwindETL.MultiXmlElementInputFormat</value>
</property>
<property>
<name>xmlinput.element</name>
<value>transaction</value>
</property>
<property>
<name>mapred.output.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapred.output.value.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>${inputDir}</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>${outputDir}</value>
</property>
</configuration>
</map-reduce>
<ok to='HiveCreateTablesFork'/>
<error to='fail'/>
</action>

Several of the values used by this workflow action are parameters that are set in the job configuration, and are
populated when the workflow executes. The syntax ${...} denotes a parameter that is populated at runtime.
You'll see how the values are obtained from the job configuration later in this chapter.
At the end of every action in the workflow is an <ok> and an <error> element. These define the name of the
action to execute next, and the name of the <kill> action to execute if there is an error.
The actions in a workflow definition file do not need to be placed in the order they will be executed,
because each action defines the one to execute next. However, it makes it easier to understand and debug the
workflow if they are defined in the same order as the execution plan!

Creating the Hive Tables
The four actions in the workflow that create the Hive tables are fundamentally similar; the only differences are
the name of the Hive script to execute, the name and location of the table to be created, and the name of the next
action to execute. The following extract from the workflow.xml file shows the definition of the
HiveCreateRawTable action.
hPDL
<action name="HiveCreateRawTable">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
</configuration>
<script>HiveCreateRawTable.q</script>
<param>TABLE_NAME=${outputTable}</param>
<param>LOCATION=${outputDir}</param>
</hive>
<ok to="HiveCreateTablesJoin"/>
<error to="fail"/>
</action>
You can see the name of the Hive script in the <script> element near the end of the definition. Below it are two
parameters that specify the table name and the location. As in the ParseAction definition shown earlier, the
values for these parameters are obtained from the job configuration using the ${...} syntax. The difference here
is that the values from configuration are themselves parameters, which are passed to the Hive scripts at
runtime.
The following table shows the values of the script name, table name, and table location used in the three other
actions that create Hive tables.

Action Script name Table name Location
HiveCreateStagingTable HiveCreateStagingTable.q ${stagingTable} ${stagingLocation}
HiveCreateInvalidTable HiveCreateInvalidTable.q ${invalidTable} ${invalidLocation}
HiveCreateAuditTable HiveCreateAuditTable.q ${auditTable} ${auditLocation}
Executing the Hive Queries
There are three actions that execute Hive queries to perform the transformations on the data and populate the
Hive tables. The first action, HiveNormalizeStoreCodes, is shown here. Again, the <script> element specifies the
name of the script to execute, and the <param> elements specify values to be passed to the Hive script at
runtime.
hPDL
<action name="HiveNormalizeStoreCodes">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
</configuration>
<script>HiveNormalizeStoreCodes.q</script>
<param>RAW_TABLE_NAME=${outputTable}</param>
<param>COUNTRY_REGION_TABLE_NAME=${countryRegionTable}</param>
<param>CITY_TABLE_NAME=${cityTable}</param>
<param>STORE_TABLE_NAME=${storeTable}</param>
<param>STAGING_TABLE_NAME=${stagingTable}</param>
</hive>
<ok to="HiveFilterInvalid"/>
<error to="fail"/>
</action>
The two subsequent actions that execute Hive scripts to populate the tables are HiveFilterInvalid and
HiveAuditStoreHoursAndAmount. The HiveFilterInvalid action contains the following definition of the script
name and parameters.
hPDL
<script>HiveFilterInvalid.q</script>
<param>STORE_TABLE_NAME=${storeTable}</param>
<param>STAGING_TABLE_NAME=${stagingTable}</param>
<param>INVALID_TABLE_NAME=${invalidTable}</param>
The HiveAuditStoreHoursAndAmount action contains the following definition of the script name and
parameters.
hPDL
<script>HiveAuditStoreHoursAndAmount.q</script>
<param>STORE_TABLE_NAME=${storeTable}</param>
<param>STAGING_TABLE_NAME=${stagingTable}</param>
<param>INVALID_TABLE_NAME=${invalidTable}</param>
<param>AUDIT_TABLE_NAME=${auditTable}</param>
Loading the SQL Database Table
After the HiveAuditStoreHoursAndAmount action has completed, the next to execute is the SqoopTransferData
action. This action uses Sqoop to load the data into the Windows Azure SQL Database table. The complete
definition of this action is shown here.
hPDL
<action name="SqoopTransferData">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<arg>export</arg>
<arg>--connect</arg>
<arg>${connectionString}</arg>
<arg>--table</arg>
<arg>${targetSqlTable}</arg>
<arg>--export-dir</arg>
<arg>${auditLocation}</arg>
<arg>--input-fields-terminated-by</arg>
<arg>\0001</arg>
<arg>--input-null-non-string</arg>
<arg>\\N</arg>
</sqoop>
<ok to="end"/>
<error to="fail"/>
</action>
You can see that it also uses several parameters that are set from configuration to specify the connection string
of the database, the target table name, and the location of the file generated by the final Hive query in the
workflow.

Parameterizing the Workflow Tasks
Each of the tasks that require the execution of a HiveQL query references a file with the extension .q; for
example, the HiveCreateRawTable task references HiveCreateRawTable.q. These files contain parameterized
versions of the HiveQL statements described previously. For example, HiveCreateRawTable.q contains the
following code.
HiveQL
DROP TABLE IF EXISTS ${TABLE_NAME};

CREATE EXTERNAL TABLE ${TABLE_NAME}
( TransactionLocation STRING
, TransactionDate STRING
, TransactionTime STRING
, TransactionType STRING
, Account STRING
, AccountLocation STRING
, CardNumber STRING
, Amount DOUBLE
, Description STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED by '9'
STORED AS TEXTFILE
LOCATION '${LOCATION}';

The parameter values for each task are specified as variables in workflow.xml, and they are set in a
job.properties file as shown in the following example.
job.properties file
nameNode=asv://nwstore@nwstore.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/etl/oozieworkflow/
outputAsvFullPath=asv://nwstore@nwstore.blob.core.windows.net/etl/parseddata/
inputDir=/etl/source/
outputDir=/etl/parseddata/
outputTable=TransactionRaw
stagingTable=TransactionStaging
stagingLocation=/etl/hive/transactionstaging/
invalidTable=TransactionInvalid
invalidLocation=/etl/hive/transactioninvalid/
auditTable=TransactionAudit
auditLocation=/etl/hive/transactionaudit/
countryRegionTable=CountryRegion
cityTable=City
storeTable=Store
connectionString=jdbc:sqlserver://abcd1234.database.windows.net:1433;database=NWStoreTransactions;user=NWLogin@abcd1234;password=Pa$$w0rd;encrypt=true;trustServerCertificate=true;loginTimeout=30;
targetSqlTable=StoreTransaction
The ability to abstract settings in a separate file makes the ETL workflow more flexible. It can be easily adapted
to handle future changes in the environment, such as a requirement to use alternative HDFS folder locations or
a different Windows Azure SQL Database instance.
Executing the Oozie Workflow
To execute the Oozie workflow the administrators upload the workflow files to HDFS and the job.properties file
to the local file system on the HDInsight cluster, and then run the following command on the cluster.
Command Line
oozie job -oozie http://localhost:11000/oozie/
-config c:\ETL\job.properties
-run
When the Oozie job starts, the command line interface displays the unique ID assigned to the job. The
administrators can then view the progress of the job by using the browser on the HDInsight cluster to display job
status at http://localhost:11000/oozie/v0/job/the_unique_job_id?show=log.
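The administrators can also check the status of the job from the command line by using the Oozie client's -info option, as in the following sketch (substitute the unique job ID returned when the job was submitted).
Command Line
oozie job -oozie http://localhost:11000/oozie/ -info the_unique_job_id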
Automating the ETL Workflow with the Microsoft .NET SDK for Hadoop
Using an Oozie workflow simplifies ETL execution and makes it easier to repeat the process. However, the
administrators must interactively upload the workflow files and use a Remote Desktop connection to the
HDInsight cluster to run the workflow each time. To further automate the ETL process, a developer was tasked
with creating an ETL process management utility that can be run from an on-premises client computer.
Importing the .NET SDK Libraries
To create the ETL management application, the developer created a new Microsoft Visual C# console application
and used the package manager to import the Microsoft.Hadoop.WebClient library from NuGet.
PowerShell
install-package Microsoft.Hadoop.WebClient
This library includes the Microsoft.Hadoop.WebHDFS and Microsoft.Hadoop.WebClient.OozieClient
namespaces, which contain the classes that will be used to upload the workflow files and execute the Oozie job
remotely.
Defining the Application Settings
To provide flexibility, the Oozie workflow was designed to use variables and parameters. To accommodate this
flexibility in the management application, the developer added the following entries to the appSettings section
of the App.config file.
XML
<appSettings>
<add key="ClusterURI" value="https://nwstore.azurehdinsight.net:563" />
<add key="HadoopUser" value="NWuser" />
<add key="HadoopPassword" value="!Pa$$w0rd!" />
<add key="StorageKey" value="abcdefghijklmn123456789=" />
<add key="ContainerName" value="nwstore" />
<add key="StorageName" value="nwstore" />
<add key="NameNodeHost" value="asv://nwstore@nwstore.blob.core.windows.net" />
<add key="WorkflowDir" value="/etl/workflow/" />
<add key="InputDir" value="/etl/source/" />
<add key="OutputDir" value="/etl/parseddata/" />
<add key="RawTableName" value="transactionraw" />
<add key="StagingTableName" value="transactionstaging" />
<add key="StagingLocation" value="/etl/hive/transactionstaging/" />
<add key="InvalidTableName" value="transactioninvalid" />
<add key="InvalidLocation" value="/etl/hive/transactioninvalid/" />
<add key="AuditTableName" value="transactionaudit" />
<add key="AuditLocation" value="/etl/hive/transactionaudit/" />
<add key="CountryRegionTable" value="countryregion" />
<add key="CityTable" value="city" />
<add key="StoreTable" value="store" />
<add key="SqlConnectionString" value=
"jdbc:sqlserver://abcd1234.database.windows.net:1433;
database=NWStoreTransactions;user=NWLogin@abcd1234;password=Pa$$w0rd;
encrypt=true;trustServerCertificate=true;loginTimeout=30;" />
<add key="TargetSqlTable" value="StoreTransaction" />
</appSettings>
These settings include the properties that were previously set in the job.properties file, as well as the settings
required to connect to the HDInsight cluster and the Windows Azure blob storage it uses.
Creating the Application Code
The code for the application starts by implementing the Main method, in which it asynchronously executes a
private method named UploadWorkflowFilesAndExecuteOozieJob. This method executes two other methods
named UploadWorkflowFiles and CreateAndExecuteOozieJob, as shown in the following sample.
C#
namespace ETLWorkflow
{
...
public class Program
{
internal static void Main(string[] args)
{
UploadWorkflowFilesAndExecuteOozieJob().Wait();
}

private static async Task UploadWorkflowFilesAndExecuteOozieJob()
{
await UploadWorkflowFiles();
await CreateAndExecuteOozieJob();
Console.WriteLine("Press any key to continue...");
Console.ReadKey();
}
Uploading the Workflow Files
As the name suggests, the UploadWorkflowFiles method uploads the files required to execute the workflow to
HDInsight. When the application is compiled, the OozieJob folder (which contains the workflow files and the lib subfolder containing the NorthwindETL.jar map/reduce component) is located in the same folder as the executable.
The method creates an instance of the WebHDFSClient class from the .NET SDK for Hadoop using values loaded
from the configuration file, and uses the new instance to delete any existing workflow folder and contents
before uploading all of the files located in the local workflow folder.
C#
private static async Task UploadWorkflowFiles()
{
try
{
var workflowLocalDir = new DirectoryInfo(@".\OozieJob");

var hadoopUser = ConfigurationManager.AppSettings["HadoopUser"];
var storageKey = ConfigurationManager.AppSettings["StorageKey"];
var storageName = ConfigurationManager.AppSettings["StorageName"];
var containerName = ConfigurationManager.AppSettings["ContainerName"];
var workflowDir = ConfigurationManager.AppSettings["WorkflowDir"];

var hdfsClient = new WebHDFSClient(hadoopUser,
new BlobStorageAdapter(storageName, storageKey, containerName, false));

await hdfsClient.DeleteDirectory(workflowDir);

foreach (var file in workflowLocalDir.GetFiles())
{
await hdfsClient.CreateFile(file.FullName, workflowDir + file.Name);
}

foreach (var dir in workflowLocalDir.GetDirectories())
{
await hdfsClient.CreateDirectory(workflowDir + dir.Name);

foreach (var file in dir.GetFiles())
{
await hdfsClient.CreateFile(file.FullName, workflowDir +
dir.Name + "/" + file.Name);
}
}
}
catch (Exception)
{
throw; // rethrow, preserving the original stack trace
}
}
Creating and Executing the Oozie Job
Next, the CreateAndExecuteOozieJob method executes. This method reads all of the parameter values required
by the workflow from configuration, and uses some to create instances of the OozieHttpClient class and
OozieJobProperties class from the .NET SDK for Hadoop.
C#
private static async Task CreateAndExecuteOozieJob()
{
var hadoopUser = ConfigurationManager.AppSettings["HadoopUser"];
var hadoopPassword = ConfigurationManager.AppSettings["HadoopPassword"];
var nameNodeHost = ConfigurationManager.AppSettings["NameNodeHost"];
var workflowDir = ConfigurationManager.AppSettings["WorkflowDir"];
var inputDir = ConfigurationManager.AppSettings["InputDir"];
var outputDir = ConfigurationManager.AppSettings["OutputDir"];
var rawTableName = ConfigurationManager.AppSettings["RawTableName"];
var stagingTableName = ConfigurationManager.AppSettings["StagingTableName"];
var stagingLocation = ConfigurationManager.AppSettings["StagingLocation"];
var auditTableName = ConfigurationManager.AppSettings["AuditTableName"];
var auditLocation = ConfigurationManager.AppSettings["AuditLocation"];
var invalidTableName = ConfigurationManager.AppSettings["InvalidTableName"];
var invalidLocation = ConfigurationManager.AppSettings["InvalidLocation"];
var countryRegionTable = ConfigurationManager
.AppSettings["CountryRegionTable"];
var cityTable = ConfigurationManager.AppSettings["CityTable"];
var storeTable = ConfigurationManager.AppSettings["StoreTable"];
var sqlConnectionString = ConfigurationManager
.AppSettings["SqlConnectionString"];
var targetSqlTable = ConfigurationManager.AppSettings["TargetSqlTable"];
var clusterUri = new Uri(ConfigurationManager.AppSettings["ClusterURI"]);

var client = new OozieHttpClient(clusterUri, hadoopUser, hadoopPassword);

var prop = new OozieJobProperties(hadoopUser, nameNodeHost,
"jobtrackerhost:9010", workflowDir, inputDir, outputDir);
...

Next, the code populates the Oozie job properties, submits the job, and waits for Hadoop to acknowledge that
the job has been submitted. Then it uses a JsonSerializer instance to deserialize the result and display the job
ID, before executing the job by calling the StartJob method of the OozieHttpClient instance. This allows the user
to see the job ID that is executing, and monitor the job if required by searching for this ID in the Hadoop
administration website pages.
C#
...
var parameters = prop.ToDictionary();
parameters.Add("oozie.use.system.libpath", "true");
parameters.Add("outputTable", rawTableName);
parameters.Add("stagingTable", stagingTableName);
parameters.Add("stagingLocation", stagingLocation);
parameters.Add("invalidTable", invalidTableName);
parameters.Add("invalidLocation", invalidLocation);
parameters.Add("auditTable", auditTableName);
parameters.Add("auditLocation", auditLocation);
parameters.Add("cityTable", cityTable);
parameters.Add("countryRegionTable", countryRegionTable);
parameters.Add("storeTable", storeTable);
parameters.Add("targetSqlTable", targetSqlTable);
parameters.Add("outputAsvFullPath", nameNodeHost + outputDir);
parameters.Add("connectionString", sqlConnectionString);

var newJob = await client.SubmitJob(parameters);

var content = await newJob.Content.ReadAsStringAsync();

var serializer = new JsonSerializer();
dynamic json = serializer.Deserialize(
new JsonTextReader(new StringReader(content)));
string id = json.id;
Console.WriteLine(id);
await client.StartJob(id);
Console.WriteLine("Job started!");
}

By using this simple client application the administrators at Northwind Traders can initiate the ETL process from
an on-premises client computer with minimal effort. The program could even be scheduled for unattended
execution at a regular interval to create a fully automated ETL process for the daily transaction files.
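For example, the following sketch shows how the utility might be scheduled with the Windows Task Scheduler, assuming the compiled executable is named ETLWorkflow.exe and is deployed to the C:\ETL folder (both names are hypothetical).
Command Line
schtasks /Create /TN "NorthwindETL" /TR "C:\ETL\ETLWorkflow.exe" /SC DAILY /ST 02:00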

The complete source code for the application is included in the Hands-on Labs associated with this guide. To
obtain the labs go to http://wag.codeplex.com/releases/view/103405. The files are in the folder named Lab 2 -
Extract Transform Load\Labfiles\ETLWorkflow.

Summary
In this scenario you have seen how Northwind Traders implemented an ETL process that handles the transaction
data submitted by all of its stores each night. The data arrives as XML documents, and must pass through a
workflow that consists of several tasks. These tasks transform the data, validate it, generate the correct store
identifiers, detect fraudulent activity or transactions above a specific value, and then store the results in the
head office database.
You saw how Northwind Traders uses Oozie to implement the workflow. This workflow defines each task, and
executes these in turn. While Oozie is not the only solution for implementing a workflow with HDInsight, it is
integrated with Hadoop in a way that makes it easy to execute scripts and queries in a controllable and configurable manner. For an organization that does not use an on-premises database such as SQL Server and
associated technologies such as SQL Server Integration Services (SSIS), Oozie is typically a good choice for
building workflows for use with HDInsight.
As part of the scenario, you saw how the workflow can execute a custom map/reduce operation, create Hive
tables, join tables and select data, validate values in the data, and apply criteria to detect specific conditions
such as sales over a certain value. You also saw how you can use Sqoop to push the results into a Windows
Azure SQL Database instance. Finally, you saw how the classes in the Microsoft .NET SDK for Hadoop make it
possible to automate an Oozie workflow.
If you would like to work through the solution to this scenario yourself, you can download the files and a step-
by-step guide. These are included in the related Hands-on Labs available from
http://wag.codeplex.com/releases/view/103405.



Scenario 3: HDInsight as a Data Warehouse
In this scenario HDInsight is used as a data warehouse for big data analysis. Specifically, this scenario
demonstrates how to:
Create a data warehouse in an HDInsight cluster.
Load data into Hive tables in a data warehouse.
Optimize HiveQL queries.
Use an HDInsight-based data warehouse as a source for data analysis.

The scenario also discusses how you can minimize running costs by shutting down the cluster when it is not in use.
This chapter describes the scenario in general, and the solutions applied to meet the goals
of the scenario. If you would like to work through the solution to this scenario yourself
you can download the files and a step-by-step guide. These are included in the related
Hands-on Labs available from http://wag.codeplex.com/releases/view/103405.
Introduction to A. Datum
The scenario is based on a fictional company named A. Datum, which conducts research into tornadoes and
other weather-related phenomena in the United States. In this scenario, data analysts at A. Datum want to use HDInsight as a central repository for historical tornado data in order to analyze and visualize previous tornadoes,
and to try to identify trends in terms of geographical locations and times.
Analytical Goals and Data Sources
The data analysts at A. Datum have gathered historical data about tornadoes in the United States since 1934.
The data includes the date and time, geographical start and end point, category, size, number of fatalities and
casualties, and damage costs of each tornado. The goal of the data warehouse is to enable analysts to slice and
dice this data and visualize aggregated values on a map. This will act as a resource for further research into
severe weather patterns, and support further investigation in the future.
The data analysts also implemented a mechanism to continuously collect new data as it is published, and this is
added to the existing data by being uploaded to Windows Azure blob storage. However, as a research
organization, A. Datum does not monitor the data on a daily basis this is the job of the storm warning agencies
in the United States. Instead, A. Datums primary aim is to analyze the data only after significant tornado events
have occurred, or on a quarterly basis to provide updated results data for use in their scientific publications.
Therefore, to minimize running costs, the analysts want to be able to shut down the cluster when it is not being
used for analysis, and recreate it when required.
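A sketch of how the cluster might be released and later re-provisioned by using the Windows Azure PowerShell cmdlets for HDInsight is shown below. The cmdlet and parameter names are an assumption based on the HDInsight PowerShell tools and may differ in the release you are using; the cluster name, storage account details, and credentials are placeholders that match the rest of this scenario.
PowerShell
# Release the cluster when analysis is complete. The data remains in blob storage.
Remove-AzureHDInsightCluster -Name "adatum-cluster"

# Re-provision a cluster over the same storage account when analysis is required again.
$storageKey = "1234567890abcdefghijklmnopqrstuvwxyz="
New-AzureHDInsightCluster -Name "adatum-cluster" -Location "North Europe" `
  -DefaultStorageAccountName "adatum-storage.blob.core.windows.net" `
  -DefaultStorageAccountKey $storageKey `
  -DefaultStorageContainerName "adatum-cluster" `
  -ClusterSizeInNodes 4 -Credential (Get-Credential)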
Creating the Data Warehouse
A data warehouse is fundamentally a database that is optimized for queries that support data analysis and
reporting. Data warehouses are often implemented as relational databases with a schema that optimizes query
performance, and aggregation of important numerical measures, at the expense of some data duplication in
denormalized tables. When creating a data warehouse for Big Data you can use a relational database engine
that is designed to handle huge volumes of data (such as SQL Server Parallel Data Warehouse), or you can load
the data into a Hadoop cluster and use Hive tables to project a schema onto the data.
In this scenario the data analysts at A. Datum have decided to use a Hadoop cluster provided by the Windows
Azure HDInsight service. This enables them to build a Hive-based data warehouse schema that can be used as a
source for analysis and reporting in traditional tools such as Microsoft Excel, but which also will enable them to
apply Big Data analysis tools and techniques. For example, they can carry out iterative data exploration with Pig
and custom map/reduce code, and predictive analysis with Hadoop technologies such as Mahout.
Figure 1 shows a schematic view of the architecture when using HDInsight to implement a data warehouse.


Figure 1
Using HDInsight as a Data Warehouse for analysis, reporting, and as a business data source
Unlike a traditional relational database, HDInsight allows you to manage the lifetime and storage of tables and
indexes (metadata) separately from the data that populates the tables. As you saw in Chapter 3, a Hive table is
simply a definition that is applied over a folder containing data. This separation is what enables one of the
primary differences between Big Data solutions and relational databases. In a Big Data solution you apply a
schema when the data is read, rather than when it is written.
In this scenario you'll see how the capability to apply a schema on read provides an advantage for
organizations that need a data warehousing capability where data can be continuously collected, but analysis
and reporting is carried out only occasionally.
Creating a Database
When planning the HDInsight data warehouse, the data analysts at A. Datum needed to consider ways to ensure
that the data and Hive tables can be easily recreated in the event of the HDInsight cluster being released and re-
provisioned. This might happen for a number of reasons, including temporarily decommissioning the cluster to
save costs during periods of non-use, and releasing the cluster to create a new one with more nodes so that the data warehouse can be scaled out. The analysts identified two possible approaches:
Save a HiveQL script that can be used to recreate EXTERNAL tables based on HDFS folders where the
data is persisted in Windows Azure blob storage.
Use a custom Windows Azure SQL Database to host the Hive metadata store.

To test the approaches for recreating a cluster, the analysts provisioned an HDInsight
cluster and an instance of Windows Azure SQL Database, and used the steps described in
the section Retaining Data and Metadata for a Cluster in Chapter 7 of this guide to use
the custom SQL database instance to store Hive metadata instead of the default internal
store.

Using a HiveQL script to recreate tables after releasing and re-provisioning a cluster requires no configuration
effort, and is effective when the data warehouse will contain only a few tables and other objects. Hive tables can
be created as EXTERNAL in custom HDFS folder locations so that the data is not deleted in the event of the
cluster being released, and the saved script can be run to recreate the tables over the existing data when the
cluster is re-provisioned.
The benefit of using a custom SQL Database instance for the Hive metadata store is that, in the event of the
cluster being released, the Hive metastore is not deleted. When the cluster is re-provisioned the analysts can
update the settings in the hive-site.xml file to restore the connection to the custom metastore. Assuming the
tables were created as EXTERNAL tables, the data will remain in place in Windows Azure storage and the data
warehouse schema will be restored.
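The exact entries depend on the cluster and database configuration, but the connection to a custom metastore is defined through the standard Hive metastore properties in hive-site.xml. The following sketch shows the kind of settings involved; the server, database, and credential values are placeholders.
XML
<!-- Placeholder values; replace with the details of the custom metastore database. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:sqlserver://adatum1234.database.windows.net:1433;database=HiveMetastore;encrypt=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.microsoft.sqlserver.jdbc.SQLServerDriver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>MetastoreLogin@adatum1234</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>Pa$$w0rd</value>
</property>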
To ensure a logical separation from Hive tables used for other analytical processes, the data analysts decided to
create a dedicated database for the data warehouse by using the following HiveQL statement:
HiveQL
CREATE DATABASE DataWarehouse
LOCATION '/datawarehouse/database';
This statement creates an HDFS folder named /datawarehouse/database as the default folder for all objects
created in the DataWarehouse database.
Creating a logical database in HDInsight is a useful way to provide separation between the contents of the
database and other items located in the same cluster.
Creating Tables
The tornado data includes the code for the state where the tornado occurred, as well as the date and time of
the tornado. The data analysts want to be able to display the full state name in reports, and so created a table
for state names using the following HiveQL statement.
HiveQL
USE DataWarehouse;
CREATE EXTERNAL TABLE States (StateCode STRING, StateName STRING)
STORED AS SEQUENCEFILE;
Notice that the table is stored in the default location. For the DataWarehouse database this is the
/datawarehouse/database folder, and so this is where a new folder named States is created. The table is
formatted as a Sequence File. Tables in this format typically provide faster performance than tables in which
data is stored as text.

You must use EXTERNAL tables if you want the data to be persisted when you delete a table definition or
when you recreate a cluster. Storing the data in SEQUENCEFILE format is also a good idea as it can improve
performance.
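As an illustration of this behavior, the following sketch drops and then recreates the States table. Because the table is EXTERNAL, the data files in the /datawarehouse/database/States folder are left untouched, and re-running the original CREATE statement simply reapplies the schema over the existing data.
HiveQL
USE DataWarehouse;

-- Dropping an EXTERNAL table removes only the metadata; the data files remain in blob storage.
DROP TABLE States;

-- Re-running the original CREATE statement restores the table over the existing data.
CREATE EXTERNAL TABLE States (StateCode STRING, StateName STRING)
STORED AS SEQUENCEFILE;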
The data analysts also want to be able to aggregate data by temporal hierarchies (year, month, and day) and
create reports that show month and day names. Hive does not support a datetime data type, and provides no
functions to convert dates to constituent parts. While many client applications used to analyze and report data
support this kind of functionality, the data analysts want to be able to generate reports without relying on
specific client application capabilities.
To support date-based hierarchies and reporting, the data analysts created a table containing various date
attributes that can be used as a lookup table for date codes in the tornado data. The creation of a date table like
this is a common pattern in relational data warehouses.
HiveQL
USE DataWarehouse;
CREATE EXTERNAL TABLE Dates
(DateCode STRING,
CalendarDate STRING,
DayOfMonth INT,
MonthOfYear INT,
Year INT,
DayOfWeek INT,
WeekDay STRING,
Month STRING)
STORED AS SEQUENCEFILE;
Finally, the data analysts created a table for the tornado data itself. Since this table is likely to be large, and
many queries will filter by year, they decided to partition the table on a Year column, as shown in the following
HiveQL statement.
HiveQL
USE DataWarehouse;
CREATE EXTERNAL TABLE Tornadoes
(DateCode STRING,
StateCode STRING,
EventTime STRING,
Category INT,
Injuries INT,
Fatalities INT,
PropertyLoss DOUBLE,
CropLoss DOUBLE,
StartLatitude DOUBLE,
StartLongitude DOUBLE,
EndLatitude DOUBLE,
EndLongitude DOUBLE,
LengthMiles DOUBLE,
WidthYards DOUBLE)
PARTITIONED BY (Year INT)
STORED AS SEQUENCEFILE;
Loading Data into the Data Warehouse
The tornado source data is in tab-delimited text format, and the data analysts have used Excel to create lookup
tables for dates and states in the same format. To simplify loading this data into the Sequence File formatted
tables created previously, they decided to create staging tables in Text File format to which the source data will
be uploaded; and then use a HiveQL INSERT statement to load data from the staging tables into the data
warehouse tables. This approach also has the advantage of enabling the data analysts to query the staging
tables and validate the data before loading it into the data warehouse.

We followed standard data warehousing practice by loading the data into staging tables where it can be
validated before being added to the data warehouse.

Creating the Staging Tables
The following HiveQL statements were used to create the staging tables in the default database.
HiveQL
CREATE EXTERNAL TABLE StagedStates
(StateCode STRING,
StateName STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/staging/states';

CREATE EXTERNAL TABLE StagedDates
(DateCode STRING,
CalendarDate STRING,
DayOfMonth INT,
MonthOfYear INT,
Year INT,
DayOfWeek INT,
WeekDay STRING,
Month STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/staging/dates';

CREATE EXTERNAL TABLE StagedTornadoes
(DateCode STRING,
StateCode STRING,
EventTime STRING,
Category INT,
Injuries INT,
Fatalities INT,
PropertyLoss DOUBLE,
CropLoss DOUBLE,
StartLatitude DOUBLE,
StartLongitude DOUBLE,
EndLatitude DOUBLE,
EndLongitude DOUBLE,
LengthMiles DOUBLE,
WidthYards DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/staging/tornadoes';
Loading Static Data
The state data consists of a state code and state name for each US state (plus the District of Columbia and Puerto Rico), and is unlikely to change. Similarly, the date data consists of a row for each day between January 1st 1934 and December 31st 2013. It will be extended to include another year's worth of data at the end of each year, but will otherwise remain unchanged.
These tables can be loaded using a one-time manual process; the source data file for each of the staging tables is
uploaded to the appropriate folder on which the table is based, and the following HiveQL statements are used to
load data from the staging tables into the data warehouse tables.
HiveQL
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

FROM StagedStates s
INSERT INTO TABLE DataWarehouse.States
SELECT s.StateCode, s.StateName;

FROM StagedDates s
INSERT INTO TABLE DataWarehouse.Dates
SELECT s.DateCode, s.CalendarDate, s.DayOfMonth, s.MonthOfYear,
s.Year, s.DayOfWeek, s.WeekDay, s.Month;
Notice that these statements apply compression to the output of the queries, which improves performance and
reduces the time taken to load the tables.
Implementing a Repeatable Load Process
Unlike states and dates, the tornado data will be loaded as frequently as new tornado data becomes available.
This requirement made it desirable to implement a repeatable, automated solution for loading tornado data. To
accomplish this, the data analysts enlisted the help of a software developer to create a utility for loading new
tornado data.
The .NET SDK for Hadoop contains several classes that are very useful for automating a range of
operations related to working with HDInsight.
The developer used Microsoft Visual C# to implement a Windows console application that automates the load
process, leveraging the WebHDFSClient and WebHCatHttpClient classes from the Microsoft .NET SDK for
Hadoop. The application consists of a single Program class containing the following code:
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Hadoop.WebHDFS;
using Microsoft.Hadoop.WebHDFS.Adapters;
using Microsoft.Hadoop.WebHCat.Protocol;

namespace DWLoader
{
class Program
{
const string cluster_uri = "https://adatum-cluster.azurehdinsight.net:563";
const string hadoop_user = "adatum";
const string hadoop_password = "!Pa$$w0rd!";
const string asv_account = "adatum-storage";
const string asv_key = "1234567890abcdefghijklmnopqrstuvwxyz=";
const string asv_container = "adatum-cluster";

static void Main(string[] args)
{
UploadFile(@"C:\Data\Tornadoes.txt",
@"/staging/tornadoes");
LoadTable().Wait();
Console.WriteLine("\nPress any key to continue.");
Console.ReadKey();
}

static void UploadFile(string source, string destination)
{
var storageAdapter = new BlobStorageAdapter(asv_account, asv_key,
asv_container, true);
var HDFSClient = new WebHDFSClient(hadoop_user, storageAdapter);
Console.WriteLine("Deleting existing files...");
HDFSClient.DeleteDirectory(destination, true).Wait();
HDFSClient.CreateDirectory(destination).Wait();
Console.WriteLine("Uploading new file...");
HDFSClient.CreateFile(source, destination + "/data.txt").Wait();
}

private static async Task LoadTable()
{
var webHCat = new WebHCatHttpClient(new System.Uri(cluster_uri),
hadoop_user, hadoop_password);
webHCat.Timeout = TimeSpan.FromMinutes(5);
Console.WriteLine("Starting data load into data warehouse...");
// IMPORTANT - THE FOLLOWING STATEMENT SHOULD BE ON A SINGLE LINE.
string command =
@"SET mapred.output.compression.type=BLOCK; SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET hive.exec.dynamic.partition.mode=nonstrict; FROM StagedTornadoes s
INSERT INTO TABLE DataWarehouse.Tornadoes PARTITION (Year) SELECT s.DateCode,
s.StateCode, s.EventTime, s.Category, s.Injuries, s.Fatalities, s.PropertyLoss,
s.CropLoss, s.StartLatitude, s.StartLongitude, s.EndLatitude, s.EndLongitude,
s.LengthMiles, s.WidthYards, SUBSTR(s.DateCode, 1, 4) Year;";
var job = await webHCat.CreateHiveJob(command, null, null,
"LoadDWData", null);
var result = await job.Content.ReadAsStringAsync();
Console.WriteLine("Job ID: " + result);
}
}
}

The line defining the command string in the code above must be all on the same line, and not broken over several lines as we've had to do here due to page width limitations.
This code uses the WebHDFSClient class to delete the /staging/tornadoes directory on which the
StagedTornadoes table is based, and then recreates the directory and uploads the new data file. The code then
uses the WebHCatHttpClient class to start an HCatalog job that uses a HiveQL statement to insert the data from
the staging table into the Tornadoes table in the data warehouse.

You can download the Microsoft .NET SDK for Hadoop from CodePlex or install it with NuGet.
Optimizing Query Performance
Having loaded the data warehouse with data, the data analysts set about exploring how to optimize query
performance. The data warehouse will be used as a source for multiple reporting and analytical activities, so
understanding how to reduce the time it takes to retrieve results from a query is important.
Using Hive Session Settings to Improve Performance
Hive includes a number of session settings that can affect the way a query is processed, so the data analysts
decided to start by investigating the effect of these settings. The exploration started with the baseline query in
the following code sample.

HiveQL
SELECT s.StateName, d.MonthOfYear,
SUM(t.PropertyLoss) PropertyCost, SUM(t.CropLoss) CropCost
FROM DataWarehouse.States s
JOIN DataWarehouse.Tornadoes t ON t.StateCode = s.StateCode
JOIN DataWarehouse.Dates d ON t.DateCode = d.DateCode
WHERE t.Year = 2010
GROUP BY s.StateName, d.MonthOfYear;
The data analysts executed the baseline query and observed how long it took to complete. They then explored
the effect of compressing the map output and intermediate data generated by the map and reduce jobs (which
are created by Hive to process the query) by using the following code.

HiveQL
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET hive.exec.compress.intermediate=true;

SELECT s.StateName, d.MonthOfYear,
SUM(t.PropertyLoss) PropertyCost, SUM(t.CropLoss) CropCost
FROM DataWarehouse.States s
JOIN DataWarehouse.Tornadoes t ON t.StateCode = s.StateCode
JOIN DataWarehouse.Dates d ON t.DateCode = d.DateCode
WHERE t.Year = 2010
GROUP BY s.StateName, d.MonthOfYear;

This revised query resulted in an improved execution time, which improved further as the volume of data in the
tables increased. The analysts concluded that applying compression settings is likely to be beneficial for typical
data warehouse queries involving large volumes of data.

Many optimization techniques in HDInsight have a major impact only when you have large volumes of
data to process; typically tens of gigabytes. With small datasets you may even see some degradation of
performance from typical optimization techniques.
When examining the query logs for the baseline query, the analysts noted that the job used a single reducer.
They decided to investigate the effect of using multiple reducers by explicitly setting the number, as shown in
the following code.

HiveQL
SET mapred.reduce.tasks=2;

SELECT s.StateName, d.MonthOfYear,
SUM(t.PropertyLoss) PropertyCost, SUM(t.CropLoss) CropCost
FROM DataWarehouse.States s
JOIN DataWarehouse.Tornadoes t ON t.StateCode = s.StateCode
JOIN DataWarehouse.Dates d ON t.DateCode = d.DateCode
WHERE t.Year = 2010
GROUP BY s.StateName, d.MonthOfYear;

An alternative approach to controlling the number of reducers would be to enforce a limit on
the number of bytes allocated to a single reducer by setting the
hive.exec.reducers.bytes.per.reducer property.

With small volumes of data, and clusters with only a few nodes, this approach offered no reduction in execution
times; and in some cases was detrimental to query performance. However, with larger volumes of data and
multi-node clusters the analysts found that forcing the use of multiple reducers could sometimes improve
performance.
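As noted above, an alternative to fixing the reducer count is to cap the amount of data handled by each reducer and let Hive derive the number of reducers from the size of the input. The following sketch applies this setting to the baseline query; the 256 MB value is purely illustrative and would need to be tuned against the actual data volumes.
HiveQL
-- Let Hive calculate the number of reducers from the data size (value shown is illustrative).
SET hive.exec.reducers.bytes.per.reducer=268435456;

SELECT s.StateName, d.MonthOfYear,
SUM(t.PropertyLoss) PropertyCost, SUM(t.CropLoss) CropCost
FROM DataWarehouse.States s
JOIN DataWarehouse.Tornadoes t ON t.StateCode = s.StateCode
JOIN DataWarehouse.Dates d ON t.DateCode = d.DateCode
WHERE t.Year = 2010
GROUP BY s.StateName, d.MonthOfYear;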
The baseline query includes two joins in which a large table (Tornadoes) is joined to two smaller tables (States
and Dates). Hive includes a built-in optimization named map-joins, which is designed to improve query
performance in exactly this scenario. To force a map-join the analysts modified the query as shown in the
following code sample.

HiveQL
SET hive.auto.convert.join=true;
SET hive.smalltable.filesize=25000000;

SELECT s.StateName, d.MonthOfYear,
SUM(t.PropertyLoss) PropertyCost, SUM(t.CropLoss) CropCost
FROM DataWarehouse.Tornadoes t
JOIN DataWarehouse.Dates d ON t.DateCode = d.DateCode
JOIN DataWarehouse.States s ON t.StateCode = s.StateCode
WHERE t.Year = 2010
GROUP BY s.StateName, d.MonthOfYear;

Map-join optimization works by loading the smaller table into memory, reducing the time taken to perform
lookups when joining keys from the larger table. In practice, the analysts found that the technique improved
performance significantly when working with extremely large volumes of data and queries that typically take
tens of minutes to execute, but the effect was less noticeable with small to medium tables where the execution
time is already within minutes or seconds.

For more information about using Hive settings to optimize query performance, see Denny Lee's blog article Optimizing Joins running on HDInsight Hive on Azure at GFS.

Using Indexes
Indexes are a common way to improve query performance in relational databases, and Hive provides extensible
support for indexing Hive tables in an HDInsight cluster. The data analysts therefore decided to investigate the
effect of indexes on query performance. The baseline query for this investigation is shown in the following code
sample.
HiveQL
SELECT s.StateName,SUM(t.PropertyLoss) PropertyCost, SUM(t.CropLoss) CropCost
FROM DataWarehouse.States s
JOIN DataWarehouse.Tornadoes t ON t.StateCode = s.StateCode
JOIN DataWarehouse.Dates d ON t.DateCode = d.DateCode
WHERE d.MonthOfYear = 6 AND t.Year = 2010
GROUP BY s.StateName;

The Tornadoes table is partitioned on the Year column, which should reduce the time taken for queries that
filter on year. However, the baseline query also filters on month, which is represented by the MonthOfYear
column in the Dates table. The data analysts used the following HiveQL statements to create and build an index
on this column.
HiveQL
USE DataWarehouse;

CREATE INDEX idx_MonthOfYear
ON TABLE Dates (MonthOfYear)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

ALTER INDEX idx_MonthOfYear ON Dates REBUILD;

After the index had been built, the baseline query was re-executed to determine the effect of the index. As with
other optimization techniques discussed previously the effect was most noticeable when querying large volumes
of data. When the table contained a small amount of data the index had little or no effect, and in some cases
proved detrimental to performance.
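Where testing shows that an index does not justify the cost of building and maintaining it, the index can simply be removed, as in the following sketch.
HiveQL
USE DataWarehouse;

-- Remove the index if it provides no measurable benefit for typical queries.
DROP INDEX IF EXISTS idx_MonthOfYear ON Dates;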

For more information about optimizing query and transformation performance see the section Tuning
the Solution in Chapter 4 of this guide.

Analyzing Data from the Data Warehouse
Having built and populated the data warehouse and investigated techniques for improving query performance,
the data analysts were ready to start analyzing the data in the data warehouse tables.
Analyzing Data in Excel
The data analysts typically use Excel for analysis, so the first task they undertook was to create a PowerPivot
data model in an Excel workbook. The data model uses the Hive ODBC driver to import data from the data
warehouse tables in the HDInsight cluster and defines relationships and hierarchies that can be used to
aggregate the data, as shown in Figure 2.


Figure 2
A PowerPivot data model based on a Hive data warehouse

The business analysts could then use the data model to explore the data by creating a PivotTable showing the
total cost of property and crop damage by state and time period, as shown in Figure 3.


Figure 3
Analyzing the data with a PivotTable
Analyzing Data with GeoFlow
The data includes geographic locations as well as the date and time of each tornado, making it ideal for
visualization using the GeoFlow add-in for Excel. GeoFlow enables analysts to create interactive tours that show
summarized data on a map, and animate it based on a timeline.
The analysts created a GeoFlow tour that consists of a single scene with the following two layers:
Layer 1: Accumulating property and crop damage costs shown as a stacked column chart by state.
Layer 2: Tornado category shown as a heat map by latitude and longitude.
The GeoFlow designer for the tour is shown in Figure 4.


Figure 4
Designing a GeoFlow tour

GeoFlow is a great tool for generating interactive and highly immersive presentations of data, especially
where the data has a date or time component that allows you to generate a "changes over time" animation.

The scene in the GeoFlow tour was then configured to animate movement across the map as the timeline
progresses. While the tour runs, the property and crop losses are displayed as columns that accumulate each
month, and the tornadoes are displayed as heat maps based on the tornado category (defined in the Enhanced
Fujita scale), as shown in Figure 5.

Figure 5
Viewing a GeoFlow tour
After the tour has finished the analysts can explore the final costs incurred in each state by zooming and moving
around the map viewer.
Summary
In this scenario you have seen how A. Datum created a data warehouse in an HDInsight cluster by using a Hive
database and tables. You have also seen how the analysts loaded data into the data warehouse and used various
techniques to optimize query performance before consuming the data in the data warehouse from an analytical
tool such as Excel.
If you would like to work through the solution to this scenario yourself, you can download the files and a step-
by-step guide. These are included in the related Hands-on Labs available from
http://wag.codeplex.com/releases/view/103405.

Scenario 4: Enterprise BI Integration
This scenario explores ways in which Big Data processing with HDInsight can be integrated into an enterprise
business intelligence (BI) solution. The emphasis in this scenario is on the challenges and techniques associated
integrating data from HDInsight into a BI ecosystem based on Microsoft SQL Server and Office technologies;
including integration at the report, corporate data model, and data warehouse layers of the BI solution.
The data ingestion and processing elements of the example used in this scenario have been deliberately kept
simple in order to focus on the integration techniques. In a real-world solution the challenge of obtaining the
source data, loading it to the HDInsight cluster, and using Map/Reduce code, Pig, or Hive to process it before
consuming it in a BI infrastructure are likely to be more complex than described in this scenario. For detailed
guidance about ingesting and processing data with HDInsight, see the main chapters and other scenarios in this
guide.

This chapter describes the scenario in general and the solutions applied to meet the goals of the
scenario. If you would like to work through the solution to this scenario yourself you can download
the files and a step-by-step guide. These are included in the related Hands-on Labs available from
http://wag.codeplex.com/releases/view/103405.
Introduction to Adventure Works
Adventure Works is a fictional company that manufactures and sells bicycles and cycling equipment. Adventure
Works sells its products through an international network of resellers and also through its own e-commerce site,
which is hosted in an Internet Information Services (IIS)-based datacenter.
As a large-scale multinational company, Adventure Works has already made a considerable investment in data
and BI technologies to support formal corporate reporting, as well as for business analysis.
The Existing Enterprise BI Solution
The BI solution at Adventure Works is built on an enterprise data warehouse hosted in SQL Server Enterprise
Edition. SQL Server Integration Services (SSIS) is used to refresh the data warehouse with new and updated data
from line of business systems. This includes sales transactions, financial accounts, and customer profile data.

The high-level architecture of the Adventure Works BI solution is shown in Figure 1.


Figure 1
The Adventure Works enterprise BI solution

The enterprise data warehouse is based on a dimensional design in which multiple dimensions of the business
are conformed across aggregated measures. The dimensions are implemented as dimension tables, and the
measures are stored in fact tables at the lowest common level of grain (or granularity).

A partial schema of the data warehouse is shown in Figure 2.


Figure 2
Partial data warehouse schema

Figure 2 shows only a subset of the data warehouse schema. Some tables and
columns have been omitted for clarity.


The data warehouse supports a corporate data model, which is implemented as a SQL Server Analysis Services
(SSAS) tabular database. This data model provides a cube for analysis in Microsoft Excel and reporting in SQL
Server Reporting Services (SSRS), and is shown in Figure 3.


Figure 3
An SSAS tabular data model


The SSAS data model supports corporate reporting, including a sales performance dashboard that shows a
scorecard of key performance indicators (KPIs), as shown in Figure 4. Additionally, the data model is used by
Adventure Works business users to create PivotTables and PivotCharts in Microsoft Excel.


Figure 4
An SSRS report based on the SSAS data model

We want to empower our users by giving them access to corporate information that can help them to
make better business decisions. For this reason we implement a self-service approach by exposing interesting
data at the corporate data model level.
In addition to the managed reporting and analysis supported by the corporate data model, Adventure Works has
empowered business analysts to engage in self-service BI activities, including the creation of SSRS reports with
Report Builder and the use of PowerPivot in Excel to create personal data models directly from tables in the data
warehouse, as shown in Figure 5.


Figure 5
Creating a personal data model with PowerPivot

If you are not familiar with the terminology and design of enterprise BI systems, you will find
the overview in Appendix B of this guide provides useful background information.

The Analytical Goals
The existing BI solution provides comprehensive reporting and analysis of important business data. However,
the IIS web server farm on which the e-commerce site is hosted generates log files, which are retained and used
for troubleshooting purposes but until now have never been considered a viable source of business information.
The web servers generate a new log file each day, and each log contains details of every request received and
processed by the web site. This provides a huge volume of data that could potentially provide useful insights
into customers' activity on the e-commerce site.
A small subset of the log file data is shown in the following example.
Log file contents
#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2008-01-01 09:15:23
#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-
status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referrer)
2008-01-01 00:01:00 198.51.100.2 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100
2008-01-01 00:04:00 198.51.100.5 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100
Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com
2008-01-01 00:05:00 198.51.100.6 - 192.0.0.1 80 GET /product.aspx productid=BL-2036
200 1000 1000 100
Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com
2008-01-01 00:06:00 198.51.100.7 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100
Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com
/product.aspx productid=CN-6137 200 1000 1000 100
Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com

The ability to analyze the log data and summarize website activity over time would help the business measure
the amount of data transferred during web requests, and potentially correlate web activity with sales
transactions to better understand trends and patterns in e-commerce sales. However, the large volume of log
data that would need to be processed in order to extract these insights has prevented the company from
attempting to include the log data in the enterprise data warehouse. The company has recently decided to use
HDInsight to process and summarize the log data so that it can be reduced to a more manageable volume, and
integrated into the enterprise BI ecosystem.

Of course, you also need to consider how you will access the log files and load them into HDInsight,
especially if they are distributed across a server farm in another domain or an off-premises location. The section
Loading Web Server Log Files in Chapter 3 of this guide explores some of the approaches you might follow.

The HDInsight Solution
The HDInsight solution is based on a Windows Azure HDInsight cluster, enabling Adventure Works to
dynamically increase or reduce cluster resources as required. The IIS log files are uploaded to Windows Azure
blob storage each day using a custom file upload utility. The files are then retained in blob storage to support
data processing with HDInsight.

For information and guidance on uploading data to HDInsight, and some of the ways you can
maximize performance by compressing the data and combining or splitting files, see Chapter
3 of this guide.

Creating the Hive Tables
To support the goal for summarization of the data, the BI developer intends to create a Hive table that can be
queried from client applications such as Excel. However, the raw source data includes header rows that must be
excluded from the analysis, and the tab-delimited text format of the source data will not provide optimal query
performance as the volume of data grows.
The developer therefore decided to create a temporary staging table based on the uploaded data, and use the
staging table as a source for a query that loads the required data into a permanent table that is optimized for
the typical analytical queries that will be used. The following HiveQL statement has been used to define the
staging table over the folder containing the log files:

HiveQL
CREATE TABLE log_staging
(logdate STRING,
logtime STRING,
c_ip STRING,
cs_username STRING,
s_ip STRING,
s_port STRING,
cs_method STRING,
cs_uri_stem STRING,
cs_uri_query STRING,
sc_status STRING,
sc_bytes INT,
cs_bytes INT,
time_taken INT,
cs_User_Agent STRING,
cs_Referrer STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '32'
STORED AS TEXTFILE LOCATION '/data';
This Hive table defines a schema for the log file, making it possible to use a query that filters the rows in order to
load the required data into a permanent table for analysis. Notice that the staging table is based on the /data
folder but it is not defined as EXTERNAL, so dropping the staging table after the required rows have been loaded
into the permanent table will delete the source files that are no longer required.
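For example, once the permanent table has been populated, a statement such as the following could be used to remove the staging table. Because log_staging is not EXTERNAL, the source log files in the /data folder are deleted along with the table definition.
HiveQL
-- Dropping the non-EXTERNAL staging table also deletes the source files in the /data folder.
DROP TABLE log_staging;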
When designing the permanent table for analytical queries, the BI developer has decided to partition the data
by year and month to improve query performance when extracting data. To achieve this, a second Hive
statement is used to define a partitioned table. Notice the PARTITIONED BY clause near the end of the following
script. This instructs Hive to add two columns named year and month to the table, and to partition the data
loaded into the table based on the values inserted into these columns.

HiveQL
CREATE TABLE iis_log
(logdate STRING,
logtime STRING,
c_ip STRING,
cs_username STRING,
s_ip STRING,
s_port STRING,
cs_method STRING,
cs_uri_stem STRING,
cs_uri_query STRING,
sc_status STRING,
sc_bytes INT,
cs_bytes INT,
time_taken INT,
cs_User_Agent STRING,
cs_Referrer STRING)
PARTITIONED BY (year INT, month INT)
STORED AS SEQUENCEFILE;

Next, a Hive script is used to load the data into the new iis_log Hive table. This script takes the values from the
columns in the log_staging Hive table, calculates the values for the year and month of each row, and inserts the
rows into the partitioned iis_log Hive table. Recall also that the source log data includes a number of header
rows that are prefixed with a # character, which could cause errors or add unnecessary complexity to
summarizing the data. To resolve this, the HiveQL statement includes a WHERE clause that ignores rows starting
with # so that they are not loaded into the permanent table.

HiveQL
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=
org.apache.hadoop.io.compress.GzipCodec;
SET hive.exec.dynamic.partition.mode=nonstrict;
FROM log_staging s
INSERT INTO TABLE iis_log PARTITION (year, month)
SELECT s.logdate, s.logtime, s.c_ip, s.cs_username, s.s_ip,
s.s_port, s.cs_method, s.cs_uri_stem, s.cs_uri_query,
s.sc_status, s.sc_bytes, s.cs_bytes, s.time_taken,
s.cs_User_Agent, s.cs_Referrer,
SUBSTR(s.logdate, 1, 4) year, SUBSTR(s.logdate, 6, 2) month
WHERE SUBSTR(s.logdate, 1, 1) <> '#';

To maximize performance the script includes statements that specify the output from the query should be
compressed. For details of how you can use the Hadoop configuration settings in this way see the section
Tuning the Solution in Chapter 4 of this guide. The code also sets the dynamic partition mode to nonstrict,
enabling rows to be inserted into the appropriate partitions dynamically based on the values of the partition
columns.

Indexing the Hive Table
To further improve performance the BI developer at Adventure Works decided to index the iis_log table on the
date, because most queries will need to select and sort the results by date. The following HiveQL statement
creates an index on the logdate column of the table.
HiveQL
CREATE INDEX idx_logdate
ON TABLE iis_log (logdate)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
When data is added to the iis_log table the index can be updated using the following HiveQL statement.
HiveQL
ALTER INDEX idx_logdate ON iis_log REBUILD;



Our tests revealed that indexing the tables provided only a small improvement in performance of queries,
and that building the index took longer than the time we saved running the query. However, the results depend
on factors such as the volume of source data, and so you should experiment to see if indexing is a useful
optimization technique in each scenario.

Returning the Required Data
The data in HDInsight is now ready for use in the corporate BI system and can be consumed by querying the
iis_log table. For example, the following script returns all of the data for all years and months.
HiveQL
SELECT * FROM iis_log

Where a more restricted dataset is required the BI developer or business user can use a script that selects on the
year and month columns, and transforms the data as required. For example, the following script extracts just
the data for the first quarter of 2012, aggregates the number of hits for each day (the logdate column contains
the date in the form yyyy-mm-dd), and returns a dataset with two columns: the date and the total number of
page hits.
HiveQL
SELECT logdate, COUNT(*) pagehits
FROM iis_log
WHERE year = 2012 AND month < 4
GROUP BY logdate
Report-Level Integration
Now that the log data is encapsulated in a Hive table, HiveQL queries can be used to summarize the log entries
and extract aggregated values for reporting and analysis. Initially, business analysts at Adventure Works want to
try to find a relationship between the number of website hits (obtained from the web server log data in
HDInsight) and the number of items sold (available from the enterprise data warehouse). Since only the business
analysts require this combined data, there is no need at this stage to integrate the web log data from HDInsight
into the entire enterprise BI solution. Instead, a business analyst can use PowerPivot to create a personal data model in Excel specifically for this mash-up analysis.
Adding Data from a Hive Table to a PowerPivot Model
The business analyst starts with the PowerPivot model shown previously in Figure 5, which includes a Date table
and an Internet Sales table based on the DimDate and FactInternetSales tables in the data warehouse. To add
IIS log data from HDInsight the business analyst uses an ODBC connection based on the Hive ODBC driver, as
shown in Figure 6.

The ODBC connection to the HDInsight cluster is typically defined in a data source name (DSN) on the local
computer, which makes it easy to define a connection for programs that will access data in the cluster. The DSN
encapsulates a connection string such as this:
Driver={Hive};DATABASE=default;AUTHENTICATION=3;UID=Username;PWD=Password;
HOST=awhdinsight.cloudapp.net;PORT=563;FRAMED=0;BINARY_FETCH=0;BATCH_SIZE=65536


Figure 6
Creating an ODBC connection to Hive on HDInsight

After the connection has been defined and tested, the business analyst uses the following HiveQL query to
create a new table named Page Hits that contains aggregated log data from HDInsight.
HiveQL
SELECT logdate, COUNT(*) hits
FROM iis_log
GROUP BY logdate

This query returns a single row for each distinct date for which log entries exist, along with a count of the
number of page hits that were recorded on that date. The logdate values in the underlying Hive table are
defined as text, but the yyyy-mm-dd format of the text values means that the business analyst can simply
change the data type for the column in the PowerPivot table to Date; making it possible to create a relationship
that joins the logdate column in the Page Hits table to the Date column in the Date table, as shown in Figure 7.


Figure 7
Creating a relationship in PowerPivot


When you are designing Hive queries that return tables you want to integrate with your BI system, you
should plan ahead by considering the appropriate format for columns that contain key values you will use in the
relationships with existing tables.
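For example, if the client data model used a month-level calendar table rather than a daily one, the Hive query could return a key column already formatted to match it. The following sketch is a hypothetical variation on the Page Hits query that builds a yyyy-MM key from the partition columns.
HiveQL
-- Hypothetical example: return a month key in yyyy-MM format that can be related
-- directly to a month-level calendar table in the client data model.
SELECT CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING), 2, '0')) monthkey,
COUNT(*) hits
FROM iis_log
GROUP BY year, month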

With the log data imported into the PowerPivot model, and the relationship between the Page Hits and Date tables in place, the business analyst can now use Excel to analyze the data and try to correlate website activity in the form of page hits with sales transactions in terms of the number of units sold. Figure 8 shows how
the business analyst can use Power View in Excel to visually compare page hits and sales over a six month period
and by quarter, determine weekly patterns of website activity, and filter the visualizations to show comparisons
for a specific weekday.


Figure 8
Using Power View in Excel to analyze data from HDInsight and the data warehouse

By integrating IIS log data from HDInsight with enterprise BI data at the report level, business analysts can create
mash-up reports and analyses without impacting the BI infrastructure used for corporate reporting. However,
after using this report-level integration to explore the possibilities of using IIS log data to increase understanding
of the business, it has become apparent that the log data could be useful to a wider audience of users than just
business analysts, and for a wider range of business processes.
Corporate Data Model Integration
Senior executives of the e-commerce division want to expose the newly discovered information from the web
server log files to a wider audience, and use it in BI-focused business processes. Specifically, they want to include
the ratio of web page hits to sales as a core metric that can be included as a key performance indicator (KPI) for
the business in a scorecard. The goal is to achieve a hit to sale ratio of around 3% (in other words, three items
are sold for every hundred web page hits).
Scorecards and dashboards for Adventure Works are currently based on the SSAS corporate data model, which
is also used to support formal reports and analytical business processes. The corporate data model is
implemented as an SSAS database in tabular mode, so the process to add a table for the IIS log data is similar to
the one used to import the results of a HiveQL query into a PowerPivot model. A BI developer uses SQL Server
Data Tools to add an ODBC connection to the HDInsight cluster and create a new table named Page Hits based
on the following query. Notice that this query includes more columns than the one used previously in the
personal data model, making it useful for more kinds of analysis by a wider audience.

HiveQL
SELECT logdate, SUM(sc_bytes) sc_bytes, SUM(cs_bytes) cs_bytes, COUNT(*) pagehits
FROM iis_log
GROUP BY logdate

The fact that we are using SSAS in tabular mode makes it possible to connect to an ODBC source such as
Hive. If we had been using SSAS in multidimensional mode we would have had to either extract the data from
HDInsight into an OLE DB compliant data source, or base the data model on a SQL Server database that has a
remote server connection over ODBC to the Hive tables.
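
The second of these approaches could be implemented with a linked server definition similar to the following
sketch. This code is not part of the Adventure Works solution; it assumes that a system DSN named HiveDSN has
already been configured on the SQL Server machine using the Hive ODBC driver, and that the linked server name
HDINSIGHT is arbitrary.
Transact-SQL
-- Sketch only: define a linked server over ODBC using the OLE DB Provider for ODBC Drivers.
EXEC master.dbo.sp_addlinkedserver
    @server = N'HDINSIGHT',        -- name used to refer to the linked server
    @srvproduct = N'Hive',
    @provider = N'MSDASQL',        -- Microsoft OLE DB Provider for ODBC Drivers
    @provstr = N'DSN=HiveDSN;';

-- Query the Hive table through the linked server.
SELECT * FROM OPENQUERY(HDINSIGHT,
    'SELECT logdate, COUNT(*) AS hits FROM iis_log GROUP BY logdate');

A view defined over the OPENQUERY call could then be used as the source table for a multidimensional model.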

After the Page Hits table has been created and the data imported into the model, the data type of the logdate
column is changed to Date and a relationship is created with the Date table in the same way as in the
PowerPivot data model discussed earlier. However, one significant difference between PowerPivot models and
SSAS tabular models is that SSAS does not create implicit aggregated measures from numeric columns in the
same way as PowerPivot does.
The Page Hits table contains a row for each date, with the total bytes sent and received, and the total number of
page hits for that date. The BI developer created explicit measures to aggregate the sc_bytes, cs_bytes, and
pagehits values across multiple dates, based on the data analysis expression (DAX) formulae shown in Figure 9.

Figure 9
Creating measures in an SSAS tabular data model

These measures include DAX expressions that calculate the sum of sc_bytes as a measure named Bytes Sent,
the sum of cs_bytes as a measure named Bytes Received, and the sum of page hits as a measure named Hits.
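The exact formulae are shown only in Figure 9, but they are likely to be simple SUM aggregations over the
corresponding columns, along the lines of the following sketch.
DAX
Bytes Sent:=SUM([sc_bytes])
Bytes Received:=SUM([cs_bytes])
Hits:=SUM([pagehits])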
Additionally, to support the requirement to track page hits to sales performance against a target of 3%, the
following measures have been added to the Internet Sales table.
DAX
Actual Units:=SUM([Order Quantity])
Target Units:=([Hits]*0.03)

A KPI is then defined on the Actual Units measure, as shown in Figure 10.


Figure 10
Defining a KPI in a tabular data model

When the changes to the data model are complete, it is deployed to an SSAS server and fully processed with the
latest data. It can then be used as a data source for reports and analytical tools. For example, Figure 11 shows
that the dashboard report used to summarize business performance has been modified to include a scorecard
for the hits to sales ratio KPI that was added to the data model.

Figure 11
A report showing a KPI from a corporate data model

Dashboards are a great way to present high-level views of information, and are often appreciated most
by business managers because they provide an easy way to keep track of performance of multiple sectors across
the organization. By including the ability to drill down into the information, you make the dashboard even more
valuable to these types of users.

Additionally, because the data from HDInsight has now been integrated at the corporate data model level, it is
more readily available to a wider range of users who may not have the necessary skills and experience to create
their own data models or reports from an HDInsight source. For example, Figure 12 shows how a network
administrator in the IT department can use the corporate data model as a source for a PivotChart in Excel
showing data transfer information for the e-commerce site.

Figure 12
Using a corporate data model in Excel
Data Warehouse Integration
The IIS log entries include the query string passed to the web page, which for pages that display information
about a product includes a productid parameter with a value that matches the relevant product code. Business
analysts at Adventure Works would like to use this information to analyze page hits for individual products, and
include this information in managed and self-service reports.
Details of products are already stored in the data warehouse, but since product is implemented as a slowly
changing dimension, the product records in the data warehouse are uniquely identified by a surrogate key and
not by the original product code. Although the product code referenced in the query string is retained as an
alternate key, it is not guaranteed to be unique if details of the product have changed over time.
The most problematic task for achieving integration with BI systems at data warehouse level is typically
matching the keys in the source data with the correct surrogate key in the data warehouse tables where changes
to the existing data over time prompt the use of an alternate key. Appendix B of this guide explores this issue in more
more detail.

The requirement to match the product code to the surrogate key for the appropriate version of the product
makes integration at the report or corporate data model levels problematic. It is possible to perform complex
lookups to find the appropriate surrogate key for an alternate key at the time a specific page hit occurred at any
level (assuming both the surrogate and alternate keys for the products are included in the data model or report
dataset). However, it is more practical to integrate the IIS log data into the dimensional model of the data
warehouse so that the relationship with the product dimension (and the date dimension) is present throughout
the entire enterprise BI stack.
Changes to the Data Warehouse Schema
To integrate the IIS log data into the dimensional model of the data warehouse, the BI developer creates a new
fact table in the data warehouse for the IIS log data, as shown in the following Transact-SQL statement.
Transact-SQL
CREATE TABLE [dbo].[FactIISLog](
[LogDateKey] [int] NOT NULL
REFERENCES DimDate(DateKey),
[ProductKey] [int] NOT NULL
REFERENCES DimProduct(ProductKey),
[BytesSent] decimal NULL,
[BytesReceived] decimal NULL,
[PageHits] integer NULL);
The table will be loaded with new log data on a regular schedule as part of the ETL process for the data
warehouse. In common with most data warehouse ETL processes, the solution at Adventure Works makes use of
staging tables as an interim store for new data, making it easier to coordinate data loads into multiple tables and
perform lookups for surrogate key values. A staging table is created in a separate staging schema using the
following Transact-SQL statement:
Transact-SQL
CREATE TABLE staging.IISLog(
[LogDate] nvarchar(50) NOT NULL,
[ProductID] nvarchar(50) NOT NULL,
[BytesSent] decimal NULL,
[BytesReceived] decimal NULL,
[PageHits] int NULL);

Good practice when regularly loading data into a data warehouse is to minimize the amount
of data extracted from each data source so that only data that has been inserted or modified
since the last refresh cycle is included. This minimizes extraction and load times, and reduces
the impact of the ETL process on network bandwidth and storage utilization. There are many
common techniques you can use to restrict extractions to only modified data, and some data
sources support change tracking or change data capture (CDC) capabilities to simplify this.
In the absence of support in Hive tables for restricting extractions to only modified data, the BI developers at
Adventure Works have decided to use a common pattern that is often referred to as a high water mark
technique. In this pattern the highest log date value that has been loaded into the data warehouse is recorded,
and used as a filter boundary for the next extraction. To facilitate this, the following Transact-SQL statement is
used to create an extraction log table and initialize it with a default value:
Transact-SQL
CREATE TABLE staging.highwater(
[ExtractDate] datetime DEFAULT GETDATE(),
[HighValue] nvarchar(200));

INSERT INTO staging.highwater (HighValue)
VALUES ('0000-00-00');
Extracting Data from HDInsight
After the modifications to the data warehouse schema and staging area have been made, a solution to extract
the IIS log data from HDInsight can be designed. There are many ways to transfer data from HDInsight to a SQL
Server database, including:
Use Sqoop to export the data from HDInsight and push it to SQL Server.
Use PolyBase to synchronize data in an HDInsight cluster with a SQL Server PDW database.
Use SSIS to extract the data from HDInsight and load it into SQL Server.

PolyBase is available only as part of the SQL Server Parallel Data Warehouse (PDW)
appliance. If you are not using PDW you will need to investigate using Sqoop, SSIS, or an
alternative solution.
In the case of the Adventure Works scenario, the data warehouse is hosted on an on-premises server that
cannot be accessed from outside the corporate firewall. Since the HDInsight cluster is hosted externally in
Windows Azure, the use of Sqoop to push the data to SQL Server is not a viable option. In addition, the SQL
Server instance used to host the data warehouse is running SQL Server 2012 Enterprise Edition, not SQL Server
Parallel Data Warehouse (PDW). PolyBase is available only for PDW, and so it cannot be used in this scenario.
The most appropriate option, therefore, is to use SSIS to implement a package that transfers data from
HDInsight to the staging table. Since SSIS is already used as the ETL platform for loading data from other
business sources to the data warehouse, this option also reduces the development and management challenges
for implementing an ETL solution to extract data from HDInsight.
The control flow for an SSIS package to extract the IIS log data from HDInsight is shown in Figure 13.

Figure 13
SSIS control flow for HDInsight data extraction
This control flow performs the following sequence of tasks:
1. An Execute SQL task extracts the HighValue value from the staging.highwater table and stores it in a
variable named HighWaterMark.
2. An Execute SQL task truncates the staging.IISLog table, removing any rows left over from the last time
the extraction process was executed.
3. A Data Flow task transfers the data from HDInsight to the staging table.
4. An Execute SQL task updates the HighValue value in staging.highwater with the highest log date value
that has been extracted.
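
The Transact-SQL statements used by the Execute SQL tasks in steps 1 and 4 are not shown in the guide, but they
might look similar to the following sketch (the exact statements in the Adventure Works package may differ).
Transact-SQL
-- Step 1: retrieve the current high water mark (the single-row result is
-- mapped to the HighWaterMark package variable).
SELECT HighValue FROM staging.highwater;

-- Step 4: record the highest log date value that has just been staged,
-- ready for the next extraction cycle.
UPDATE staging.highwater
SET HighValue = (SELECT MAX(LogDate) FROM staging.IISLog),
    ExtractDate = GETDATE();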

The Data Flow task in step 3 extracts the data from HDInsight and performs any necessary data type validation
and conversion before loading it into the staging table, as shown in Figure 14.


Figure 14
SSIS data flow to extract data from HDInsight

The data flow consists of the following sequence of components:
1. An ODBC Source that extracts data from the iis_log Hive table in the HDInsight cluster.
2. A Data Conversion transformation that ensures data type compatibility and prevents field truncation
issues.
3. An OLE DB Destination that loads the extracted data into the staging.IISLog table.


If there is a substantial volume of data in the Hive table the extraction might take a long
time, so it might be necessary to set the timeout on the ODBC source component to a high
value. Alternatively, the refresh of data into the data warehouse could be carried out
more often, such as every week. You could create a separate Hive table for the extraction
and automate running a HiveQL statement every day to copy the log entries for that day
to the extraction table after they've been uploaded into the iis_page_hits table. At the end
of the week you perform the extraction from the (smaller) extraction table, and then
empty it ready for next week's data.
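
For example, the daily copy into a smaller extraction table might be performed with a HiveQL statement such as
the following sketch. The extraction table name iis_log_extract and the literal date are hypothetical; the
extraction table is assumed to have the same schema as the source table.
HiveQL
-- Sketch only: copy one day's log entries into a smaller extraction table.
INSERT INTO TABLE iis_log_extract
SELECT * FROM iis_log
WHERE logdate = '2013-06-01';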

Every SSIS Data Flow includes an Expressions property that can be used to dynamically assign values to
properties of the data flow or the components it contains, as shown in Figure 15.


Figure 15
Assigning a property expression
In this data flow, the SqlCommand property of the Hive Table ODBC source component is assigned using the
following expression.
SSIS Expression
"SELECT logdate,
REGEXP_REPLACE(cs_uri_query, 'productid=', '') productid,
SUM(sc_bytes) sc_bytes,
SUM(cs_bytes) cs_bytes,
COUNT(*) PageHits
FROM iis_log
WHERE logdate > '" + @[User::HighWaterMark] + "'
GROUP BY logdate,
REGEXP_REPLACE(cs_uri_query, 'productid=', '')"
This expression consists of a HiveQL query to extract the required data, combined with the value of the
HighWaterMark SSIS variable to filter the data being extracted so that only rows with a logdate value greater
than the highest ones already in the data warehouse are included.
Based on the log files for the Adventure Works e-commerce site, the cs_uri_query values in the web server log
file contain either the value - (for requests with no query string) or productid=product-code (where product-
code is the product code for the requested product). The HiveQL query includes a regular expression that parses
the cs_uri_query value and removes the text productid=. The Hive query therefore generates a result set that
includes a productid column, which contains either the product code value or -.

The query string example in this scenario is deliberately simplistic in order to reduce complexity. In a real-
world solution, parsing query strings in a web server log may require a significantly more complex expression,
and may even require a user-defined Java function.
After the data has been extracted it flows to the Data Conversion transformation, which converts the
logdate and productid values to 50-character Unicode strings. The rows then flow to the Staging Table
destination, which loads them into the staging.IISLog table.

The document Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS) contains a wealth of
useful information about using SSIS with HDInsight.

Loading Staged Data into a Fact Table
After the data has been staged, a second SSIS package is used to load it into the fact table in the data
warehouse. This package consists of a control flow that contains a single Execute SQL task to execute the
following Transact-SQL code.
Transact-SQL
INSERT INTO dbo.FactIISLog
(LogDateKey,
ProductKey,
BytesSent,
BytesReceived,
PageHits)
SELECT d.DateKey,
p.ProductKey,
s.BytesSent,
s.BytesReceived,
s.PageHits
FROM staging.IISLog s
JOIN DimDate d ON s.LogDate = d.FullDateAlternateKey
JOIN DimProduct p ON s.productid = p.ProductAlternateKey
AND (p.StartDate <= s.LogDate AND (p.EndDate IS NULL
OR p.EndDate > s.LogDate))
ORDER BY d.DateKey;
The code inserts rows from the staging.IISLog table into the dbo.FactIISLog table, looking up the appropriate
dimension keys for the date and product dimensions. The surrogate key for the date dimension is an integer
value derived from the year, month, and day. The LogDate value extracted from HDInsight is a string in the
format YYYY-MM-DD. SQL Server can implicitly convert values in this format to the Date data type, so a join
can be made to the FullDateAlternateKey column in the DimDate table to find the appropriate surrogate
DateKey value. The DimProduct dimension table includes a row for None (with the ProductAlternateKey
value -), and one or more rows for each product.
Each product row has a unique ProductKey value (the surrogate key) and an alternate key that matches the
product code extracted by the Hive query. However, because product is a slowly changing dimension there may
be multiple rows with the same ProductAlternateKey value, each representing the same product at a different
point in time. When loading the product data, the appropriate surrogate key for the version of the product that
was current when the web page was requested must be looked up based on the alternate key and the start and
end date values associated with the dimension record, so the join for the DimProduct table in the Transact-SQL
code includes a clause to check for a StartDate value that is before the log date, and an EndDate value that is
either after the log date or null (for the record representing the current version of the product member).

For more details about the use of surrogate keys and slowly changing dimensions in a data
warehouse, refer to Appendix B.
Using the Data
Now that the IIS log data has been summarized by HDInsight and loaded into a fact table in the data warehouse,
it can be used in corporate data models and reports throughout the organization. The data has been conformed
to the dimensional model of the data warehouse, and this deep integration enables business users to intuitively
aggregate IIS activity across dates and products. For example, Figure 16 shows how a user can create a Power
View visualization in Excel that includes sales and page view information for product categories and individual
products.


Figure 16
A Power View report showing data that is integrated in the data warehouse

The inclusion of IIS server log data from HDInsight in the data warehouse enables it to be used easily throughout
the entire BI ecosystem, in managed corporate reports and self-service BI scenarios. For example, a business
user can use Report Builder to create a report that includes web site activity as well as sales revenue from a
single dataset, as shown in Figure 17.


Figure 17
Creating a self-service report with Report Builder

When published to an SSRS report server, this self-service report provides an integrated view of business data,
as shown in Figure 18.

Figure 18
A self-service report based on a data warehouse that contains data from HDInsight
Summary
This scenario explored some of the techniques available to integrate Big Data processing in HDInsight with an
enterprise BI solution. The specific integration techniques that are appropriate for your organization may vary
from those shown in this scenario, but in general you should seek to integrate Big Data into your enterprise BI
infrastructure at the most appropriate layer in the BI architecture based on your requirements for audience
reach, integration into existing business processes, support for self-service reporting, and data manageability.
For more guidance about integrating big data into enterprise BI, see Chapter 2 and Appendix B of this guide. If
you would like to work through the solution to this scenario yourself, you can download the files and a step-by-
step guide. These are included in the related Hands-on Labs available from
http://wag.codeplex.com/releases/view/103405.


Appendix A: Data Ingestion Patterns using
the Twitter API
This appendix demonstrates the three main patterns described in Chapter 3 of the guide for loading data into an
HDInsight cluster:
Interactive data ingestion. This approach is typically used for small volumes of data and when
experimenting with datasets.
Automated batch upload. If you decide to integrate an HDInsight query process with your existing
systems you will typically create an automated mechanism for uploading the data into HDInsight.
Real-time data stream capture. If you need to process data that arrives as a stream you must
implement a mechanism that loads this into HDInsight.

The examples in this appendix are related to the Interactive Exploration scenario described in Chapter 8 of the
guide, and in the associated Hands-on Lab. They demonstrate how the BI team at the fictitious company Blue
Yonder Airlines implemented solutions based on the three ingestion patterns to obtain lists of tweets by using
the Twitter API, and load them into Windows Azure blob storage for use in HDInsight queries and for subsequent
analysis.
About the Twitter API
Twitter provides a number of public APIs that you can use to obtain data for analysis.
The Twitter REST-based search API provides a simple way to search for tweets that contain specific query terms.
To return a collection of tweets that contain a search phrase, you simply specify a parameter named q with the
appropriate search string, and optionally an rpp parameter to specify the number of tweets to be returned per
page. For example, the following request returns up to 100 tweets containing the term @BlueYonderAirlines
in Atom format:
http://search.twitter.com/search.atom?q=@BlueYonderAirlines&rpp=100
You can also request the results in RSS format, as shown by the following request:
http://search.twitter.com/search.rss?q=@BlueYonderAirlines&rpp=100
You can restrict the range of results by including since or until parameters in the query term. For example, the
following request returns tweets from March 1st, 2013 onward:
http://search.twitter.com/search.rss?q=@BlueYonderAirlines since:2013-03-01&rpp=100


For more sophisticated filtering you can specify a since_id parameter with the ID of a tweet (usually the ID of the
most recent tweet provided in the Atom or RSS response from a previous request).

If you want to analyze customer sentiment, the Twitter API includes semantic logic to categorize tweets as
positive or negative based on key terms used in the tweet content. You can filter the results of a search
based on sentiment by including an emoticon in the search string. For example, the following request returns up
to 100 tweets that mention @BlueYonderAirlines and are positive in sentiment:
http://search.twitter.com/search.rss?q=@BlueYonderAirlines :)&rpp=100
Similarly, the following request returns tweets that are negative in sentiment:
http://search.twitter.com/search.rss?q=@BlueYonderAirlines :(&rpp=100
In addition to the REST API, the Twitter Streaming API offers a way to consume an ongoing real-time stream of
tweets based on filtering criteria. Streaming requests are of the form:
https://stream.twitter.com/1.1/statuses/filter.json?track=@BlueYonderAirlines

You will see how the streaming API can be used in the section Capturing Real-Time Data from Twitter with
StreamInsight of this appendix.

For more information about Twitter APIs, see the developer documentation on the Twitter web site.

Using Interactive Data Ingestion
During the initial stage of the project, the team had decided to create a proof of concept with a small volume of
data. They obtained this interactively in Excel by sending requests to the search feature of Twitter, and manually
uploaded the results to HDInsight through the interactive console.
Interactive Data Collection Using Excel
To use Excel to extract a collection of tweets from Twitter the team used the Get External Data option on the
Data tab of the ribbon, selecting From Web in the drop-down list. The New Web Query dialog provides a textbox
where they specified a URL that corresponds to a REST-based search request, as shown in Figure 1.


Figure 1
Using Excel to extract search results from Twitter


The import process resulted in a table of data with a row for each tweet returned by the query, as shown in
Figure 2.


Figure 2
Twitter data in Excel

After the data had been imported, the team at Blue Yonder Airlines removed unnecessary columns and made
some formatting changes. To simplify analysis the team decided to include only the publication date, ID, title,
and author of each tweet, and to format the publication date using the format DD-MM-YYYY. These changes are
easy to make in Excel, and the worksheet was then saved as a tab-delimited text file. This process was repeated
on a regular basis to generate sufficient data for validating the results of the initial analysis.

Interactive Data Upload Using the JavaScript Console
To upload the text files to the HDInsight cluster for analysis the team used the fs.put() JavaScript command in
the interactive console area of the HDInsight dashboard. This requires only the path and name of the files to
upload and the destination path and name in HDInsight storage, as shown in Figure 3.


Figure 3
Uploading a file to HDInsight

To view the contents of a folder the team used the #ls command in the HDInsight interactive console, as shown
in Figure 4.


Figure 4
Browsing files in the HDInsight interactive console
To view the contents of a file after it had been uploaded the team used the #cat command.
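For example, after using fs.put() to upload a file, the team could list the destination folder and display the
uploaded file with commands similar to the following. The folder and file names shown here are purely
illustrative.
JavaScript
#ls /user/hadoop/tweets
#cat /user/hadoop/tweets/tweets.txt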
Uploading Large Files
The interactive console is suitable for uploading small files, but it has an upper limit on the file size. To support
uploads of large files, and to improve the efficiency of data uploads, the team at Blue Yonder Airlines explored
using:
The Windows Azure SDK features available for Visual Studio, which provide access to blob storage in
Server Explorer.
A third party utility such as those listed in the page How to Upload Data to HDInsight on the HDInsight
website.
A custom command-line upload utility.

They chose to create a simple utility that they can run from the command line by using the following syntax:
Command Line
BlobManager.UploadConsoleClient.exe source blob_store key container [--autoclose]
The arguments for the BlobManager.UploadConsoleClient.exe utility are described in the following table:
Argument Description
source A local file or folder. If a folder is specified, all files in that folder are uploaded.
blob_store The name of the Windows Azure Blob Store to which the data is to be uploaded
key The passkey used to authenticate the connection to the Windows Azure Blob Store.
container The name of the container in the Windows Azure Blob Store to which the data should be uploaded. The
uploaded files will be placed in a folder named data within the specified container.
autoclose An optional argument. If the value --autoclose is passed, the console application closes automatically
after the upload is complete. Otherwise, a prompt to press a key is displayed.
The upload utility splits the files into 4MB blocks and uses 16 I/O threads to upload 16 blocks (64MB) in parallel.
After each block is uploaded it is marked as committed and the next block is uploaded. When all of the blocks
are uploaded they are reassembled into the original files in Windows Azure blob storage.
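
The full source code for the Blue Yonder Airlines utility is available in the Hands-on Labs (see below). The
following simplified C# sketch illustrates the general block-based approach using the Windows Azure storage
client library; unlike the utility described above it uploads the blocks sequentially and commits the block list in
a single step at the end, and the container and folder names are placeholders.
C#
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class BlockUploadSketch
{
    // Uploads a local file to Windows Azure blob storage in 4 MB blocks.
    public static void UploadInBlocks(string connectionString, string containerName, string filePath)
    {
        const int blockSize = 4 * 1024 * 1024;

        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlobContainer container =
            account.CreateCloudBlobClient().GetContainerReference(containerName);
        container.CreateIfNotExists();

        CloudBlockBlob blob =
            container.GetBlockBlobReference("data/" + Path.GetFileName(filePath));

        var blockIds = new List<string>();
        var buffer = new byte[blockSize];

        using (FileStream stream = File.OpenRead(filePath))
        {
            int bytesRead;
            int blockNumber = 0;
            while ((bytesRead = stream.Read(buffer, 0, blockSize)) > 0)
            {
                // Block IDs must be Base64-encoded strings of equal length.
                string blockId = Convert.ToBase64String(BitConverter.GetBytes(blockNumber++));
                blob.PutBlock(blockId, new MemoryStream(buffer, 0, bytesRead), null);
                blockIds.Add(blockId);
            }
        }

        // Commit the uploaded blocks so that they are reassembled into a single blob.
        blob.PutBlockList(blockIds);
    }
}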

The source code for the Blob Manager utility that Blue Yonder used is included in the associated Hands-on Labs
you can download for this guide. To download the Hands-on Labs go to
http://wag.codeplex.com/releases/view/103405. The Blob Manager code is in the Lab 4 - Enterprise BI
Integration\Labfiles\BlobUploader subfolder of the sample code.
Automating Data Ingestion with SSIS
Interactively downloading data from Twitter and uploading it to HDInsight is appropriate for initial proof of
concept testing. However, to provide a meaningful dataset the team at Blue Yonder Airlines must obtain and
upload data from Twitter regularly over a period of several weeks. If they determine that the data analysis can
provide useful and valid information, they will want to consider automating the data ingestion process.
After the initial investigation of the data it was clear that it would be useful to continue the analysis process to
provide ongoing insights into the information it contains. To accomplish this, the team created a SQL Server
Integration Services (SSIS) package that extracts the data from Twitter and uploads it to Windows Azure blob
storage automatically every day. Using SSIS makes it easy to encapsulate a workflow that obtains and uploads
the Twitter data in an SSIS package, and the package can be deployed to an SSIS catalog in SQL Server and
scheduled for automatic execution at a regular interval.

A set of reusable components for accessing Windows Azure blob storage with SSIS can be
obtained from Azure Blob Storage Components for SSIS Sample on Codeplex.

The SSIS project includes the following project-level parameters:
Parameter Description
Query The search term used to filter tweets from the Twitter search API.
LocalFolder The local file system folder in which the search results downloaded from Twitter will be stored.
BlobstoreName The name of the Windows Azure blob storage account associated with the HDInsight cluster.
BlobstoreKey The passkey required to connect to Windows Azure blob storage.
BlobstoreContainer The container in Windows Azure blob storage to which the data will be uploaded.
The project contains a package named Tweets.dtsx, which defines the control flow shown in Figure 5.

Figure 5
SSIS control flow for loading Twitter data to HDInsight

The SQL Server Integration Services Package
The control flow for the Tweets.dtsx package consists of the following tasks:
1. Empty Local Folder: a File System task that deletes any existing files in the folder defined by the
LocalFolder parameter.
2. Get Tweets: a Data Flow task that extracts the data from Twitter and saves it as a local file in the folder
defined by the LocalFolder parameter.
3. Upload Tweets to Azure Blob Store: an Execute Process task that uses the command line tool discussed
previously to upload the local file to Windows Azure blob storage.

The Get Tweets data flow task is shown in Figure 6. It consists of a Script component named Twitter Search that
implements a custom data source, and a Flat File destination named Tweet File.


Figure 6
Data flow to extract data from Twitter

The package uses the following connection managers:
Connection Manager Type Connection String
TwitterConnection HTTP http://search.twitter.com/search.rss?rpp=100
TweetFileConnection File @[$Project::LocalFolder] + @[System::ExecutionInstanceGUID] + ".txt"
(This is an expression that uses the LocalFolder project parameter and
the unique runtime execution ID to generate a filename such as
C:\data\{12345678-ABCD-FGHI-JKLM-1234567890AB}.txt)
The Twitter Search script component is configured to use the TwitterConnection HTTP connection manager,
and has the following output fields defined:
TweetDate TweetID TweetAuthor TweetText
The Script component is granted read-only access to the Query parameter, and includes the following code.
C#
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Xml;
using System.ServiceModel.Syndication;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
private SyndicationFeed tweets = null;
private XmlReader tweetReader = null;

public override void PreExecute()
{
base.PreExecute();
string today = DateTime.Today.ToString("yyyy-MM-dd");
tweetReader = XmlReader.Create(Connections.Connection.ConnectionString
+ "&q=" + Variables.Query + " since:" + today);
tweets = SyndicationFeed.Load(tweetReader);
}

public override void PostExecute()
{
base.PostExecute();
}

public override void CreateNewOutputRows()
{
if (tweets != null)
{
foreach (var tweet in tweets.Items)
{
Output0Buffer.AddRow();
Output0Buffer.TweetDate = tweet.PublishDate.ToString("d");
Output0Buffer.TweetID = tweet.Id;
Output0Buffer.TweetAuthor = tweet.Authors[0].Email;
Output0Buffer.TweetText = tweet.Title.Text;
}
Output0Buffer.SetEndOfRowset();
}
}
}
Notice that the code combines the HTTP connection string and the parameterized query term to dynamically
generate a REST URI for the Twitter search API that filters the results to include only tweets since the start of the
current day. The code also includes a reference to the System.ServiceModel.Syndication namespace, which
includes classes that make it easier to work with RSS feeds.
The PublishDate, Id, first author Email, and Title fields from each tweet in the RSS results are assigned to the
output fields and sent through the data flow to the Tweet File destination, which uses the TweetFileConnection
connection manager to write the data to a tab-delimited file in the local file system.
After the data flow task has downloaded the data from Twitter, the Execute Process task named Upload Tweets
to Azure Blob Store in the control flow uses the custom BlobManager.UploadConsoleClient.exe console
application described earlier to upload the file to Windows Azure blob storage. The Execute Process task is
configured to use the following expression for the Arguments property (the command line arguments that are
passed to the executable):
Control Flow Task
@[$Project::LocalFolder] + " " + @[$Project::BlobstoreName] + " " +
@[$Project::BlobstoreKey] + " " + @[$Project::BlobstoreContainer] + " --autoclose"

After the SSIS project is complete it is deployed it to an SSIS Catalog server, and the Tweets.dtsx package is
scheduled to run each day at a specified time with the following parameter values:
Parameter Value
Query @BlueYonderAirlines
LocalFolder c:\data
BlobstoreName BlueYonderBlobStore
BlobstoreKey 1234567890abcdefghijklmnopqrstuvwxyz&%$@1234567890abcdefghi==
BlobstoreContainer blueyonderdata
This automatically retrieves data from Twitter and uploads it to HDInsight at a scheduled time each day.
Capturing Real-Time Data from Twitter with StreamInsight
The SSIS-based solution for ingesting Twitter data into HDInsight provides a way to capture a sample of tweets
each day. However, as the analysis of Twitter data is proving useful to the business, a refinement of the project
is to capture all of the tweets related to Blue Yonder Airlines in real-time and upload them to HDInsight in
frequent batches for more detailed analysis.
To accomplish this, the team at Blue Yonder Airlines investigated the Twitter streaming API and the use of
Microsoft StreamInsight. The Twitter streaming API returns a stream of tweets in JSON format, which the
StreamInsight application must serialize as events that are captured and written to a file. The file can then be
uploaded to Windows Azure blob storage when a suitable batch of tweets has been captured.
To create the StreamInsight solution, the developers created the following classes:
Tweet. An event class that represents a tweet event:
C#
[Serializable]
public class Tweet
{
public long ID { get; set; }
public DateTime CreatedAt { get; set; }
public string Text { get; set; }
public string Source { get; set; }
public string UserLang { get; set; }
public int UserFriendsCount { get; set; }
public string UserDescription { get; set; }
public string UserTimeZone { get; set; }
public string UserURL { get; set; }
public string UserName { get; set; }
public long UserID { get; set; }
public long DeleteStatusID { get; set; }
public long DeleteUserID { get; set; }
}
TweetSource. A class that implements the IObservable<Tweet> interface, and conceptually acts as a wrapper
around the Twitter streaming API. This class contains a Start method, which connects to the Twitter
streaming API and submits a request for tweets containing the term @BlueYonderAirlines. Each time
a JSON-format tweet is read from the response stream it is deserialized and sent to all observer classes
that are registered with this observable class.
C#
[Serializable]
public class TweetSource : IObservable<Tweet>
{
private const string TrackString = "@BlueYonderAirlines";
private static List<IObserver<Tweet>> observers = new
List<IObserver<Tweet>>();
private static string userName;
private static string password;
private static object sync = new object();

public TweetSource(string user, string pass)
{
userName = user;
password = pass;
}

public IDisposable Subscribe(IObserver<Tweet> observer)
{
lock (sync)
{
if (observers.Count == 0)
{
Start();
}
observers.Add(observer);
}
return null;
}

private static void Start()
{
ThreadPool.QueueUserWorkItem(
o =>
{
WebRequest request = HttpWebRequest.Create
("https://stream.twitter.com/1.1/statuses/filter.json?track="
+ TrackString);
request.Credentials = new NetworkCredential(userName, password);
var response = request.GetResponse();

using (StreamReader streamReader = new
StreamReader(response.GetResponseStream()))
{
while (true)
{
string line = streamReader.ReadLine();
if (string.IsNullOrWhiteSpace(line))
{
continue;
}

JavaScriptSerializer serializer = new JavaScriptSerializer();
Tweet tweet = serializer.Deserialize<Tweet>(line);

foreach (var observer in observers)
{
observer.OnNext(tweet);
}
}
}
},
null);
}
}
TwitterStreamProcessor. A class that encapsulates the StreamInsight complex event processing for the
application. The Start method in this class accepts parameters for the user name and password to be
used when connecting to Twitter, and an array of delegate methods that should be registered as
observers and called each time a tweet is read from the stream. It then:
Connects to the StreamInsight server and creates a CEP application named twitterApp.
Creates a source based on the TweetSource class described previously, and deploys that to the
StreamInsight server.
Creates a list of observers, based on the IObserver interface, from the observer
delegate methods passed as parameters, and deploys them to the StreamInsight
application.
Defines a LINQ query to filter events as they flow through the StreamInsight engine, ensuring
that only tweets that are not flagged for deletion are received by the observer class.
C#
public class TwitterStreamProcessor
{
private IDisposable process;
private Server streamInsightServer;
private Application streamInsightApp;

public TwitterStreamProcessor()
{
}

public void Start(string userName, string password,
params Action<Tweet>[] observers)
{
if (observers.Length > 0)
{
this.streamInsightServer = Server.Create("StreamInsight21");
this.streamInsightApp =
this.streamInsightServer.CreateApplication("twitterApp");

var source = this.streamInsightApp.DefineObservable(() => new
TweetSource(userName, password)).ToPointStreamable(x =>
PointEvent.CreateInsert(DateTimeOffset.Now, x),
AdvanceTimeSettings.StrictlyIncreasingStartTime);

source.Deploy("twitterSource");

List<IRemoteObserver<Tweet>> remoteObservers = new
List<IRemoteObserver<Tweet>>();
for (int i = 0; i < observers.Length; i++)
{
Action<Tweet> action = observers[i];
var remoteObserver = this.streamInsightApp.DefineObserver(() =>
Observer.Create<Tweet>(action));
remoteObserver.Deploy("twitterObserver" + i);
remoteObservers.Add(remoteObserver);
}

var query =
from tweet in source
where tweet.DeleteStatusID == 0
select tweet;

IRemoteBinding binding = null;

foreach (var observer in remoteObservers)
{
if (binding == null)
{
binding = query.Bind(observer);
}
else
{
binding = binding.With(query.Bind(observer));
}
}
this.process = binding.Run("twitterProcessing");
}
}

public void Stop()
{
lock (this)
{
if (this.process != null)
{
this.process.Dispose();
this.process = null;
this.streamInsightApp.Delete();
this.streamInsightServer.Dispose();
}
}
}
}

After implementing the required classes, the Blue Yonder Airlines team created a simple Windows application
that calls the Start method of an instance of the TwitterStreamProcessor class, passing the names of functions
to be registered as observers. In this case the application uses two observers; one to write tweets to the user
interface in real-time, and another to write tweet data to a text file.
C#
this.twitterStreamProcessor.Start(
userNameTextBox.Text,
passwordTextBox.Text,
WriteTweetOnScreen,
SaveToFile);
The tweets that are captured by StreamInsight are displayed in the application user interface as shown in Figure
7, and they are also saved in a text file.

Figure 7
Capturing real-time Twitter data with StreamInsight

When a sufficient batch of tweets has been captured, the application uses the BlobManager library from the
custom application described earlier to upload the text file to Windows Azure blob storage as shown in the
following code:

C#
string conn = "DefaultEndpointsProtocol=https;AccountName="
+ txtBlobstoreName.Text + ";AccountKey=" + txtBlobstoreKey.Text;
BlobUploader u = new BlobUploader();
u.UploadBlob(conn, "BlueYonderData", "data/tweets.txt", targetFile,
BlobUploadMode.Append);


To discover how the team at Blue Yonder Airlines uses the captured data, see the Iterative Exploration
scenario in Chapter 8 of this guide, and the related example in the associated Hands-on Labs you can download
from http://wag.codeplex.com/releases/view/103405. The Visual Studio project for capturing real-time data
from Twitter is provided in the Labs\Iterative Exploration\Labfiles\TwitterStream folder of the Hands-on Labs
examples so that you can experiment with it yourself.


Appendix B: An Overview of Enterprise
Business Intelligence
Enterprise BI is a topic in itself, and a complete discussion of the design and implementation of a data
warehouse and BI solution is beyond the scope of this guide. However, you do need to understand some of its
specific features such as dimension and fact tables, slowly changing dimensions, and data cleansing if you intend
to integrate a Big Data solution such as HDInsight with your data warehouse. This appendix provides an
overview of the main factors that make an enterprise BI system different from simple data management scenarios.

Even if you are already familiar with SQL Server as a platform for data warehousing and BI you may like to
review the contents of this appendix before reading about integrating HDInsight with Enterprise BI in Chapter 3,
and the Enterprise BI Integration scenario described in this guide.

An enterprise BI solution commonly includes the following key elements:
Business data sources.
An extract, transform, and load (ETL) process.
A data warehouse or departmental data marts.
Corporate data models.
Reporting and analytics.

Reporting and analytics is discussed in depth in Chapter 5 and elsewhere in the guide, and so
it is not included in this appendix.

Data flows from business applications and other external sources through an ETL process and into a data
warehouse. Corporate data models are then used to provide shared analytical structures such as online
analytical processing (OLAP) cubes or data mining models, which can be consumed by business users through
analytical tools and reports.
Some BI implementations also enable analysis and reporting directly from the data warehouse, enabling
advanced users such as business analysts and data scientists to create their own personal analytical data models
and reports.

Figure 1 shows the architecture of a typical enterprise BI solution.


Figure 1
Overview of a typical enterprise data warehouse and BI implementation
Business Data Sources
Most of the data in an enterprise BI solution originates in business applications as the result of some kind of
business activity. For example, a retail business might use an e-commerce application to manage a product
catalog and process sales transactions. In addition, most organizations use dedicated software applications for
managing core business activities, such as a human resources and payroll system to manage employees, and a
financial accounts application to record financial information such as revenue and profit.
Most business applications use some kind of data store, often a relational database. Additionally, the
organization may implement a master data management solution to maintain the integrity of data definitions
for key business entities across multiple systems. Microsoft SQL Server 2012 provides an enterprise-scale
for relational database management, and includes Master Data Services, a solution for master data
management.
ETL
One of the core components in an enterprise BI infrastructure is the ETL solution that extracts data from the
various business data sources, transforms the data to meet structure, format, and validation requirements, and
loads the transformed data into the data warehouse. Typically, the ETL process performs an initial load of the
data warehouse tables, and thereafter performs a regularly scheduled refresh cycle to synchronize the data
warehouse with changes in the source systems. The ETL solution must therefore be able to detect changes in
source systems in order to minimize the amount of time and processing overhead required to extract data, and
intelligently insert or update rows in the data warehouse tables based on requirements to maintain historic
values for analysis and reporting.


Figure 2
An SSIS data flow
In most enterprise scenarios, the ETL process extracts data from business data sources and stores it in an interim
staging area before loading the data warehouse tables. This enables the ETL process to consolidate data from
multiple sources before the data warehouse load operation, and to restrict access to data sources and the data
warehouse to specific time periods that minimize the impact of the ETL processing overhead on normal business
activity.
SQL Server 2012 includes SQL Server Integration Services (SSIS), which provides a platform for building
sophisticated ETL processes. An SSIS solution consists of one or more packages, each of which encapsulates a
control flow for a specific part of the ETL process. Most control flows include one or more data flows, in which a
pipeline-based, in-memory sequence of tasks is used to extract, transform, and load data. Figure 2 shows a
simple SSIS data flow that extracts customer address data and loads it into a staging table.
Notice that the data flow in Figure 2 includes a task named DQS Cleansing. This reflects the fact that an
important requirement in an ETL solution is that the data being loaded into the data warehouse is as accurate
and error-free as possible so that the data used for analysis and reporting is reliable enough to base important
business decisions on.


Figure 3
A DQS knowledge base
To understand the importance of this, consider an e-commerce site in which users enter their address details in
a free-form text box. This can lead to data input errors (for example, mistyping New York as New Yok) and
inconsistencies (for example, one user might enter United States, while another enters USA, and a third enters
US). While these errors and inconsistencies may not materially affect the operations of the business application
(it is unlikely that any of these issues would prevent the delivery of an order), they may cause inaccuracies in
reports and analyses. For example, a report showing total sales by country would include three separate values
for the United States, which could be misleading when displayed as a chart.
SQL Server 2012 includes Data Quality Services (DQS). To help resolve data quality issues, business users with a
deep knowledge of the data, who are often referred to as data stewards, can create a DQS knowledge base that
defines common errors and synonyms that must be corrected to the leading values. For example, Figure 3 shows
the DQS knowledge base referenced by the data flow in Figure 2, in which New Yok, New York City, and NYC are
all corrected to the leading value New York.
The Data Warehouse
A data warehouse is a central repository for the data extracted by the ETL processes, and provides a single,
consolidated source of data for reporting and analysis. In some organizations the data warehouse is
implemented as a single database containing data for all aspects of the business, while in others the data may
be factored out into smaller departmental data marts.
Data Warehouse Schema Design
Regardless of the specific topology used, most data warehouses are based on a dimensional design in which
numeric business measures such as revenue, cost, stock units, account balances, and so on are stored in
business process-specific fact tables. For example, measures related to sales orders in the retail business
discussed previously might be stored in a fact table named FactInternetSales. The facts are stored at a specific
level of granularity, known as the grain of the fact table, that reflects the lowest level at which you need to
aggregate the data. In the example of the FactInternetSales table, you could store a record for each order, or
you could enable aggregation at a lower level of granularity by storing a record for each line item within an
order. Measures related to stock management might be stored in a FactStockKeeping table, and measures
related to manufacturing production might be stored in a FactManufacturing table.
Records representing the business entities by which you want to aggregate the facts are stored in dimension
tables. For example, records representing customers might be stored in a table named DimCustomer, and
records representing products might be stored in a table named DimProduct. Additionally, most data
warehouses include a dimension table to store time period values, for example a row for each day to which a
fact record may be related. The dimension tables are conformed across the fact tables that represent the
business processes being analyzed. That is to say, the records in a dimension table (for example, DimProduct)
represent an instance of a business entity (in this case, a product), that can be used to aggregate facts in any fact
table that involves the same business entity. For example, the same DimProduct records can be used to
represent products in the business processes represented by the FactInternetSales, FactStockKeeping, and
FactManufacturing tables.
To optimize query performance, a dimensional model uses minimal joins between tables. This approach results
in a denormalized database schema in which data redundancy is tolerated in order to reduce the number of
tables and relationships. The ideal dimensional model is a star schema in which each fact table is related directly
to its dimension tables as shown in Figure 4.

Figure 4
A star schema for a data warehouse
In some cases there may be compelling reasons to partially normalize a dimension. One reason for this is to
enable different fact tables to be related to attributes of the dimension at different levels of grain; for example,
so that sales orders can be aggregated at the product level while marketing campaigns are aggregated at the
product category level.
Another reason is to enable the same attributes to be shared across multiple dimensions; for example, storing
geographical fields such as country or region, state, and city in a separate geography dimension table that can
be used by both a customer dimension table and a sales region dimension table. This partially normalized design
for a data warehouse is known as a snowflake schema, as shown in Figure 5.

Figure 5
A snowflake schema for a data warehouse
Retaining Historical Dimension Values
Note that dimension records are uniquely identified by a key value. This key is usually a surrogate key that is
specific to the data warehouse, and not the same as the key value used in the business data source where the
dimension data originated. The original business key for the dimension records is usually retained in an alternate
key column. The main reason for generating a surrogate key for dimension records is to enable the retention of
historical versions of the dimension members.
For example, suppose a customer in New York registers with the retailer discussed previously and makes some
purchases. After a few years, the customer may move to Los Angeles and then make some more purchases. In
the e-commerce application used by the website, the customer's profile record is simply updated with the new
address. However, changing the address in the customer record in the data warehouse would cause historical
reporting of sales by city to become inaccurate. Sales that were originally made to a customer in New York
would now be counted as sales in Los Angeles.
To avoid this problem, a new record for the customer is created with a new, unique surrogate key value. The
new record has the same alternate key value as the existing record, indicating that both records relate to the
same customer, but at different points in time. To keep track of which record represents the active instance of
the customer at a specific time, the dimension table usually includes a Boolean column to indicate the current
record, and often includes start and end date columns that reflect the period to which the record relates. This is
shown in the following table.
CustKey AltKey Name City Current Start End
1001 212 Jesper Aaberg New York False 2008-01-01 2012-08-01

4123 212 Jesper Aaberg Los Angeles True 2012-08-01 NULL
This approach for retaining historical versions of dimension records is known as a slowly changing dimension
(SCD). It is important to understand that an SCD might contain multiple dimension records for the same business
entity when implementing an ETL process that loads dimension tables, and when integrating analytical data
(such as data obtained from HDInsight) with dimension records in a data warehouse.

Strictly speaking, the example shown in the table is a Type 2 slowly changing dimension. Some dimension
attributes can be changed without retaining historical versions of the records, and these are known as Type 1
changes.
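
Purely as an illustration, a Type 2 change like the one shown in the table above might be applied with statements
along the following lines. The column names follow the table above, CustKey is assumed to be an IDENTITY
column, and the approach is deliberately simplified; a production ETL process would more typically use a MERGE
statement or the SSIS Slowly Changing Dimension transformation.
Transact-SQL
-- Expire the current New York record for the customer with alternate key 212.
UPDATE DimCustomer
SET [Current] = 0, [End] = '2012-08-01'
WHERE AltKey = 212 AND [Current] = 1;

-- Add a new record for the same customer; a new surrogate key (CustKey) is generated automatically.
INSERT INTO DimCustomer (AltKey, Name, City, [Current], [Start], [End])
VALUES (212, N'Jesper Aaberg', N'Los Angeles', 1, '2012-08-01', NULL);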

Implementing a Data Warehouse with SQL Server
SQL Server 2012 Enterprise Edition includes a number of features that make it a good choice for data
warehousing. In particular, the SQL Server query engine can identify star schema joins and optimize query
execution plans accordingly. Additionally, SQL Server 2012 introduces support for columnstore indexes, which
use an in-memory compression technique to dramatically improve the performance of typical data warehouse
queries.
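For example, a columnstore index covering the commonly aggregated columns of a fact table could be created
with a statement like the following sketch; the table and column names are illustrative, based on the star schema
example described earlier.
Transact-SQL
CREATE NONCLUSTERED COLUMNSTORE INDEX csi_FactInternetSales
ON dbo.FactInternetSales (OrderDateKey, ProductKey, CustomerKey, OrderQuantity, SalesAmount);

Note that in SQL Server 2012 a table becomes read-only while it has a nonclustered columnstore index, so the
index is typically dropped and rebuilt, or partition switching is used, as part of the data warehouse load process.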
For extremely large enterprise data warehouses that need to store huge volumes of data, Microsoft has
partnered with hardware vendors to offer an enterprise data warehouse appliance. The appliance uses a special
edition of SQL Server called Parallel Data Warehouse (PDW), in which queries are distributed in parallel across
multiple processing and storage nodes in a shared nothing architecture. This technique is known as massively
parallel processing (MPP), and offers extremely high query performance and scalability.

For more information about SQL Server Parallel Data Warehouse, see Appliance: Parallel Data Warehouse (PDW)
on the SQL Server website.
Corporate Data Models
A data warehouse is designed to structure business data in a schema that makes it easy to write queries that
aggregate data, and many organizations perform analysis and reporting directly from the data warehouse tables.
However, most enterprise BI solutions include analytical data models as a layer of abstraction over the data
warehouse to improve performance of complex analytical queries, and to define additional analytical structures
such as hierarchies of dimension attributes. For example, a data model may aggregate sales by customer at the
country or region, state, city, and individual customer levels, as well as exposing key performance indicators such
as a score that compares actual sales revenue to targeted sales revenue.
These data models are usually created by BI developers and used as a data source for corporate reporting and
analytics. Analytical data models are generically known as cubes, because they represent a multidimensional
view of the data warehouse tables in which data can be sliced and diced to show aggregated measure values
across multiple dimension members.
SQL Server 2012 Analysis Services (SSAS) supports the following two data model technologies:
Multidimensional databases. Multidimensional SSAS databases support traditional OLAP cubes that are
stored on disk, and are compatible with SSAS databases created with earlier releases of SQL Server
Analysis Services. Multidimensional data models support a wide range of analytical features. A
multidimensional database can also host data mining models, which apply statistical algorithms to
datasets in order to perform predictive analysis.
Tabular databases. Tabular databases were introduced in SQL Server 2012 and use an in-memory
approach to analytical data modeling. Tabular data models are based on relationships defined between
tables, and are generally easier to create than multidimensional models while offering comparable
performance and functionality. Tabular models support most (but not all) of the capabilities of
multidimensional models. Additionally, they provide an upgrade path for personal data models
created with PowerPivot in Microsoft Excel.


When you install SSAS, you must choose between multidimensional and tabular modes. An instance of SSAS
installed in multidimensional mode cannot host tabular databases, and conversely a tabular instance cannot
host multidimensional databases. You can however install two instances of SSAS (one in each mode) on the
same server.

Figure 6 shows a multidimensional data model project in which a cube contains a Date dimension. The Date
dimension includes two hierarchies, one for fiscal periods and another for calendar periods.


Figure 6
A multidimensional data model project

Figure 7 shows a tabular data model that includes the same dimensions and hierarchies as the multidimensional
model in Figure 6.


Figure 7
A tabular data model project

This appendix has described only the key aspects of data warehousing and BI that are necessary to understand
how to integrate HDInsight into a BI solution. If you want to learn more about building BI solutions with
Microsoft SQL Server and other components of the Microsoft BI platform, consider referring to The Microsoft
Data Warehouse Toolkit (Wiley) or attending Microsoft Official Curriculum courses 10777 (Implementing a
Data Warehouse with Microsoft SQL Server 2012), 10778 (Implementing Data Models and Reports with
Microsoft SQL Server 2012), and 20467 (Designing Business Intelligence Solutions with Microsoft SQL Server
2012). You can find details of all Microsoft training courses at Training and certification for Microsoft products
and technologies on the Microsoft.com Learning website.
