This guide is based on the May 2013 release of HDInsight.
Later releases may differ from the version described in this guide.
Contents

Copyright
Preface
  What This Guide Is, and What It Is Not
  Who This Guide Is For
  Why This Guide Is Pertinent Now
  A Roadmap for This Guide
  Using the Examples for this Guide
  Who's Who
Chapter 1: What Is Big Data?
  What Is Big Data?
  What Is Microsoft HDInsight?
  Where Do I Go Next?
  Summary
  More Information
Chapter 2: Getting Started with HDInsight
  An Overview of the Big Data Process
  Determining Analytical Goals and Source Data
  Planning and Configuring the Infrastructure
  Summary
  More Information
Chapter 3: Collecting and Loading Data
  Challenges
  Common Data Load Patterns for Big Data Solutions
  Uploading Data to HDInsight
  Summary
  More Information
Chapter 4: Performing Queries and Transformations
  Challenges
  Processing the Data
  Evaluating the Results
  Tuning the Solution
  Summary
  More Information
Chapter 5: Consuming and Visualizing the Results
  Challenges
  Options for Consuming Data from HDInsight
  Consuming HDInsight Data in Microsoft Excel
  Consuming HDInsight Data in SQL Server Reporting Services
  Consuming HDInsight Data in SQL Server Analysis Services
  Consuming HDInsight Data in a SQL Server Database
  Custom and Third Party Analysis and Visualization Tools
  Summary
  More Information
Chapter 6: Debugging, Testing, and Automating HDInsight Solutions
  Challenges
  Debugging and Testing HDInsight Solutions
  Executing HDInsight Jobs Remotely
  Creating and Using Automated Workflows
  Building End-to-End Solutions on HDInsight
  Summary
  More Information
Chapter 7: Managing and Monitoring HDInsight
  Challenges
  Administering HDInsight Clusters and Jobs
  Configuring HDInsight Clusters and Jobs
  Retaining Data and Metadata for a Cluster
  Monitoring HDInsight Clusters and Jobs
  Summary
  More Information
Scenario 1: Iterative Data Exploration
  Finding Insights in Data
  Introduction to Blue Yonder Airlines
  Analytical Goals and Data Sources
  Iterative Data Processing
  Consuming the Results
  Summary
Scenario 2: Extract, Transform, Load
  Introduction to Northwind Traders
  ETL Process Goals and Data Sources
  The ETL Workflow
  Encapsulating the ETL Tasks in an Oozie Workflow
  Automating the ETL Workflow with the Microsoft .NET SDK for Hadoop
  Summary
Scenario 3: HDInsight as a Data Warehouse
  Introduction to A. Datum
  Analytical Goals and Data Sources
  Creating the Data Warehouse
  Loading Data into the Data Warehouse
  Optimizing Query Performance
  Analyzing Data from the Data Warehouse
  Summary
Scenario 4: Enterprise BI Integration
  Introduction to Adventure Works
  The Existing Enterprise BI Solution
  The Analytical Goals
  The HDInsight Solution
  Report-Level Integration
  Corporate Data Model Integration
  Data Warehouse Integration
  Summary
Appendix A: Data Ingestion Patterns using the Twitter API
  About the Twitter API
  Using Interactive Data Ingestion
  Automating Data Ingestion with SSIS
  Capturing Real-Time Data from Twitter with StreamInsight
Appendix B: An Overview of Enterprise Business Intelligence
  Business Data Sources
  ETL
  The Data Warehouse
  Corporate Data Models
Copyright

This document is provided as-is. Information and views expressed in this document, including URL and other Internet website references, may change without notice. You bear the risk of using it. Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred.

© 2013 Microsoft. All rights reserved.

Microsoft, Microsoft Dynamics, Active Directory, MSDN, SharePoint, SQL Server, Visual C#, Visual C++, Visual Basic, Visual Studio, Windows, Windows Azure, Windows Live, Windows PowerShell, Windows Server, and Windows Vista are trademarks of the Microsoft group of companies. All other trademarks are the property of their respective owners.

Preface

Do you know what visitors to your website really think about your carefully crafted content? Or, if you run a business, can you tell what your customers actually think about your products or services? Did you realize that your latest promotional campaign had the biggest effect on people aged between 40 and 50 living in Wisconsin (and, more importantly, why)?

Being able to get answers to these kinds of questions is increasingly vital in today's competitive environment, but the source data that can provide these answers is often hidden away; and when you can find it, it's very difficult to analyze successfully. It might be distributed across many different databases or files, be in a format that is hard to process, or may even have been discarded because it didn't seem useful at the time. To resolve these issues, data analysts and business managers are fast adopting techniques that were commonly at the core of data processing in the past, but have been sidelined in the rush to modern relational database systems and structured data storage.
The new buzzword is Big Data, and the associated solutions encompass a range of technologies and techniques that allow you to extract real, useful, and previously hidden information from the often very large quantities of data that previously may have been left dormant and, ultimately, thrown away because storage was too costly.

In the days before Structured Query Language (SQL) and relational databases, data was typically stored in flat files, often in simple text format with fixed-width columns. Application code would open and read one or more files sequentially, or jump to specific locations based on the known line width of a row, to read the text and parse it into columns to extract the values. Results would be written back by creating a new copy of the file, or a separate results file. Modern relational databases put an end to all this, giving us greater power and additional capabilities for extracting information simply by writing queries in a standard format such as SQL. The database system hides all the complexity of the underlying storage mechanism and the logic for assembling the query that extracts and formats the information.

However, as the volume of data that we collect continues to increase, and the native structure of this information is less clearly defined, we are moving beyond the capabilities of even enterprise-level relational database systems. Big Data solutions provide techniques for storing vast quantities of structured, semi-structured, and unstructured data that is distributed across many data stores, and querying this data where it's stored instead of moving it all across the network to be processed in one location, as is typically the case with relational databases. The huge volumes of data, often multiple terabytes or petabytes, mean that distributed queries are far more efficient than streaming all the source data across a network.
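The fixed-width processing style described above can be sketched in a few lines of Python. The layout and the sample rows here are invented purely for illustration; real systems of that era hard-coded column offsets in much the same way.

```python
# A sketch of the flat-file era described above: each row is a fixed-length
# line, and fields are extracted by slicing at known column offsets.
# The layout (name: 10 chars, city: 12 chars, amount: 8 chars) is hypothetical.
LAYOUT = [("name", 0, 10), ("city", 10, 22), ("amount", 22, 30)]

def parse_row(line):
    """Slice one fixed-width line into a dict of stripped field values."""
    return {field: line[start:end].strip() for field, start, end in LAYOUT}

rows = [
    "Ana       Madrid      00012.50",
    "Ben       Oslo        00003.75",
]
records = [parse_row(r) for r in rows]
print(records[1]["city"])  # Oslo
```

Because the offsets are fixed, any change to the record layout meant rewriting both the files and every program that read them, which is one reason relational databases displaced this approach.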
Big Data solutions also deal with the issues of data formats by allowing you to store the data in its native form, and then apply a schema to it later, when you need to query it. This means that you don't inadvertently lose any information by forcing the data into a format that may later prove to be too restrictive. It also means that you can simply store the data now, even though you don't know how, when, or even whether it will be useful, safe in the knowledge that, should the need arise in the future, you can extract any useful information it contains.

Microsoft offers a Big Data solution called HDInsight, which is currently available as an online service in Windows Azure. There is also a version of the core Hortonworks Data Platform (on which HDInsight is based) that you can install on Windows Server as an on-premises mechanism. This guide primarily focuses on HDInsight for Windows Azure, but explores more general Big Data techniques as well. For example, it provides guidance on collecting and storing the data, and designing and implementing real-time and batch queries using the Hive and Pig query languages, as well as using custom map/reduce components.

Big Data solutions can help you to discover information that you didn't know existed, complement your existing knowledge about your business and your customers, and boost competitiveness. By using the cloud as the data store and Windows Azure HDInsight as the query mechanism you benefit from very affordable storage costs (at the time of writing, 1 TB of Windows Azure storage costs only $95 per month), and the flexibility and elasticity of the pay-as-you-go model where you only pay for the resources you use.
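The "apply a schema later" idea can be illustrated with a minimal Python sketch. The raw log lines and field names below are invented for this example; in HDInsight you would more typically apply the schema with a Hive table definition over files already in storage.

```python
# Schema-on-read: store raw lines exactly as collected, and apply a schema
# only at query time. The log format and field names here are hypothetical.
raw_store = [
    "2013-05-01 GET /products 200",
    "2013-05-01 GET /missing 404",
    "2013-05-02 POST /cart 200",
]

def apply_schema(line):
    """Project a raw line onto (date, method, path, status) at read time."""
    date, method, path, status = line.split()
    return {"date": date, "method": method, "path": path, "status": int(status)}

# A "query": count requests per status code, parsing only when asked.
counts = {}
for rec in (apply_schema(line) for line in raw_store):
    counts[rec["status"]] = counts.get(rec["status"], 0) + 1
print(counts)  # {200: 2, 404: 1}
```

Because the raw lines are kept untouched, a different schema (for example, one that splits the path into segments) can be applied to the same stored data later without losing anything.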
What This Guide Is, and What It Is Not

Big Data is not a new concept. Distributed processing has been the mainstay of supercomputers and high-performance data storage clusters for a long time. What's new is standardization around a set of open source technologies that make distributed processing systems much easier to build, combined with the growing need to manage and get information from the ever-increasing volume of data that modern society generates. However, as with most new technologies, Big Data is surrounded by a great deal of hype that often gives rise to unrealistic expectations and, ultimately, disappointment.

Like all of the releases from Microsoft patterns & practices, this guide avoids the hype by concentrating on the whys and the hows. In terms of the whys, the guide explains the concepts of Big Data, gives a focused view of what you can expect to achieve with a Big Data solution, and explores the capabilities of Big Data in detail so that you can decide for yourself if it is an appropriate technology for your own scenarios. The guide does this by explaining what a Big Data solution can do, how it does it, and the types of problems it is designed to solve.

In terms of the hows, the guide continues with a detailed examination of the typical architectural models for Big Data solutions, and the ways that these models integrate with the wider data management and business information environment, so that you can quickly understand how you might apply a Big Data solution in your own environment. The guide then focuses on the technicalities of implementing a Big Data solution using Windows Azure HDInsight. It does this by helping you understand the fundamentals of the technology, when it is most useful in different scenarios, and the advantages and considerations that are relevant.
This technical good-practice guidance is enhanced by a series of examples of real-world scenarios that focus on different areas where Big Data solutions are particularly relevant and useful. For each scenario you will see the aims, the design concept, the tools, and the implementation; and you can download the example code to run and experiment with yourself.

This guide is not a technical reference manual for Big Data. It doesn't attempt to cover every nuance of implementing a Big Data solution, writing complicated code, or pushing the boundaries of what the technology is designed to do. Neither does it cover all of the myriad details of the underlying mechanism; there is a multitude of books, websites, and blogs that explain all these things. For example, you won't find in this guide a list of all of the configuration settings, the way that memory is allocated, the syntax of every type of query, or the many patterns for writing map and reduce components. What you will see is guidance that will help you understand and get started using HDInsight to build realistic solutions capable of answering real-world questions.
Who This Guide Is For

This guide is aimed at anyone who is already collecting, or is thinking of collecting, large volumes of data, and needs to analyze it to gain a better business or commercial insight into the information hidden inside. This includes business managers, data analysts, database administrators, business intelligence (BI) system architects, and developers who need to create queries against massive data repositories.
Why This Guide Is Pertinent Now

Businesses and organizations are increasingly collecting huge volumes of data that may be useful now or in the future, and they need to know how to store and query it to extract the hidden information it contains. This might be web server log files, click-through data, financial information, medical data, user feedback, or a range of social sentiment data such as tweets or comments on blog posts.

Big Data techniques and technologies such as HDInsight provide a way to efficiently store this data, analyze it to extract useful information, and export the results to tools and applications that can visualize the information in a range of ways. It is, realistically, the only way to handle the volume and the inherent unstructured nature of this data.

No matter what type of service you provide, what industry or market sector you are in, or even if you only run a blog or website forum, you are highly likely to benefit from collecting and analyzing data that is easily available and often collected automatically (such as server log files), or that can be obtained from other sources and combined with your own data. This can help you better understand your customers, your users, and your business; and help you plan for the future.
A Roadmap for This Guide This guide contains a range of information to help you understand where and how you might apply Big Data solutions. This is the map of our guide:
Chapter 1, What is Big Data?, provides an overview of the principles and benefits of Big Data solutions, and the differences between these and the more traditional database systems, to help you decide where, when, and how you might benefit from adopting a Big Data solution. This chapter also discusses Windows Azure HDInsight, and its place within the comprehensive Microsoft data platform.

Chapter 2, Getting Started with HDInsight, explains how you can implement a Big Data solution using Windows Azure HDInsight. It contains general guidance for applying Big Data techniques by exploring in more depth topics such as defining the goals, locating data sources, the typical architectural approaches, and more. You will find much of the information about the design and development lifecycle of a Big Data solution useful, even if you choose not to use HDInsight as the platform for your own solution.

Chapter 3, Collecting and Loading Data, explores the patterns and techniques for loading data into an HDInsight cluster. It describes several different approaches such as handling streaming data, automating the process, and source data compression.

Chapter 4, Performing Queries and Transformations, describes the tools you can use in HDInsight to query and manipulate data in a cluster. This includes using the interactive console, and writing scripts and commands that are executed from the command line. The chapter also includes advice on evaluating the results and maximizing query performance.

Chapter 5, Consuming and Visualizing the Results, explores the ways that you can transfer the results from HDInsight into analytical and visualization tools to generate reports and charts, and how you can export the results into existing data stores such as databases, data warehouses, and enterprise BI systems.
Chapter 6, Debugging, Testing, and Automating HDInsight Solutions, explores how, if you discover a useful and repeatable process that you want to integrate with your existing business systems, you can automate all or part of the process. The chapter looks at techniques for testing and debugging queries, unit testing your code, and using workflow tools and frameworks to automate your HDInsight solutions.

Chapter 7, Managing and Monitoring HDInsight, explores how you can manage and monitor an HDInsight solution. It includes topics such as configuring the cluster, backing up the data, remote administration, and using the logging capabilities of HDInsight.
The remaining chapters of the guide concentrate on specific scenarios for applying a Big Data solution, ranging from analysis of social media data to integration with enterprise BI systems. Each chapter explores specific techniques and solutions relevant to the scenario, and shows an implementation based on Windows Azure HDInsight. The scenarios the guide covers are:

Chapter 8, Scenario 1: Iterative Exploration, demonstrates the way that you can use HDInsight to see if there is any useful information to be found in unstructured data that you may already be collecting. This could be data that refers to your products and services, such as emails, comments, feedback, and social media posts, or even data that refers to a specific topic or market sector that you are interested in investigating. In the example you will see how a company used data from Twitter to look for interesting information about its services.

Chapter 9, Scenario 2: Extract, Transform, Load, explores the use of HDInsight as a tool for performing ETL operations. It describes how you can capture data that is not in the traditional row and column format (in this example, XML-formatted data) and apply data cleansing, validation, and transformation processes to convert the data into the form required by your database system. The scenario centers on analyzing financial data. In addition to using the information to predict future patterns and make appropriate business decisions, it also provides the capability to detect unusual events that may indicate fraudulent activity.

Chapter 10, Scenario 3: HDInsight as a Data Warehouse, describes how you can use HDInsight as both a commodity store for huge volumes of data, and as a simple data warehouse mechanism that can power your organization's applications.
The scenario uses geospatial data about extreme weather events in the US, and shows how you can design and implement a data warehouse that can store, process, enrich, and then expose this data to visualization tools.
Chapter 11, Scenario 4: Enterprise BI Integration, covers one of the most common scenarios for Big Data solutions: analyzing log files to obtain information about visitors or users, and visualizing this information in a range of different ways. This information can help you to better understand trends and traffic patterns; plan for the required capacity now and into the future; and discover which services, products, or areas of a website are underperforming. In this example you will see how a company was able to extract useful data and then integrate it with existing BI systems at different levels, and by using different tools.
Finally, the appendices contain additional information related to the scenarios described in the guide:

Appendix A, Data Ingestion Patterns using the Twitter API, demonstrates the three main patterns for loading data into HDInsight. These patterns are interactive data ingestion, automated data ingestion using SQL Server Integration Services, and capturing streaming data using Microsoft StreamInsight.

Appendix B, An Overview of Enterprise Business Intelligence, provides additional background information about enterprise data warehouses and BI that, if you are not familiar with these systems, you will find helpful when reading Chapter 2 and the BI integration scenarios described in the guide.

Using the Examples for this Guide

The scenarios described in this guide explore the techniques and technologies of HDInsight, Hadoop, and many related products and frameworks. The chapters in the guide discuss the goals for each scenario and the practical solutions to the challenges encountered, as well as providing an overview of the implementation. If you want to try out these scenarios yourself, you can use the Hands-on Labs that are available separately for this guide. These labs contain all of the source code, scripts, and other items used in all of the scenarios, together with detailed step-by-step instructions for accomplishing each task in the scenarios.

To explore these labs, download them from http://wag.codeplex.com/releases/view/103405 and follow these steps:

1. Save the download file to your hard drive in a temporary folder. The samples are packaged as a zip file.
2. Unzip the contents into your chosen destination folder. This must not be a folder in your profile. Use a folder near the root of your drive with a short name, such as C:\Labs, to avoid issues with the length of filenames.
3. See the document named Introduction and Setup in the root folder of the downloaded files for instructions on installing the prerequisite software for the labs.
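The extraction step above can also be scripted. This minimal Python sketch simply unzips the labs archive to a short destination path; the archive file name is hypothetical, while C:\Labs is the short path suggested in the steps above.

```python
# Sketch of step 2 above: unzip the Hands-on Labs archive into a folder
# near the root of the drive, avoiding long-filename issues.
# The archive name below is hypothetical; C:\Labs is the suggested destination.
import zipfile

def extract_labs(zip_path, dest=r"C:\Labs"):
    """Extract the downloaded labs zip file into a short destination path."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)

# Example (assumed download location):
# extract_labs(r"C:\Temp\HDInsightLabs.zip")
```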
To experiment with HDInsight you must have a Windows Azure subscription, and you will need a Microsoft account to sign up for one. You can then create an HDInsight cluster using the Windows Azure Management Portal. For a detailed explanation of the other prerequisites for the scenarios described in this guide, see the Setup and Installation document in the root folder of the Hands-on Labs associated with this guide.

Who's Who

This guide explores the use of Big Data solutions using Windows Azure HDInsight. A panel of experts comments on the design and development of the solutions for the scenarios discussed in the guide. The panel includes a business analyst, a data steward, a software developer, and an IT professional. The delivery of the scenario solutions can be considered from each of these points of view. The following table lists these experts.
Bharath is a systems architect who specializes in Big Data solutions and BI integration. He ensures that the solutions he designs will meet the organization's needs, work reliably and efficiently, and provide tangible benefits. He is a cautious person, for good reasons. "Designing Big Data solutions can be a challenge, but the benefits this technology offers can help you to understand more about your company, products, and customers; and help you plan for the future."
Jana is a business analyst. She is responsible for planning the organization's business intelligence systems, and managing the quality of the data held in the organization's databases and data warehouse. She uses her deep knowledge of corporate data schemas to oversee the validation of the data, and correction of errors such as removing duplication and enforcing common terminology, to ensure that it provides the correct results. "Business intelligence is a core part of all corporations today. It's vital that they provide accurate, timely, and useful information, which means we must be able to trust the data we use."
Poe is an IT professional and database administrator whose job is to plan and manage the implementation of all aspects of the infrastructure for his organization. He is responsible for implementing the global infrastructure and database systems, including both on-premises and cloud-based systems and services, so that they support the data management, analytics, and business intelligence mechanisms the organization requires. "I need to implement and manage our systems in a way that provides the best performance, security, and management capabilities possible while making sure our infrastructure, data storage systems, and applications are reliable and meet the needs of our users."
Markus is a database and application developer. He is an expert in programming for database systems and building applications, both in the cloud and in the corporate datacenter. He is analytical, detail-oriented, and methodical; and focused on the task at hand, which is building successful business solutions and analytics systems. "For the most part, a lot of what we know about database programming and software development can be applied to Big Data solutions. But there are always special considerations that are very important."
If you have a particular area of interest, look for notes provided by the specialists whose interests align with yours.

Chapter 1: What is Big Data?

This chapter discusses the general concepts of Big Data. It describes the ways that Big Data solutions, and Microsoft's HDInsight for Windows Azure, differ from traditional data management and business intelligence (BI) techniques. The chapter also sets the scene for the guide and provides pointers to the subsequent chapters that describe the implementation of Big Data solutions suited to different scenarios, and to meet a range of data analysis requirements.

What Is Big Data?

The term Big Data is being used to encompass an increasing range of technologies and techniques. In essence, Big Data is data that is valuable but, traditionally, was not practical to store or analyze due to limitations of cost or the absence of suitable mechanisms. Typically it refers to collections of datasets that, due to their size and complexity, are difficult to store, query, and manage using existing data management tools or data processing applications. You can also think of Big Data as data, often produced at fire hose rate, that you don't know how to analyze at the moment but which may prove valuable in the future. Big Data solutions aim to provide data storage and querying functionality for situations such as this. They offer a mechanism for organizations to extract meaningful, useful, and often vital information from the vast stores of data they are collecting.

Big Data is often described as a solution to the "three V's" problem:

Variety: It's not uncommon for 85% of new data not to match any existing data schema. It may also be semi-structured or unstructured data. This means that applying schemas to the data before or during storage is no longer a practical option.

Volume: Big Data solutions typically store and query hundreds of Terabytes of data, and the total volume is probably growing by ten times every five years.
Storage must be able to manage this volume, be easily expandable, and work efficiently across distributed systems.

Velocity: Data is being collected from many new types of devices, from a growing number of users, and from an increasing number of devices and applications per user. The storage mechanisms must be able to capture and save this data as quickly as it arrives.
The quintessential aspect of Big Data is not the data itself; it's the ability to discover useful information hidden in the data. It's really all about the analytics that a Big Data solution can empower.

Many people consider Big Data solutions to be a new way to do data warehousing when the volume of data exceeds the capacity or cost limitations for relational database systems. However, it can be difficult to fully grasp what Big Data solutions really involve, what hardware and software they use, and how and when they are useful. There are some basic questions, the answers to which will help you understand where Big Data solutions are useful, and how you might approach the topic in terms of implementing your own solutions. We'll work through these questions now.

Why Do I Need a Big Data Solution?

In the most simplistic terms, organizations need a Big Data solution to enable them to survive in a rapidly expanding and increasingly competitive market. For example, they need to know how their products and services are perceived in the market, what customers think of the organization, whether advertising campaigns are working, and which facets of the organization are (or are not) achieving their aims.

Organizations typically collect data that is useful for generating BI reports, and to provide input for management decisions. However, organizations are increasingly implementing mechanisms that collect other types of data, such as sentiment data (emails, comments from web site feedback mechanisms, and tweets that are related to the organization's products and services), click-through data, information from sensors in users' devices (such as location data), and website log files. Successful organizations typically measure performance by discovering the customer value that each part of their operation generates.
Big Data solutions provide a way to help you discover value, which often cannot be measured just through traditional business methods such as cost and revenue analysis. These vast repositories of data often contain useful, and even vital, information that can be used for product and service planning, coordinating advertising campaigns, improving customer service, or as an input to reporting systems. This information is also very useful for predictive analysis, such as estimating future profitability in a financial scenario, or predicting the possibility of accidents and claims for an insurance company. Big Data solutions allow you to store and extract all this information, even if you don't know when or how you will use the data at the time you are collecting it.

In addition, many organizations need to handle vast quantities of data as part of their daily operations. Examples include financial and auditing data, and medical data such as patients' notes. Processing, backing up, and querying all of this data becomes more complex and time consuming as the volume increases. Big Data solutions are designed to store vast quantities of data on distributed servers with automatic generation of replicas to guard against data loss, together with mechanisms for performing queries on the data to extract the information the organization requires.

What Problems Do Big Data Solutions Solve?

Big Data solutions were initially seen primarily as a way to resolve the limitations of traditional database systems arising from the volume, variety, and velocity associated with Big Data; they are built to store and process hundreds of Terabytes, or even Petabytes, of data. This isn't to say that relational databases have had their day. Continual development of the hardware and software for this core business function provides capabilities for storing very large amounts of data; for example, Microsoft Parallel Data Warehouse can store hundreds of Terabytes of data.
However, Big Data solutions are optimized for the storage of huge volumes of data in a way that can dramatically reduce storage cost, while still being able to generate BI and comprehensive reports.
Big Data solutions typically target scenarios where there is a huge volume of unstructured or semi-structured data that must be stored and queried to extract business intelligence. Typically, 85% of data currently stored in Big Data solutions is unstructured or semi-structured.

In terms of the variety of data, organizations often collect unstructured data, which is not in a format that suits relational database systems. Some data, such as web server logs and responses to questionnaires, may be preprocessed into the traditional row and column format. However, data such as emails, web site comments and feedback, and tweets is semi-structured or even unstructured. Deciding how to store this data using traditional database systems is problematic, and may result in loss of useful information if the data must be constricted to a specific schema when it is stored. Finally, the velocity of Big Data may make storage in an enterprise data warehouse problematic, especially where formal preparation processes such as examining, conforming, cleansing, and transforming the data must be accomplished before it is loaded into the data warehouse tables. The combination of all these factors means that a Big Data solution may be a more practical proposition than a traditional relational database system in some circumstances.

However, as Big Data solutions have continued to evolve, it has become clear that they can also be used in a fundamentally different context: to quickly get insights into data, and to provide a platform for further investigation in a way that just isn't possible with traditional data storage, management, and querying tools. Figure 1 demonstrates how you might go from a semi-intuitive guess at the kind of information that might be hidden in the data, to a process that incorporates that information into your business process.
Figure 1 Big Data solutions as an experimental data investigation platform

As an example, you may want to explore the postings by users of a social website to discover what they are saying about your company and its products or services. Using a traditional BI system would mean waiting for the database architect and administrator to update the schemas and models, cleanse and import the data, and design suitable reports. But it's unlikely that you'll know beforehand if the data is actually capable of providing any useful information, or how you might go about discovering it. By using a Big Data solution you can investigate the data by asking any questions that may seem relevant. If you find one or more that provide the information you need, you can refine the queries, automate the process, and incorporate it into your existing BI systems.

Big Data solutions aren't all about business topics such as customer sentiment or web server log file analysis. They have many diverse uses around the world and across all types of application. Police forces are using Big Data techniques to predict crime patterns, researchers are using them to explore the human genome, particle physicists are using them to search for information about the structure of matter, and astronomers are using them to plot the entire universe. Perhaps the last of these really is a big Big Data solution!

How Is a Big Data Solution Different from Traditional Database Systems?

Traditional database systems typically use a relational model where all the data is stored using predetermined schemas, and linked using the values in specific columns of each table. There are some more flexible mechanisms, such as the ability to store XML documents and binary data, but the capabilities for handling these types of data are usually quite limited. Big Data solutions do not force a schema onto the stored data.
Instead, you can store almost any type of structured, semi-structured, or unstructured data and then apply a suitable schema only when you query this data. Big Data solutions are optimized for storing vast quantities of data using simple file formats and highly distributed storage mechanisms. Each distributed node is also capable of executing parts of the queries that extract information. Whereas a traditional database system would need to collect all the data from storage and move it to a central location for processing, with the consequent limitations of processing capacity and network latency, Big Data solutions perform the initial processing of the data at each storage node. This means that the bulk of the data does not need to be moved over the network for processing (assuming that you have already loaded it into the cluster storage), and each query requires only a fraction of the central processing capacity to execute.

Relational database systems require a schema to be applied when data is written, which may mean that some information hidden in the data is lost. Big Data solutions store the data in its raw format and apply a schema only when the data is read, which preserves all of the information within the data. Figure 2 shows some of the basic differences between a relational database system and a Big Data solution in terms of storing and querying data. Notice how both relational databases and Big Data solutions use a cluster of servers; the main difference is where query processing takes place and how the data is moved across the network.
Figure 2 Some differences between relational databases and Big Data solutions Modern data warehouse systems typically use high speed fiber networks, and in-memory caching and indexes, to minimize data transfer delays. However, in a Big Data solution only the results of the distributed query processing are passed across the cluster network to the node that will assemble them into a final results set. Performance during the initial stages of the query is limited only by the speed and capacity of connectivity to the co-located disk subsystem, and this initial processing occurs in parallel across all of the cluster nodes. The servers in a cluster are typically co-located in the same datacenter and connected over a low-latency, high-bandwidth network. However, Big Data solutions can work well even without a high capacity network, and the servers can be more widely distributed, because the volume of data moved over the network is much less than in a traditional relational database cluster. The ability to work with highly distributed data and simple file formats also opens up opportunities for more efficient and more comprehensive data collection. For example, services and applications can store data in any of the predefined distributed locations without needing to preprocess it or execute queries that can absorb processing capacity. Data is simply appended to the files in the data store. Any processing required on the data is done when it is queried, without affecting the original data and risking losing valuable information. The following table summarizes the major differences between a Big Data solution and existing relational database systems.
Feature | Relational database systems | Big Data solutions
Data types and formats | Structured | Semi-structured and unstructured
Data integrity | High - transactional updates | Low - eventually consistent
Schema | Static - required on write | Dynamic - optional on read and write
Read and write pattern | Fully repeatable read/write | Write once, repeatable read
Storage volume | Gigabytes to terabytes | Terabytes, petabytes, and beyond
Scalability | Scale up with more powerful hardware | Scale out with additional servers
Data processing distribution | Limited or none | Distributed across cluster
Economics | Expensive hardware and software | Commodity hardware and open source software
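The "optional on read" schema behavior summarized in the table above can be sketched in a few lines of Python: raw text is stored untouched, and a schema is applied only at query time. This is an illustrative sketch, not real HDFS or HDInsight code; the field names and delimiter are assumptions chosen for the example.

```python
# Schema-on-read sketch: raw lines are stored as-is; a schema is applied
# only when the data is queried. Field names and the tab delimiter are
# illustrative assumptions, not part of any real storage format.

raw_store = [
    "2013-05-01\t/products/widgets\t200",
    "2013-05-01\t/checkout\t500",
    "2013-05-02\t/products/widgets\t200",
]

def read_with_schema(lines, fields, delimiter="\t"):
    """Project raw text into dictionaries using a schema chosen at read time."""
    return [dict(zip(fields, line.split(delimiter))) for line in lines]

# The same raw data can be read with different schemas for different queries,
# so no information is lost at write time.
records = read_with_schema(raw_store, ["date", "url", "status"])
errors = [r for r in records if r["status"] != "200"]
print(errors)  # the single /checkout request that returned status 500
```

Because the raw lines are never reshaped at write time, a later query could apply a completely different schema (for example, treating the whole line as one text field) without reloading the data.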
What Technologies Do Big Data Solutions Encompass?

Having talked in general terms about Big Data, it's time to be more specific about the hardware and software that it encompasses. The core of Big Data is an open source technology named Hadoop, developed by a team at the Apache Software Foundation. It is commonly understood to contain three main assets:

The Hadoop kernel library that contains the common routines and utilities.

The distributed file system for storing data files and managing clusters, usually referred to as HDFS (the Hadoop Distributed File System).

The map/reduce library that provides the framework to run queries against the data.
In addition there are many other open source components and tools that can be used with Hadoop. The Apache Hadoop website lists the following:

Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
HCatalog: A table and storage management service for data created using Apache Hadoop.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper: A high-performance coordination service for distributed applications.
We'll be exploring and using some of these components in the scenarios described in subsequent chapters of this guide. For more information about all of them, see the Apache Hadoop website. However, this guide concentrates on Microsoft's implementation of a Big Data solution, named HDInsight, and specifically on the cloud-based implementation of HDInsight in Windows Azure. You'll find a more detailed description of HDInsight later in this chapter. You will also see some general discussion of Big Data and Hadoop technologies and techniques throughout the rest of this guide.

How Does a Big Data Solution Work?

Big Data solutions follow an essentially simple process: data is stored as multiple replicated disk files on a distributed set of servers, and a multistep query runs on each server in the cluster to extract results from each one and then combine these into the final results set.

The Data Cluster

Big Data solutions use a cluster of servers to store and process the data. Each member server in the cluster is called a data node, and contains a data store and a query execution engine. The cluster is managed by a server called the name node, which has knowledge of all the cluster servers and the files stored on each one. The name node server does not store any data, but is responsible for allocating data to the other cluster members and keeping track of the state of each one by listening for heartbeat messages. To store incoming data, the name node server directs the client to the appropriate data node server. The name node also manages replication of data files across all the other cluster members, which communicate with each other to replicate the data. The data is divided into chunks, and three copies of each data file are stored across the cluster servers in order to provide resilience against failure and data loss (the chunk size and the number of replicated copies are configurable for the cluster).
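The chunking and replication just described can be illustrated with a small Python sketch. This is a simplification, not real name node logic: the tiny chunk size, the node names, and the round-robin placement are all assumptions made for the example, whereas a real cluster also considers rack awareness and node health.

```python
# Illustrative sketch of chunking and replica placement as described above.
# Chunk size and replication factor are configurable in real clusters; the
# round-robin placement here is a simplification of real name node behavior.

CHUNK_SIZE = 8          # bytes, tiny for illustration (real chunks are MBs)
REPLICAS = 3
data_nodes = ["node1", "node2", "node3", "node4"]

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Divide the incoming data into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_replicas(chunks, nodes, replicas=REPLICAS):
    """Assign each chunk to `replicas` distinct nodes (round-robin placement)."""
    placement = {}
    for i, _chunk in enumerate(chunks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement

chunks = split_into_chunks(b"some incoming file content to store")
placement = place_replicas(chunks, data_nodes)
for index, nodes in placement.items():
    print(index, nodes)  # each chunk is held by three distinct data nodes
```

The point of the sketch is that losing any single data node still leaves two copies of every chunk, which is why the name node (which holds the placement map) is the component whose loss is catastrophic.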
It's vital to maintain a backup of the name node server in a cluster. Without it, all of the cluster data may be lost. Windows Azure HDInsight stores the data in blob storage to protect against this kind of disaster.

In some implementations there is only a single name node server, which results in a single point of failure for the entire cluster. If the name node server fails, all knowledge of the location of the cluster servers and data may be lost. To avoid this, Big Data implementations usually contain a replication or backup feature for the name node server (often referred to as the secondary name node), which allows the cluster to survive a failure.

The Data Store

The data store running on each server in a cluster is a suitable distributed storage service such as HDFS or a compatible equivalent; Windows Azure HDInsight provides a compatible service over Windows Azure blob storage. A core principle of Big Data implementations is that data is co-located on the same server (or in the same datacenter) as the query execution engine that will process it, in order to minimize bandwidth use and maximize query performance. Windows Azure HDInsight (and some other virtualized environments) does not completely follow this principle because the data is stored in the datacenter storage cluster that is co-located with the virtualized servers.

The data store in a Big Data implementation is usually referred to as a NoSQL store, although this is not technically accurate because some implementations do support a SQL-like query language. In fact, some people prefer to use the term "Not Only SQL" for just this reason. There are more than 120 NoSQL data store implementations available at the time of writing, but they can be divided into the following four basic categories:

Key/value stores. These are data stores that hold data as a series of key/value pairs. The value may be a single data item or a complex data structure.
There is no fixed schema for the data, and so these types of data store are ideal for unstructured data. An example of a key/value store is Windows Azure table storage, where each row has a key and a property bag containing one or more values. Key/value stores can be persistent or volatile.

Document stores. These are data stores optimized to store structured, semi-structured, and unstructured data items such as JSON objects, XML documents, and binary data. An example of a document store is Windows Azure blob storage, where each item is identified by a blob name within a virtual folder structure.

Wide column or column family data stores. These are data stores that do use a schema, but the schema can contain families of columns rather than actual columns. They are ideally suited to storing semi-structured data, where some columns can be predefined but others are capable of storing differing elements of unstructured data. HBase running on HDFS is an example.

Graph data stores. These are data stores that hold the relationships between objects. They are less common than the other types of data store, and tend to have only specialist uses.
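As a concrete illustration of the first category above, the following Python sketch models a key/value store where each key maps to a schema-free property bag, much like the table storage description. This is a minimal in-memory illustration only, not the Windows Azure table storage API; the class and method names are invented for the example.

```python
# Minimal key/value store sketch matching the description above: each row
# has a key and a "property bag" of one or more values, with no fixed
# schema. This is an illustration, not the Windows Azure table storage API.

class KeyValueStore:
    def __init__(self):
        self._rows = {}

    def put(self, key, **properties):
        # Rows need not share a schema; each key maps to its own property bag.
        self._rows[key] = dict(properties)

    def get(self, key):
        return self._rows.get(key)

store = KeyValueStore()
store.put("user:1", name="Ana", city="Seattle")
store.put("event:42", type="click", url="/home", ms=137)  # a different shape

print(store.get("user:1")["city"])  # Seattle
```

Notice that the two rows carry entirely different properties; nothing about the store forces them into a common schema, which is exactly what makes this category suitable for unstructured data.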
NoSQL storage is typically much cheaper than relational storage, and usually supports a write once capability that allows data only to be appended. To update data in these stores you must drop and recreate the relevant file. This limitation maximizes throughput; Big Data storage implementations are usually measured by throughput rather than capacity because this is usually the most significant factor for both storage and query efficiency. Modern data handling techniques, such as Command Query Responsibility Segregation (CQRS) and other patterns, do not encourage updates to data. Instead, new data is added and milestone records are used to fix the current state of the data at intervals. This approach provides better performance and maintains the history of changes to the data. For more information about CQRS see the patterns & practices guide CQRS Journey.

The Query Mechanism

Big Data queries are based on a distributed processing mechanism called MapReduce that provides optimum performance across the servers in a cluster. Figure 3 shows a high-level overview of an example of this process that sums the values of the properties of each data item that has a key value of A.
Figure 3 A high level view of the Big Data process for storing data and extracting information
For a detailed description of the MapReduce framework, see MapReduce.org and Wikipedia. The query uses two components, usually written in Java, which implement algorithms that perform a two-stage data extraction and rollup process. The Map component runs on each data node server in the cluster, extracting data that matches the query and optionally applying some processing or transformation to the data to acquire the required result set from the files on that server. The Reduce component runs on one of the data node servers, and combines the results from all of the Map components into the final results set.

Depending on the configuration of the query job, there may be more than one Reduce task running. The output from each Map task is stored in a common buffer, sorted, and then passed to one or more Reduce tasks. Intermediate results are stored in the buffer until the final Reduce task combines them all.

Although the core Hadoop engine requires the Map and Reduce components it executes to be written in Java, you can use other techniques to create them in the background without writing Java code. For example, you can use the tools named Hive and Pig that are included in most Big Data frameworks to write queries in a SQL-like or a high-level language. You can also use the Hadoop streaming API to execute components written in other languages; see Hadoop Streaming on the Apache website for more details.

The following chapters of this guide contain more details of the way that Big Data solutions work, and guidance on using them. For example, they discuss topics such as choosing a data store, loading the data, the common techniques for querying data, and optimizing performance. The guide also describes the end-to-end process for using Big Data frameworks, and patterns such as ETL (extract-transform-load) for generating BI information.

What Are the Limitations of Big Data Solutions?
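The two-stage Map and Reduce process described above can be sketched as a single-process Python simulation, mirroring the Figure 3 example of summing the values of items whose key is "A". This only illustrates the shape of the computation; real Hadoop runs the map tasks in parallel on the data nodes and sorts the intermediate output between the stages, and the function names here are invented for the example.

```python
# A minimal, single-process sketch of the map/reduce flow described above,
# mirroring the Figure 3 example: sum the values of items with key "A".
# Real Hadoop distributes the map tasks across the data nodes; this sketch
# only illustrates the two-stage shape of the computation.

from itertools import groupby
from operator import itemgetter

# Each "data node" holds its own partition of the raw key/value data.
node1_data = [("A", 3), ("B", 7), ("A", 5)]
node2_data = [("A", 2), ("C", 1)]

def map_task(records, wanted_key):
    """Runs per node: emit only the records that match the query."""
    return [(k, v) for k, v in records if k == wanted_key]

def reduce_task(mapped):
    """Runs once: combine the sorted intermediate results into totals."""
    mapped.sort(key=itemgetter(0))   # Hadoop sorts between map and reduce
    return {k: sum(v for _, v in group)
            for k, group in groupby(mapped, key=itemgetter(0))}

intermediate = map_task(node1_data, "A") + map_task(node2_data, "A")
print(reduce_task(intermediate))     # {'A': 10}
```

The same shape applies to a Hadoop streaming job, where the mapper and reducer would instead read records from standard input and write key/value lines to standard output.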
Modern relational databases are typically optimized for fast and efficient query processing using techniques such as Structured Query Language (SQL), whereas Big Data solutions are optimized for reliable storage of vast quantities of data. The typically unstructured nature of the data, the lack of predefined schemas, and the distributed nature of the storage in Big Data solutions often preclude major optimization for query performance. Unlike SQL queries, which can use intelligent optimization techniques to maximize query performance, Big Data queries typically require an operation similar to a table scan. Therefore, Big Data queries are typically batch operations that, depending on the data volume and query complexity, may take considerable time to return a final result. However, when you consider the volumes of data that Big Data solutions can handle, the fact that queries run as multiple tasks on distributed servers does offer a level of performance that cannot be achieved by other methods. Unlike most SQL queries used with relational databases, Big Data queries are typically not executed repeatedly as part of an application's execution, and so batch operation is not a major disadvantage.
Big Data queries are batch operations that may take some time to execute. It's possible to perform real-time queries, but typically you will run the query and store the results for use within your existing BI tools and analytics systems.

Will a Big Data Solution Replace My Relational Database?

The relational database systems we use today and the more recent Big Data solutions are complementary mechanisms; Big Data is unlikely to replace the existing relational database. In fact, in the majority of cases, it complements and augments the capabilities for managing data and generating BI.
Figure 4 Combining Big Data with a relational database

For example, it's common to use a Big Data query to generate a result set that is then stored in a relational database for use in the generation of BI, or as input to another process. Big Data is also a valuable tool when you need to handle data that is arriving very quickly, and which you can process later. You can dump the data into the Big Data storage cluster in its original format, and then process it on demand using a Big Data query that extracts the required result set and stores it in a relational database, or makes it available for reporting. Figure 4 shows this approach in schematic form.

Chapter 2 of this guide explores this topic in more depth and demonstrates the integration of a Big Data solution with existing data analysis tools and enterprise BI systems. It describes the three main stages of integration, and explains the benefits of each approach in terms of maximizing the usefulness of your Big Data implementations.

Is Big Data the Right Solution for Me?

The first step in evaluating and implementing any business policy, whether it's related to computer hardware, software, replacement office furniture, or the contract for cleaning the windows, is to determine the results that you hope to achieve. Adopting a Big Data solution is no different. The result you want from a Big Data solution will typically be better information that helps you to make data-driven decisions about the organization. However, to be able to get this information, you must evaluate several factors such as:

Where will the source data come from? Perhaps you already have the data that contains the information you need, but you can't analyze it with your existing tools. Or is there a source of data you think will be useful, but you don't yet know how to collect it, store it, and analyze it?

What is the format of the data?
Is it highly structured, in which case you may be able to load it into your existing database or data warehouse and process it there? Or is it semi-structured or unstructured, in which case a Big Data solution that is optimized for textual discovery, categorization, and predictive analysis will be more suitable?

What are the delivery and quality characteristics of the data? Is there a huge volume? Does it arrive as a stream or in batches? Is it of high quality, or will you need to perform some type of data cleansing and validation of the content?

Do you want to combine the results with data from other sources? If so, do you know where this data will come from, how much it will cost if you have to purchase it, and how reliable this data is?

Do you want to integrate with an existing BI system? Will you need to load the data into an existing database or data warehouse, or will you just analyze it and visualize the results separately?
The answers to these questions will help you decide whether a Big Data solution is appropriate. As you saw earlier in this chapter, Big Data solutions are primarily suited to situations where:

You have large volumes of data to store and process.

The data is in a semi-structured or unstructured format, often as text files or binary files.

The data is not well categorized; for example, similar items are described using different terminology such as a variation in city, country, or region names, and there is no obvious key value.

The data arrives rapidly as a stream, or in large batches that cannot be processed in real time, and so must be stored efficiently for processing later as a batch operation.

The data contains a lot of redundancy or duplication.

The data cannot easily be processed into a format that suits existing database schemas without risking loss of information.

You need to execute complex batch jobs on a very large scale, so that running the jobs in parallel is necessary.

You want to be able to easily scale the system up or down on demand.

You don't actually know how the data might be useful, but you suspect that it will be, either now or in the future.
If you decide that you do need a Big Data solution, the next step is to evaluate and choose a platform. There are several to choose from, some delivered as cloud services and some that you run on your own on-premises or hosted hardware. In this guide you will see solutions implemented using Microsoft's cloud-hosted offering, Windows Azure HDInsight. Choosing a cloud-hosted Big Data platform makes it easy to scale your solution out or in without incurring the cost of new hardware or leaving existing hardware underused.

What Is Microsoft HDInsight?

Microsoft's Big Data solution, HDInsight, is based on Hadoop. HDInsight is 100% Apache Hadoop compatible, is built on open source components in conjunction with Hortonworks, and is compatible with open source community distributions. All of the components are tested in typical scenarios to ensure that they work together correctly and that there are no versioning or compatibility issues. Developments in HDInsight are fed back into the community through Hortonworks to maintain compatibility and to support the open source effort. At the time this guide was written, HDInsight was available in two forms:
- Windows Azure HDInsight. A service available to Windows Azure subscribers that uses Windows Azure clusters and integrates with Windows Azure storage. An ODBC connector is available to connect the output from HDInsight queries to data analysis tools.
- Microsoft HDInsight Developer Preview. A single-node development environment that you can install on most versions of Windows.
You can obtain a full multi-node version of Hadoop for Windows Server from Hortonworks. For more information see Microsoft + Hortonworks. This guide is based on the May 2013 release of HDInsight on Windows Azure; later releases of HDInsight may differ from the version described in this guide. For more information, see Microsoft Big Data. To sign up for the Windows Azure service, go to the HDInsight service home page. Figure 5 shows the main component parts of HDInsight.
Figure 5 The main component parts of Microsoft HDInsight

For an up-to-date list of the components included with HDInsight, and their specific versions, see What version of Hadoop is in Windows Azure HDInsight? HDInsight supports queries and MapReduce components written in C# and JavaScript, and also includes a dashboard for monitoring clusters and creating alerts. The locally installable package that you can use for developing, debugging, and testing your Big Data solutions runs on a single Windows computer and integrates with Visual Studio. You can also download the Microsoft .NET SDK For Hadoop from CodePlex or install it with NuGet. At the time of writing this contained a .NET MapReduce implementation, LINQ to Hive, PowerShell cmdlets, web clients for WebHDFS and WebHCat, and clients for Oozie and Ambari. For local development of HDInsight solutions you can download and install a single-node version of HDInsight Server that runs on most versions of Windows. This allows you to develop and test your solutions locally before you upload them to Windows Azure HDInsight. You can download the Microsoft HDInsight Developer Preview from the Microsoft Download Center.

The Data Store and Execution Platform

Big Data solutions store data as a series of files located within a familiar folder structure on disk. In Windows Azure HDInsight these files are stored in Windows Azure blob storage; in the developer preview version they are stored on the local computer's disk in a virtual store. The execution platform is common across both Windows Azure HDInsight and the developer preview version, so working with either is a similar experience. The Windows Azure HDInsight portal and the developer preview desktop provide an interactive environment in which you run tools such as Hive and Pig. You can use the Sqoop connector to import data from a SQL Server database or Windows Azure SQL Database into your cluster.
Alternatively, you can upload the data using the JavaScript interactive console, a command window, or any other tool for accessing Windows Azure blob storage. Chapter 3 of this guide provides more details of how you can upload data to an HDInsight cluster.

The Query and Analysis Tools

HDInsight contains implementations of the Hive and Pig tools. Hive allows you to overlay a schema onto the data when you need to run a query, and to use a SQL-like language for these queries. For example, you can use the CREATE TABLE command to build a table by splitting the text strings in the data using delimiters or at specific character locations, and then execute SELECT statements to extract the required data. Pig allows you to create schemas and execute queries using Java, or by writing queries in a high-level language called Pig Latin. It supports a fluent interface in the JavaScript console, which means that you can write queries such as from("input.txt", "date, product, qty, price").select("product, price").run(). HDInsight also supports writing map and reduce components that execute directly within the Hadoop core. As an alternative to using Java to create these components, you can use the Hadoop streaming interface to execute map and reduce components written in C# and F#, and you can use JavaScript in Pig queries. You can, of course, use other open source tools such as Mahout with HDInsight, even though they are not included as part of an HDInsight deployment. Mahout allows you to perform data mining queries that examine data files to extract specific types of information. For example, it supports recommendation mining (finding users' preferences from their behavior), clustering (grouping documents with similar topic content), and classification (assigning new documents to a category based on existing categorization). For more information about using HDInsight, a good place to start is the Microsoft TechNet library.
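To illustrate the map and reduce model described above: the map phase emits key/value pairs from raw input, the framework sorts and groups those pairs by key, and the reduce phase aggregates each group. The following is a minimal in-process sketch of a word-count job, written in Python purely for illustration (this guide's own examples use C#, F#, and JavaScript); the function names and the run_job driver are hypothetical stand-ins for what Hadoop performs at scale.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Reduce phase: pairs arrive sorted by key; sum the counts for each word.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

def run_job(lines):
    # Stand-in for Hadoop's shuffle-and-sort between the map and reduce phases.
    return dict(reducer(sorted(mapper(lines))))

counts = run_job(["the quick brown fox", "the lazy dog"])
# counts["the"] == 2, counts["fox"] == 1
```

In a real cluster the mapper and reducer run as separate processes on many nodes; the Hadoop streaming interface wires them together by piping records through standard input and output.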
You can see a list of articles related to HDInsight by searching the library using this URL: http://social.technet.microsoft.com/Search/en-US?query=hadoop. Blogs and forums for HDInsight include http://prologika.com/CS/blogs/blog/archive/tags/Hadoop/default.aspx and http://social.msdn.microsoft.com/Forums/en-US/hdinsight/threads. The remaining chapters of this guide provide more details of how you can perform queries and data analysis.

HDInsight and the Microsoft Data Platform

It may appear from what you've seen so far that HDInsight is a separate, stand-alone system for storing and analyzing large volumes of data. However, this is definitely not the case: it is a significant component of the Microsoft data platform, and part of the overall data acquisition, management, and visualization strategy.
Figure 6 The Role of HDInsight as part of the Microsoft Data Platform

Figure 6 shows an overview of the Microsoft data platform, and the role HDInsight plays within it. This figure does not include all of Microsoft's data-related products, and it doesn't attempt to show physical data flows. Instead, it illustrates the applications, services, tools, and frameworks that work together to allow you to capture data, store it, and visualize the information it contains. As an example, consider the case where you simply want to load some unstructured data into Windows Azure HDInsight, combine it with data from external sources such as the Windows Azure DataMarket, and then analyze and visualize the results using Microsoft Excel and Power View. Figure 7 shows this approach, and how it maps to elements of the Microsoft data platform.
Figure 7 Simple data querying, analysis, and visualization using Windows Azure HDInsight

However, you may decide that you want to implement deeper integration of the data you store and query in HDInsight with your enterprise BI system. Figure 8 shows an example where streaming data collected from device sensors is fed through StreamInsight for categorization and filtering, and then deposited in a Windows Azure HDInsight cluster. The output from queries that run as periodic batch jobs in HDInsight is integrated at the corporate data model level with a data warehouse, and ultimately delivered to users through SharePoint libraries and web parts, making it available for use in reports and in data analysis and visualization tools such as Microsoft Excel.
Figure 8 Integrating HDInsight with corporate models in an enterprise BI system

Alternatively, you may want to integrate HDInsight as a business data source for an existing data warehouse system, perhaps to produce a set of specific management reports on a regular basis. Figure 9 shows how you can take semi-structured or unstructured data and query it with HDInsight, validate and cleanse the results using Data Quality Services, and store them in your data warehouse tables ready for use in reports.
Figure 9 Querying, validating, cleansing, and storing data for use in reports

These are just three examples of the countless permutations and capabilities of the Microsoft data platform and HDInsight. Your own requirements will differ, but the combination of services and tools makes it possible to implement almost any kind of Big Data solution using the elements of the platform. In the following chapters of this guide you will see many examples of the way these tools and services work together, and you will learn about the technical implementation of a Big Data solution using HDInsight, including options for storing the data, the tools for performing queries, and the technicalities of connecting the data to BI systems and visualization tools.

Where Do I Go Next?

The remainder of this guide explores Big Data and HDInsight from several different perspectives. You may wish to read the chapters in order to see how to design, implement, manage, and automate a Big Data solution. However, if you have a particular area of interest, you may prefer to jump directly to the appropriate chapter. Use the following list of topic areas to help you decide.

- Architecture and design of a Big Data solution, and an overview of the development life cycle: Chapter 2, Getting Started with HDInsight
- Patterns and implementations for loading data into a Big Data solution, including techniques for data compression: Chapter 3, Collecting and Loading Data
- Using the query tools, map/reduce components, and user-defined functions; evaluating the results; and performance tuning your queries: Chapter 4, Performing Queries and Transformations
- Loading and using the results of an HDInsight process in visualization and analysis tools, data stores, and BI systems: Chapter 5, Consuming and Visualizing the Results
- Debugging and testing Big Data queries, and using tools and frameworks to automate execution of HDInsight processes: Chapter 6, Debugging, Testing, and Automating HDInsight Solutions
- Using the Hadoop and HDInsight monitoring tools, and managing and automating HDInsight administrative tasks: Chapter 7, Managing and Monitoring HDInsight
- Using HDInsight as a tool to explore data and find previously unknown information and insights: Chapter 8, Scenario 1: Iterative Data Exploration
- Using HDInsight to perform Extract, Transform, and Load operations to validate and cleanse incoming data: Chapter 9, Scenario 2: Extract, Transform, Load
- Using HDInsight as a commodity data store or as an online data warehouse to minimize on-premises costs: Chapter 10, Scenario 3: HDInsight as a Data Warehouse
- Integrating an HDInsight solution with an enterprise BI system at each of the three typical integration levels: Chapter 11, Scenario 4: Enterprise BI Integration
- Implementing the three common patterns for data ingestion in a Big Data solution: Appendix A, Data Ingestion Patterns using the Twitter API
- Overview of the structure and usage of an enterprise BI system, as background to the scenarios in the guide: Appendix B, Enterprise BI Integration

Summary

In this introductory chapter you have discovered what a Big Data solution is, why you may want to implement one, and how it differs from the use of traditional relational databases and business intelligence tools. You have seen an overview of how Big Data solutions work, and a brief description of the different frameworks and platforms you can choose from. This guide focuses on Microsoft's cloud-based Big Data implementation, Windows Azure HDInsight. HDInsight uses a 100% compatible open source implementation of the core Hadoop library and tools, and provides additional capabilities for writing queries in a range of languages. Finally, the chapter introduced some of the typical scenarios for using a Big Data solution, and showed how Big Data solutions that use HDInsight fit into the Microsoft data platform.
The scenarios are covered in detail in subsequent chapters of the guide. Before then, however, the following chapter shows how to get started using Windows Azure HDInsight.

More Information

The official site for Apache Big Data solutions and tools is the Apache Hadoop website. For a detailed description of the MapReduce framework and programming model, see MapReduce.org and Wikipedia. For an overview and description of Microsoft HDInsight see Microsoft Big Data. To sign up for the Windows Azure service, go to HDInsight Service. You can download the HDInsight developer preview from the Microsoft Download Center. The Microsoft .NET SDK For Hadoop is available from CodePlex, or you can install it with NuGet. The Microsoft TechNet library contains articles related to HDInsight; search for these using the URL http://social.technet.microsoft.com/Search/en-US?query=hadoop. The official support forum for HDInsight is at http://social.msdn.microsoft.com/Forums/en-US/hdinsight/threads. There are also several popular blogs that cover Big Data and HDInsight topics:
http://prologika.com/CS/blogs/blog/archive/tags/Hadoop/default.aspx
http://dennyglee.com/
http://blogs.msdn.com/b/hpctrekker/
Chapter 2: Getting Started with HDInsight

In the previous chapter you learned how Big Data solutions such as Microsoft HDInsight can help you discover vital information that may otherwise have remained hidden in your data, or even been lost forever. This information can help you to evaluate your organization's historical performance, discover new opportunities, identify operational efficiencies, increase customer satisfaction, and even predict likely outcomes for the future. It's not surprising that Big Data is generating so much interest and excitement in what some may see as the rather boring world of data management! In this chapter the focus moves towards the practicalities of implementing a Big Data solution. This means you need to think about what you hope to achieve from the implementation, even if your aim is just to explore the kinds of information you might be able to extract from available data, and how HDInsight fits into your existing business infrastructure. It may be that you just want to use it alongside your existing business intelligence (BI) systems, or you may want to integrate it deeply with those systems. The important point is that, irrespective of how you choose to use it, the end result is the same: some kind of analysis of the source data and meaningful visualization of the results. This chapter begins with an overview of the main stages of implementing a Big Data solution. It then explores the decisions you should make before you start, such as identifying your goals and the source data you might use, before moving on to describe the basic architectural approaches you might choose to implement.
Many organizations already use data to improve decision making through existing BI solutions that analyze data generated by business activities and applications, and create reports based on this analysis. Rather than seeking to replace traditional BI solutions, HDInsight provides a way to extend the value of your investment in BI by enabling you to incorporate a much wider variety of data sources that complement and integrate with existing data warehouse, analytical data model, and business reporting solutions.

An Overview of the Big Data Process

Big Data implementation typically involves a common series of stages, irrespective of the type of data and the ultimate aims for obtaining information from that data. The stages are:

- Determine the analytical goals and source data. Before you start any data analysis project, it is useful to be clear about what you hope to achieve. You may have a specific question that you need to answer in order to make a critical business decision, in which case you must identify data that may help you determine the answer, where it can be obtained from, and whether there are any costs associated with procuring it. Alternatively, you may already have some data that you want to explore to try to discern useful trends and patterns. Either way, understanding your goals will help you design and implement a solution that best supports them.
- Plan and configure the infrastructure. This involves setting up the Big Data software you have decided to use, or subscribing to an online service such as Windows Azure HDInsight. You may also consider integrating your solution with an existing database or BI infrastructure.
- Obtain the data and submit it to the cluster. During this stage you decide how you will collect the data you have identified as the source, and how you will get it into your Big Data solution for processing. Often you will store the data in its raw format to avoid losing any useful contextual information it contains, though you may choose to do some pre-processing before storing it to remove duplication or to simplify it in some other way.
- Process the data. After you have started to collect and store the data, the next stage is to develop the processing solutions you will use to extract the information you need. While you can usually use Hive and Pig queries for even quite complex data extraction, you will occasionally need to create map/reduce components to perform more complex queries against the data.
- Evaluate the results. Probably the most important step of all is to ensure that you are getting the results you expected, and that these results make sense. Complex queries can be hard to write, and difficult to get right the first time. It's easy to make assumptions or miss edge cases that can skew the results quite considerably. Of course, it may be that you don't know what the expected result actually is (after all, the whole point of Big Data is to discover hidden information from the data), but you should make every effort to validate the query results before making business decisions based on them. In many cases, a business user who is familiar enough with the business context can perform the role of a data steward and review the results to verify that they are meaningful, accurate, and useful.
- Tune the solution. At this stage, if the solution you have created is working correctly and the results are valuable, you should decide whether you will repeat it in the future, perhaps with new data you collect over time. If so, you should tune the solution by reviewing the log files it creates, the processing techniques you use, and the implementation of the queries to ensure that they are executing in the most efficient way. It's possible to fine-tune Big Data solutions to improve performance, reduce network load, and minimize processing time by adjusting some parameters of the query and the execution platform, or by compressing the data that is transferred over the network.
- Visualize and analyze the results. Once you are satisfied that the solution is working correctly and efficiently, you can plan and implement the analysis and visualization approach you require. This may mean loading the data directly into an application such as Microsoft Excel, or exporting it into a database or enterprise BI system for further analysis, reporting, charting, and more.
- Automate the solution. At this point it will be clear whether the solution should become part of your organization's business management infrastructure, complementing the other sources of information that you use to plan and monitor business performance and strategy. If this is the case, you should consider how you might automate some or all of the solution to provide predictable behavior, and perhaps so that it is executed on a schedule.
- Administer the solution. Any business process on which your business depends should be monitored for errors and failures, and a Big Data solution is no different. After a solution becomes an integral part of your organization's business management infrastructure, you should put in place systems and procedures that allow administrators to easily manage and monitor it.
It's easy to run queries that extract data, but it's vitally important that you make every effort to validate the results before using them as the basis for business decisions. If possible you should try to cross-reference the results with other sources of similar information.

Figure 1 The stages of Big Data solution development, and how they map to the chapters of this guide

Figure 1 shows how these development stages map to the chapters of this guide. Note that, in many ways, data analysis is an iterative process, and you should take this approach when building a Big Data solution. In particular, given the large volumes of data and correspondingly long processing times typically involved in Big Data analysis, it can be useful to start by implementing a proof of concept iteration in which a small subset of the targeted data is used to validate the processing steps and results before proceeding with a full analysis. This enables you to test your Big Data processing design on a small HDInsight cluster or a single-node on-premises solution before scaling out to accommodate production-level data volumes.

Integrating a Big Data Process with Your Existing Systems

If you find that the Big Data process you have implemented does produce a meaningful, valid, and useful result, the next stage is to consider whether you want to integrate it with your existing data management and reporting systems. Unless you intend to carry out the process manually each time, integration typically means automating some or all of the data loading, querying, and export stages. There are many tools, frameworks, and other techniques you can use to accomplish this. While this guide cannot cover every eventuality, you will find more information in Chapter 6 to help you achieve this. In addition, after a Big Data solution becomes an integral part of your internal information systems, you must consider how you will monitor and manage it to ensure that it continues to work correctly and efficiently. For example, you must consider how you will detect failures, and how you will ensure that the solution is secure and reliable. Chapter 7 of this guide explores these concepts.

Determining Analytical Goals and Source Data

Before embarking on a Big Data project it is generally useful to think about what you hope to achieve, and to clearly define the analytical goals of the project. In some projects there may be a specific question that the business wants to answer, such as "where should we open our new store?" In other projects the goal may be more open-ended; for example, to examine website traffic and try to detect patterns and trends in visitor numbers. Understanding the goals of the analysis can help you make decisions about the design and implementation of the solution, including the specific technologies to use and the level of integration with existing BI infrastructure.

Historical and Predictive Data Analysis

Most organizations already collect data from multiple sources. These might include line-of-business applications such as websites, accounting systems, and office productivity applications. Other data may come from interaction with customers, such as sales transactions, feedback, and reports from company sales staff. The data is typically held in one or more databases, and is often consolidated into a specially designed data warehouse system. Having collected this data, organizations typically use it to perform the following kinds of analysis and reporting.
- Historical analysis and reporting, which is concerned with summarizing data to make sense of what happened in the past. For example, a business might summarize sales transactions by fiscal quarter and sales region, and use the results to create a report for shareholders. Additionally, business analysts within the organization might explore the aggregated data by drilling down into individual months to determine periods of high and low sales revenue, or drilling down into cities to find out if there are marked differences in sales volumes across geographic locations. The results of this analysis can help to inform business decisions, such as when to conduct sales promotions or where to open a new store.
- Predictive analysis and reporting, which is concerned with detecting data patterns and trends to determine what's likely to happen in the future. For example, a business might take statistics from historical sales data and apply them to known customer profile information to predict which customers are most likely to respond to a direct-mail campaign, or which products a particular customer is likely to want to purchase. This analysis can help improve the cost-effectiveness of a direct-mail campaign, or increase sales while building closer customer relationships through relevant targeted recommendations.
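To make the historical case concrete, the essence of such a model is collapsing many detail rows into a few aggregated business measures. The following minimal Python sketch illustrates this (the field names and revenue figures are invented for illustration; a real solution would perform the same aggregation with a Hive query or a BI tool over far more data):

```python
from collections import defaultdict

# Hypothetical raw sales transactions: (region, fiscal_quarter, revenue).
transactions = [
    ("North", "Q1", 1200.0),
    ("North", "Q1", 800.0),
    ("North", "Q2", 950.0),
    ("South", "Q1", 400.0),
    ("South", "Q2", 1650.0),
]

def summarize(rows):
    # Aggregate revenue by (region, quarter): the core of historical
    # analysis, turning many detail rows into a few business measures.
    totals = defaultdict(float)
    for region, quarter, revenue in rows:
        totals[(region, quarter)] += revenue
    return dict(totals)

summary = summarize(transactions)
# summary[("North", "Q1")] == 2000.0
```

A predictive model starts from the same kind of source data but instead fits a statistical algorithm to it, so its output is a classification or forecast rather than an aggregate.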
Both kinds of analysis and reporting involve taking source data, applying an analytical model to it, and using the output to inform business decision making. In the case of historical analysis and reporting, the model is usually designed to summarize and aggregate a large volume of data to determine meaningful business measures (for example, total sales revenue aggregated by various aspects of the business, such as fiscal period and sales region). For predictive analysis, the model is usually based on a statistical algorithm that categorizes clusters of similar data, or correlates data attributes that influence one another to cause trends (for example, classifying customers based on demographic attributes, or identifying a relationship between customer age and the purchase of specific products). Databases are the core of most organizations' data processing, and in most cases their purpose is simply to run the operation by, for example, storing and manipulating data to manage stock and create invoices. However, analytics and reporting is one of the fastest growing sectors in business IT as managers strive to learn more about their business.

Analytical Goals

Although every project has its own specific requirements, Big Data projects generally fall into one of the following categories:

- One-time analysis for a specific business decision. For example, a company planning to expand by opening a new physical store might use Big Data techniques to analyze demographic data for a shortlist of proposed store sites to determine the location that is likely to result in the highest revenue for the store. Alternatively, a charity planning to build water supply infrastructure in a drought-stricken area might use a combination of geographic, geological, health, and demographic statistics to identify the best locations to target.
- Open "blue sky" exploration of interesting data.
Sometimes the goal of Big Data analysis is simply to find out what you don't already know from the available data. For example, a business might be aware that customers are using Twitter to discuss its products and services, and want to explore the tweets to determine whether any patterns or trends can be found relating to brand visibility or customer sentiment. There may be no specific business decision that needs to be made based on the data, but gaining a better understanding of how customers perceive the business might inform business decision making in the future.
- Ongoing reporting and BI. In some cases the Big Data solution will be used to support ongoing reporting and analytics, either in isolation or integrated with an existing enterprise BI solution. For example, a real estate business that already has a BI solution enabling analysis and reporting of its own property transactions across time periods, property types, and locations might extend that solution to include demographic and population statistics from external sources. Integration patterns for HDInsight and enterprise BI solutions are discussed later in this chapter.
In many respects, data analysis is an iterative process. It is not uncommon for an initial project based on open exploration of data to uncover trends or patterns that form the basis for a new project to support a specific business decision, or to extend an existing BI solution.
The results of the analysis are typically consumed and visualized in the following ways:

- Custom application interfaces. For example, a custom application might display the data as a chart or generate a set of product recommendations for a customer.
- Business performance dashboards. For example, you could use the PerformancePoint Services component of SharePoint Server to display key performance indicators (KPIs) as scorecards and summarized business metrics in a SharePoint Server site.
- Reporting solutions such as SQL Server Reporting Services or Windows Azure SQL Reporting. For example, business reports can be generated in a variety of formats and distributed automatically by email or viewed on demand through a web browser.
- Analytical tools such as Microsoft Excel. Information workers can explore analytical data models through PivotTables and charts, and business analysts can use advanced Excel capabilities such as PowerPivot and Power View to create their own personal data models and visualizations, or use add-ins to apply predictive models to data and view the results in Excel.
You will see more about the tools for analytics and reporting later in this chapter. Before then, you will explore different ways of combining a Big Data mechanism such as HDInsight with your existing business processes and data.

Source Data

In addition to determining the analytical goals of the project, you must identify sources of data that can be used to meet those goals. Often you will already know which data sources need to be included in the analysis; for example, if the goal is to analyze trends in sales for the past three years you can use historic sales data from internal business applications or a data warehouse. However, in some cases you may need to search for data to support the analysis you want to perform; for example, if the goal is to determine the best location in which to open a new store, you may need to search for useful demographic data that covers the locations under consideration. It is common in Big Data projects to combine data from multiple sources to create a mash up that enables you to analyze many different aspects of the problem in a single solution. For example, you might combine internal historic sales data with geographic data obtained from an external source to plot sales volumes on a map. You may then overlay the map with demographic data to try to correlate sales volume with particular geodemographic attributes.
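The essence of such a mash up is joining datasets on a shared key and deriving measures that neither source holds alone. As a minimal sketch (every name and figure below is invented for illustration; real solutions would join far larger datasets with Hive or a BI tool), internal sales volumes keyed by city can be combined with externally obtained demographic attributes:

```python
# Hypothetical internal sales volumes, keyed by city.
sales_by_city = {"Seattle": 5400, "Portland": 2100, "Boise": 900}

# Hypothetical external demographic data for the same cities.
demographics = {
    "Seattle": {"population": 735000, "median_age": 35},
    "Portland": {"population": 650000, "median_age": 38},
    "Boise": {"population": 235000, "median_age": 36},
}

def mash_up(sales, demo):
    # Join the two sources on the shared city key, deriving a new
    # measure (sales per 1,000 residents) from the combined data.
    combined = {}
    for city in sales.keys() & demo.keys():
        combined[city] = {
            "sales": sales[city],
            "sales_per_1000": round(sales[city] / demo[city]["population"] * 1000, 2),
            "median_age": demo[city]["median_age"],
        }
    return combined

result = mash_up(sales_by_city, demographics)
# result["Seattle"]["sales_per_1000"] == 7.35
```

The derived measure can then be plotted against other demographic attributes to look for the kind of geodemographic correlation described above.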
Common types of data source used in a Big Data solution include: Internal business data from existing applications or BI solutions. Often this data is historic in nature or includes demographic profile information that the business gathered from its customers. For example, you might use historic sales records to correlate customer attributes with purchasing patterns, and then use this information to support targeted advertising or predictive modeling of future product plans. Log files. Applications or infrastructure services often generate log data that can be useful for analysis and decision making with regard to managing IT reliability and scalability. Additionally, in some cases, combining log data with business data can reveal useful insights into how IT services support the business. For example, you might use log files generated by Internet Information Services (IIS) to assess network bandwidth utilization, or to correlate web site traffic with sales transactions in an ecommerce application. Sensors. Increased automation in almost every aspect of life has led to a growth in the amount of data recorded by electronic sensors. For example, RFID tags in smart cards are now routinely used to track passenger progress through mass transit infrastructure, and sensors in plant machinery generate huge quantities of data in production lines. This availability of data from sensors enables highly dynamic analysis and real-time reporting. Social media. The massive popularity of social media services such as Twitter and others is a major factor in the growth of data volumes on the Internet. Many social media services provide application programming interfaces (APIs) that you can use to query the huge amount of data shared by users of these services, and consume that data for analysis. 
For example, a business might use Twitter's query API to find tweets that mention the name of the company or its products, and analyze the data to determine how customers feel about the company's brand.

- Data feeds. Many web sites and services provide data as a feed that can be consumed by client applications and analytical solutions. Common feed formats include RSS, ATOM, and industry-defined XML formats; and the data sources themselves include blogs, news services, weather forecasts, and financial markets data.
- Government and special interest groups. Many government organizations and special interest groups publish data that can be used for analysis. For example, the UK government publishes over 9,000 downloadable datasets in a variety of formats, including statistics on population, crime, government spending, health, and other factors of life in the UK. Similarly, the US government provides census data and other statistics as downloadable datasets or in dBASE format on CD-ROM. Additionally, many international organizations provide data free of charge; for example, the United Nations makes statistical data available through its own website and in the Windows Azure DataMarket.
- Commercial data providers. There are many organizations that sell data commercially, including geographical data, historical weather data, economic indicators, and others. The Windows Azure DataMarket provides a central service through which you can locate and purchase subscriptions to these data sources.
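To illustrate how straightforward consuming the data feeds described above can be, the following Python sketch parses a small RSS 2.0 document into title/date pairs ready for ingestion. The feed content is invented for illustration; a real solution would first download the feed over HTTP.

```python
import xml.etree.ElementTree as ET

# A hypothetical RSS 2.0 snippet standing in for a real downloaded feed.
RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>Storm warning issued</title><pubDate>Wed, 01 May 2013 09:00:00 GMT</pubDate></item>
    <item><title>Markets rally on earnings</title><pubDate>Wed, 01 May 2013 11:30:00 GMT</pubDate></item>
  </channel>
</rss>"""

def parse_rss_items(rss_xml):
    """Extract (title, pubDate) pairs from an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        pub_date = item.findtext("pubDate", default="")
        items.append((title, pub_date))
    return items

print(parse_rss_items(RSS_SAMPLE))
```

Once the feed entries are in this tabular shape they can be written to a file and uploaded to the cluster for analysis alongside other data sources.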
Just because data is available doesn't mean it is useful, or that the effort of using it will be viable. Think about the value the analysis can add to your business before you devote inordinate time and effort to collecting and analyzing it.
When planning data sources to use in your Big Data solution, consider the following factors:

- Availability. How easy is it to find and obtain the data? You may have a specific analytical goal in mind, but if the data required to support the analysis is difficult (or impossible) to find you may waste valuable time trying to obtain it. When planning a Big Data project, it can be useful to define a schedule that allows sufficient time to research what data is available. If the data cannot be found after an agreed deadline you may need to revise the analytical goals.
- Format. In what format is the data available, and how can it be consumed? Some data is available in standard formats and can be downloaded over a network or Internet API. In other cases the data may be available only as a real-time stream that you must capture and structure for analysis. Later in the process you will consider tools and techniques for consuming the data from its source and ingesting it into your cluster, but even during this early stage you should identify the format and connectivity options for the data sources you want to use.
- Relevance. Is the data relevant to the analytical goals? You may have identified a potential data source and already be planning how you will consume it and ingest it into the analytical process. However, you should first examine the data source carefully and ensure that the data it contains is relevant to the analysis that needs to be performed.
- Cost. You may determine the availability of a relevant dataset, only to discover that the cost of obtaining the data outweighs the potential business benefit of using it. This can be particularly true if the analytical goal is to augment an enterprise BI solution with external data on an ongoing basis, and the external data is only available through a commercial data provider.
Planning and Configuring the Infrastructure

Having identified the data sources you need to analyze, you must design and configure the infrastructure required to support the analysis. Considerations for planning an infrastructure for Big Data analysis include deciding whether to use an on-premises or a cloud-based implementation, understanding how file storage works, and how you can integrate the solution with an existing enterprise BI infrastructure if this is a requirement. This guide concentrates on Windows Azure HDInsight. However, the considerations for choosing an infrastructure option are typically relevant irrespective of the Big Data mechanism you choose.

Choosing an Operating Location

Big Data solutions are available as both cloud-hosted services and on-premises installable solutions. For example, HDInsight is available as a Windows Azure service, but you can install the Hortonworks Hadoop distribution as an on-premises solution on Windows Server. Consider the following guidelines when choosing between these options.

A cloud-hosted mechanism such as HDInsight on Windows Azure is a good choice when:

- You require the solution to be running for only a specific period of time.
- You require the solution to be available for ongoing analysis, but the workload will vary - sometimes requiring a cluster with many nodes and sometimes not requiring any service at all.
- You want to avoid the cost in terms of capital expenditure, skills development, and the time it takes to provision, configure, and manage an on-premises solution.
An on-premises solution such as Hadoop on Windows Server is a good choice when:

- You require ongoing services with a predictable and constant level of scalability.
- You have the necessary technical capability and budget to provision, configure, and manage your own cluster.
- The data you plan to analyze must remain on your own servers for compliance or confidentiality reasons.
If you want to run a full multi-cluster Hadoop installation on Windows Server you should consider the Hadoop distribution available from Hortonworks. This is the core Hadoop implementation used by HDInsight. For single-node development and testing of Hadoop solutions on Windows you can download the Microsoft HDInsight Developer Preview and install it on most versions of Windows.

Data Storage in Windows Azure HDInsight

HDInsight on Windows Azure stores data in Azure Storage Vault (ASV) containers in Windows Azure blob storage. However, it supports Hadoop file system commands and processes through a fully HDFS-compliant layer over the ASV. As far as Hadoop is concerned, storage operates in exactly the same way as when using a physical HDFS implementation. The only real difference is that you can access the storage in the ASV using standard Windows Azure blob storage techniques as well as through the HDFS layer. It may seem as though using a separate storage mechanism from the HDFS cluster instances defeats the purpose of Hadoop, where one of the main tenets is that the data does not need to be moved because it is distributed amongst and stored on each server in the cluster. However, Windows Azure datacenters use flat network storage, which provides extremely fast, high-bandwidth connectivity between storage and the virtual machines that make up an HDInsight cluster.
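As a sketch of this dual addressing, the following Python fragment shows how the same blob might be referenced through the HDFS-compliant asv:// layer and through an ordinary blob storage URL. The account, container, and path names are invented, and the asv:// address layout shown is an assumption based on HDInsight conventions of the time; check your own cluster configuration for the exact form.

```python
def asv_uri(account, container, blob_path):
    """Build an asv:// address (the HDFS-compliant layer over blob storage,
    as used by HDInsight circa 2013) for a blob. Layout is an assumption."""
    return "asv://{0}@{1}.blob.core.windows.net/{2}".format(container, account, blob_path)

def https_uri(account, container, blob_path):
    """Build the standard blob storage URL for the same blob, which other
    (non-Hadoop) applications can use to read or write the data."""
    return "https://{0}.blob.core.windows.net/{1}/{2}".format(account, container, blob_path)

# The same hypothetical blob, seen from Hadoop and from a blob storage client.
print(asv_uri("mystore", "data", "input/logs.txt"))
print(https_uri("mystore", "data", "input/logs.txt"))
```

The point of the sketch is that a single copy of the data serves both worlds: Hadoop jobs address it through the asv:// layer, while other tools use the normal blob endpoint.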
Using Windows Azure blob storage provides several advantages:

- Data storage costs are minimized because you can decommission a cluster when not performing queries; data in Windows Azure blob storage is persisted when the HDInsight cluster is released and you can build a new cluster on the existing source data in blob storage. Windows Azure blob storage is also considerably cheaper than the cost of storage attached to the HDInsight cluster members.
- Data transfer costs are minimized because you do not have to upload the data again over the Internet when you recreate a cluster that uses the same data; there is no transfer cost when loading an HDFS cluster from blob storage in the same datacenter. By default, when you delete and recreate a cluster, you will lose all of the metadata associated with the cluster, such as Hive external table and index definitions. See the section Retaining Data and Metadata for a Cluster in Chapter 7 for information about how you can preserve and reinstate this metadata.
- Data stored in Windows Azure blob storage can be accessed by and shared with other applications and services, whereas data stored in HDFS can only be accessed by HDFS-aware applications that have access to the cluster storage.
- Data in Windows Azure blob storage is replicated across three locations in the datacenter, so it provides a similar level of redundancy to protect against data loss as an HDFS cluster.
- Blob storage can be used to store large volumes of data without being concerned about scaling out storage in a cluster, or changing the scaling in response to changes in storage requirements.
- The high-speed datacenter network provides fast access between the virtual machines in the cluster, so data movement that occurs between nodes during the reduce stage is very efficient.
- Tests carried out by the Windows Azure team indicate that blob storage provides near-identical performance to HDFS when reading data.
However, Windows Azure blob storage provides faster write access than using HDFS to store data on the nodes in the cluster, because a write operation is complete as soon as the single write to the blob has completed. When using HDFS with data stored in the cluster, a write operation is not complete until all of the replicas (by default there are three) have completed.
However, it's important to note that these advantages only apply when you locate the HDInsight cluster in the same datacenter as your ASV. If you choose different datacenters you will severely hamper performance, and you will also be charged for data transfer costs between datacenters.
For more information about the use of ASV instead of HDFS for data storage see the blog post Why use Blob Storage with HDInsight on Azure. For details of the flat network storage see Windows Azure's Flat Network Storage and 2012 Scalability Targets.
By default, when you configure an HDInsight cluster, it uses a storage account located in the same datacenter. If this was not the case, your queries would be extremely inefficient and you would also be charged data transfer costs! Consider turning off the default geo-replication feature for the storage account unless the data is very difficult to collect or recreate and you need this redundancy. If the processing fails over to a remote storage account, query efficiency will be severely affected and you will incur additional transfer costs moving the data between datacenters.
Configuring an HDInsight Cluster

You access Windows Azure HDInsight through the Windows Azure Management Portal at https://manage.windowsazure.com/. The first step is to create a new HDInsight cluster and, as with most other Windows Azure services, you have two options:

- Quick Create allows you to create a cluster simply by specifying a unique name for the cluster (which will be available at your-name.azurehdinsight.net), the cluster size (the default is four nodes), a password for the cluster (the default user name is admin), and an existing Windows Azure storage account in which HDInsight will store the cluster data.
- Custom Create allows you to specify a different user name from the default of admin, and select a container within your storage account that holds the source data for the cluster.
You don't need to perform any other configuration before you start using the cluster. However, there are many configuration settings that you can change to modify the way that the cluster works and to fine-tune the performance of your queries. For example, you can specify the number of map and reduce tasks for a query, and apply compression to the data, in a script, by adding parameters to the command that runs the query, or by editing the cluster configuration files.
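As a sketch of what such per-query overrides look like on the command line, the following Python helper assembles a hadoop jar invocation with -D parameters. The jar and class names are hypothetical; mapred.reduce.tasks and mapred.output.compress are standard Hadoop 1.x configuration settings, but verify the names against your cluster's Hadoop version.

```python
def hadoop_command(jar, job_class, conf, args):
    """Assemble a 'hadoop jar' command line with -D configuration overrides."""
    parts = ["hadoop", "jar", jar, job_class]
    for key in sorted(conf):                # deterministic ordering of overrides
        parts.append("-D")
        parts.append("{0}={1}".format(key, conf[key]))
    parts.extend(args)
    return " ".join(parts)

cmd = hadoop_command(
    "myjob.jar", "com.example.LogSummary",       # hypothetical job jar and class
    {"mapred.reduce.tasks": 8,                   # number of reduce tasks for this query
     "mapred.output.compress": "true"},          # compress the job output
    ["/input/logs", "/output/summary"])          # input and output paths
print(cmd)
```

The same settings could instead be placed in the cluster configuration files to apply to every job, rather than being passed per query.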
For more information about setting configuration parameters see Chapter 7, Managing and Monitoring HDInsight, of this guide.
Choosing an Architectural Style

Big Data solutions open up new opportunities for converting data into information. They can also be used to extend existing information systems to provide additional insights through analytics and data visualization. Every organization is different, and so there is no definitive list of ways that you can use a Big Data solution as part of your own business processes. However, there are four general architectural models. Understanding these will help you to start making decisions on how best to integrate Big Data solutions such as HDInsight with your organization, and with your existing BI systems and tools. The four different models are:

- As a standalone data analysis experimentation and visualization tool. This model is typically chosen for experimenting with data sources to discover if they can provide useful information, and for handling data that you cannot process using existing systems. For example, you might collect feedback from customers through email, web pages, or external sources such as social media sites, then analyze it to get a picture of user sentiment for your products. You might be able to combine this information with other data, such as demographic data that indicates population density and characteristics in each city where your products are sold.
- As a data transfer, data cleansing, or ETL mechanism. Big Data solutions can be used to extract and transform data before you load it into your existing databases or data visualization tools. Such solutions are well suited to performing categorization and normalization of data, and for extracting summary results to remove duplication and redundancy. This is typically referred to as an Extract, Transform, and Load (ETL) process.
- As a basic data warehouse or commodity storage mechanism. Big Data solutions allow you to store both the source data and the results of queries executed over this data.
You can also store schemas (or, to be precise, metadata) for tables that are populated by the queries you execute. These tables can be indexed, although there is no formal mechanism for managing key-based relationships between them. However, you can create data repositories that are robust and reasonably low cost to maintain, which is especially useful if you need to store and manage huge volumes of data.

- As an integrated part of an enterprise data warehouse and BI system. Enterprise-level data warehouses have some special characteristics that differentiate them from OLTP database systems, and so there are additional considerations for integrating with Big Data solutions. You can also integrate at different levels, depending on the way that you intend to use the data obtained from your Big Data solution.
The four scenario chapters later in this guide, and the associated Hands-on Labs that you can download from http://wag.codeplex.com/releases/view/103405, provide examples of each of these architectural styles. Each scenario explores how a fictitious organization implemented a solution using one of the architectural styles described in this chapter. Each of these basic models encompasses several more specific use cases. For example, when integrating with an enterprise data warehouse and BI system you need to decide at what level (report, corporate data model, or data warehouse level) you will integrate the output from your Big Data solutions. The following sections of this chapter explore the four models listed above in more detail, using HDInsight as the Big Data mechanism. You will see advice on when to choose each one, how you can use it to provide the information you need, and some of the typical use cases for each one.

Using HDInsight as a Standalone Experimentation and Visualization Tool

Traditional data storage and management systems such as data warehouses, data models, and reporting and analytical tools provide a wealth of information on which to base business decisions. However, while traditional BI works well for business data that can easily be structured, managed, and processed in a dimensional analysis model, some kinds of analysis require a more flexible solution that can derive meaning from less obvious sources of data such as log files, email messages, tweets, and others. There's a great deal of useful information to be found in these less structured data sources, which often contain huge volumes of data that must be processed to reveal key data points. This kind of data processing is what Big Data solutions such as HDInsight were designed to handle.
HDInsight provides a way to process extremely large volumes of unstructured or semi-structured data, often by performing complex computation and transformation of the data, to produce an output that can be visualized directly or combined with other datasets. If you do not intend to reuse the information from the analysis, but just want to explore the data, you may choose to consume it directly in an analysis or visualization tool such as Microsoft Excel.

The standalone experimentation and visualization model is typically suited to the following scenarios:

- Handling data that you cannot process using existing systems, perhaps by performing complex calculations and transformations that are beyond the capabilities of existing systems to complete in a reasonable time.
- Collecting stream data that arrives at a high rate or in large batches without affecting existing database performance.
- Collecting feedback from customers through email, web pages, or external sources such as social media sites, then analyzing it to get a picture of customer sentiment for your products.
- Combining information with other data, such as demographic data that indicates population density and characteristics in each city where your products are sold.
- Dumping data from your existing information systems so that you can work with it without interrupting other business processes or risking corruption of the original data.
- Trying out new ideas and validating processes before implementing them within the live system.
Combining your data with datasets available from the Windows Azure DataMarket or other commercial data sources can reveal useful information that might otherwise remain hidden in your data.

Architecture Overview

Figure 2 shows an overview of the architecture for a standalone data analysis experimentation and visualization solution using HDInsight. The source data files are loaded into the HDInsight cluster, processed by one or more queries within HDInsight, and the output is consumed by the chosen reporting and visualization tools.
Figure 2 High-level architecture for the standalone experimentation and visualization model

Text files such as log files and compressed binary files can be loaded directly into the cluster storage, while stream data will usually need to be handled by a suitable mechanism such as StreamInsight. The output data may be combined with other datasets within your visualization and reporting tools to augment the information and provide comparisons, as you will see later in this section of the chapter.

When using HDInsight as a standalone query mechanism, you will often do so as an interactive process. For example, you might use Data Explorer in Excel to submit a query to an HDInsight cluster and wait for the results to be returned, usually within a few seconds or even a few minutes. You can then modify and experiment with the query to optimize the information it returns. However, keep in mind that all Big Data operations are batch jobs that are submitted to all of the servers in the cluster for parallel processing, and queries can often take minutes or hours to complete. For example, you might use Pig to process an input file, with the results returned in an output file some time later, at which point you can perform the analysis by importing this file into your chosen visualization tool.
Data Sources for the Standalone Experimentation and Visualization Model

The input data for HDInsight can be almost anything, but for this model it typically includes the following:

- Social data, log files, sensors, and applications that generate data files.
- Datasets obtained from the Windows Azure DataMarket and other commercial data providers.
- Internal data extracted from databases or data warehouses for experimentation and one-off analysis.
- Streaming data filtered and pre-processed through SQL Server StreamInsight.
Notice that, as well as externally obtained data, Figure 2 includes data from within your organization's existing database or data warehouse. HDInsight is an ideal solution when you want to perform offline exploration of existing data in a sandbox. For example, you may join several datasets from your data warehouse to create large datasets that act as the source for some experimental investigation, or to test new analysis techniques. This avoids the risk of interrupting existing systems, affecting performance of your data warehouse system, or accidentally corrupting the core data.

Often you need to perform more than one query on the data in HDInsight to get the results into the form you need. It's not unusual to base queries on the results of a preceding query; for example, using one query to select and transform the required data and remove redundancy, a second query to summarize the data returned from the first query, and a third query to format the output as required. This iterative approach enables you to start with a large volume of complex and difficult to analyze data, and get it into a structure that you can consume directly from an analytical tool such as Excel or, as you'll see later, use as the input into a managed BI solution.

Output Targets for the Standalone Experimentation and Visualization Model

The results from your HDInsight queries can be visualized using any of the wide range of tools that are available for analyzing data, combining it with other datasets, and generating reports. Typical examples for the standalone experimentation and visualization model are:

- HDInsight interactive reporting and charting.
- Interactive analytical tools such as Excel, PowerPivot, Power View, and GeoFlow.
- SQL Server Reporting Services using Report Builder.
- Custom or third-party analysis and visualization tools.
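The iterative, multi-query refinement described above can be sketched in miniature as plain single-machine Python that mimics the select/transform, summarize, and format stages. The log records and field layout are invented for illustration; in HDInsight each stage would typically be a Hive or Pig query running over the cluster.

```python
# Invented raw "log" records: page, HTTP status, response bytes.
raw = [
    "home,200,1200", "home,200,1100", "about,404,0",
    "home,500,0", "about,200,900", "about,200,950",
]

def select_and_clean(lines):
    """Pass 1: parse the raw records and keep only successful requests."""
    rows = [line.split(",") for line in lines]
    return [(page, int(size)) for page, status, size in rows if status == "200"]

def summarize(rows):
    """Pass 2: aggregate bytes served per page."""
    totals = {}
    for page, size in rows:
        totals[page] = totals.get(page, 0) + size
    return totals

def format_output(totals):
    """Pass 3: emit tab-delimited lines ready for a visualization tool."""
    return ["{0}\t{1}".format(page, totals[page]) for page in sorted(totals)]

result = format_output(summarize(select_and_clean(raw)))
print(result)  # ['about\t1850', 'home\t2300']
```

Each pass consumes the output of the previous one, which is exactly the pattern of basing one HDInsight query on the results of its predecessor.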
You will see more details of these tools in Chapter 5, Consuming and Visualizing the Results, of this guide.
Considerations for the Standalone Experimentation and Visualization Model

There are some important points to consider when choosing the standalone model for querying and visualizing data with HDInsight. This model is typically used when you want to:

- Experiment with new types or sources of data.
- Generate one-off reports or visualizations of external or internal data.
- Monitor a data source using visualizations to detect changes or to predict behavior.
- Combine the data with other datasets to generate comparisons or to augment the information.

You will usually choose this model when you do not want to persist the results of the query after analysis, or after the required reports have been generated. It is typically used for one-off analysis tasks where the results are discarded after use, and so it differs from the other models described in this chapter, in which the results are stored and reused.

Very large datasets are likely to preclude the use of an interactive approach due to the time taken for the queries to run. However, after the queries are complete you can work interactively with the data to perform different types of analysis or visualization.

Data arriving as a stream, such as the output from sensors on an automated production line or the data generated by GPS sensors in mobile devices, cannot be analyzed directly by HDInsight. A typical technique is to capture it using a stream processing technology such as StreamInsight, which may power a real-time visualization or rudimentary analysis tool, and also feed it into an HDInsight cluster where it can be queried interactively or as a batch job at suitable intervals.

You are not limited to running a single query on the source data. You can follow an iterative pattern in which the data is passed through HDInsight multiple times, each pass refining the data until it is suitably prepared for use in your analytical tool.
For example, a large unstructured file might be processed using custom MapReduce components to generate a smaller, more structured output file. This output could then be used in turn as the input for a Hive query that returns aggregated data in tabular form.
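As a much-simplified illustration of the MapReduce pattern behind such custom components, the following Python sketch simulates the map, shuffle/sort, and reduce phases on a single machine, counting words in unstructured text. A real job would distribute these phases across the cluster nodes, and the input lines here are invented.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) pair for each word in an unstructured line."""
    for word in line.lower().split():
        yield (word.strip(".,!?"), 1)

def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce on a single machine."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))              # the shuffle/sort step groups equal keys
    return {key: sum(v for _, v in group)       # the reduce step aggregates each group
            for key, group in groupby(mapped, key=itemgetter(0))}

counts = run_job(["Big Data, big insight", "data is big"])
print(counts)  # {'big': 3, 'data': 2, 'insight': 1, 'is': 1}
```

The structured (word, count) output is exactly the kind of smaller intermediate file that a subsequent Hive query could aggregate further in tabular form.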
For an example of a scenario that uses the standalone experimentation and visualization architectural style see Scenario 1, Iterative Data Exploration, in Chapter 8 of this guide.

Using HDInsight as a Data Transfer, Data Cleansing, or ETL Mechanism

In a traditional business environment the data to power your reporting mechanism will usually come from tables in a database. However, it's increasingly necessary to supplement this with data obtained from outside your organization. This may be commercially available datasets, such as those available from the Windows Azure DataMarket and elsewhere, or it may be data from less structured sources such as social media, emails, log files, and more. You will, in most cases, need to cleanse, validate, and transform this data before loading it into an existing database. Extract, Transform, and Load (ETL) operations can use Big Data solutions such as HDInsight to perform pattern matching, data categorization, de-duplication, and summary operations on unstructured or semi-structured data to generate data in the familiar rows and columns format that can be imported into a database table or a data warehouse.

The data transfer, data cleansing, or ETL model is typically suited to the following scenarios:

- Extracting and transforming data before you load it into your existing databases or analytical tools.
- Performing categorization and restructuring of data, and extracting summary results to remove duplication and redundancy.
- Preparing data so that it is in the appropriate format, and has appropriate content, to power other applications or services.
ETL may not seem like the most exciting process in the world of IT, but it's the primary way that we can ensure that the data is valid and contains correct values. ETL gets the data into the correct format for our database tables, while data cleansing is the gatekeeper that protects our tables from invalid, duplicated, or incorrect values.

Architecture Overview

Figure 3 shows an overview of the data transfer, data cleansing, or ETL model. Input data is transformed by HDInsight to generate the appropriate output format and data content, and then imported into the target data store, application, or reporting solution. Analysis and reporting can then be done against the data store, often by combining the imported data with existing data in the data store. Alternatively, applications can use data in a specific format for a variety of purposes.
Figure 3 High-level architecture for the data transfer, data cleansing, or ETL model

The transformation process in HDInsight may involve just a single query, but it is more likely to require a multi-step process. For example, it might use custom Map and Reduce components or (more likely) Pig scripts, followed by a Hive query in the final stage to generate the table format. However, the final format for import into a data store may be something other than a table; for example, it may be a tab-delimited file or some other format suitable for import into the target data store.

Data Sources for the Data Transfer, Data Cleansing, or ETL Model

Data sources for this model are typically external data that can be matched on a key to existing data in your data store, so that it can be used to augment the results of analysis and reporting processes. Some examples are:

- Social media data, log files, sensors, and applications that generate data files.
- Datasets obtained from the Windows Azure DataMarket and other commercial data providers.
- Streaming data filtered or processed through SQL Server StreamInsight.
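As a minimal single-machine sketch of the kind of cleansing an HDInsight ETL process performs at scale, the following Python fragment de-duplicates, validates, and normalizes invented customer records. The country-code mapping and the validation rule (rejecting negative amounts) are assumptions chosen purely for illustration.

```python
def cleanse(records):
    """De-duplicate on customer id and normalize the country field,
    discarding rows that fail a basic validation check."""
    country_map = {"uk": "United Kingdom", "gb": "United Kingdom",
                   "usa": "United States"}                    # illustrative mapping
    seen, clean = set(), []
    for cust_id, country, amount in records:
        if cust_id in seen or amount < 0:   # skip duplicate or invalid rows
            continue
        seen.add(cust_id)
        clean.append((cust_id,
                      country_map.get(country.strip().lower(), country),
                      amount))
    return clean

# Invented input: customer id 1 appears twice, and the last row fails validation.
rows = [(1, "UK", 25.0), (2, "usa", 10.0), (1, "uk", 25.0), (3, "gb", -5.0)]
print(cleanse(rows))  # [(1, 'United Kingdom', 25.0), (2, 'United States', 10.0)]
```

The cleansed rows-and-columns output is then in a shape that can be imported directly into a database table or data warehouse.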
Output Targets for the Data Transfer, Data Cleansing, or ETL Model

This model is designed to generate output that is in the appropriate format for the target data store. Common types of data store are:

- A database such as SQL Server or Windows Azure SQL Database.
- A document or file sharing mechanism such as SharePoint Server or other information management systems.
- A local or remote data repository in a custom format, such as JSON objects.
- Cloud data stores such as Windows Azure table or blob storage.
- Applications or services that require data to be processed into specific formats, or files that contain specific types of information structure.
You may decide to use this approach even when you don't actually want to keep the results of the HDInsight query. You can load the results into your database, generate the reports and analyses you require, and then delete the data from the database. You may need to do this every time if the source data in HDInsight changes between each reporting cycle, so that just adding new data is not appropriate.

Considerations for the Data Transfer, Data Cleansing, or ETL Model

There are some important points to consider when choosing the data transfer, data cleansing, or ETL model. This model is typically used when you want to:

- Load stream data or large volumes of semi-structured or unstructured data from external sources into an existing database or information system.
- Cleanse, transform, and validate the data before loading it.
- Generate reports and visualizations that are regularly updated.
- Power other applications that require specific types of data, such as using an analysis of previous behavioral information to apply personalization to an application or service.

When the output is in tabular format, such as that generated by Hive, the data import process can use the Hive ODBC driver. Alternatively, you can use Sqoop (which is included in the Hadoop distribution installed by HDInsight) to connect a relational database such as SQL Server or Windows Azure SQL Database to your HDInsight data store and export the results of a query into your database. Some other connectors for accessing Hive data are available from Couchbase, Jaspersoft, and Tableau Software.

If the target for the data is not a database, you can generate a file in the appropriate format within the HDInsight query. This might be tab-delimited format, fixed-width columns, some other format for loading into Excel or a third-party application, or even for loading into Windows Azure storage through a custom data access layer that you create.
If the intention is to regularly update the target table or data store as the source data changes you will probably choose to use an automated mechanism to execute the query and data import processes. However, if it is a one-off operation you may decide to execute it interactively only when required. Windows Azure table storage can be used to store table formatted data using a key to identify each row. Windows Azure blob storage is more suitable for storing compressed or binary data generated from the HDInsight query if you want to store it for reuse.
For an example of a scenario that uses the data transfer, data cleansing, or ETL architectural style see Scenario 2, Extract, Transform, Load, in Chapter 9 of this guide.

Using HDInsight as a Basic Data Warehouse or Commodity Data Store

Big Data solutions such as HDInsight can provide a robust, high performance, and cost-effective data storage and querying mechanism. Data is replicated in the storage system, and queries are distributed across the nodes for fast parallel processing. HDInsight also saves the output from queries in the ASV replicated blob storage. This combination of capabilities means that you can use HDInsight as a basic data warehouse. The low cost of storage when compared to most relational database mechanisms that have the same level of reliability also means that you can use HDInsight simply as a commodity storage mechanism for huge volumes of data, even if you decide not to transform it into Hive tables.

If you need to store vast amounts of data, irrespective of the format of that data, a Big Data solution such as HDInsight can reduce administration overhead and save money by minimizing the need for the high performance database servers and storage clusters used by traditional relational database systems. You may even choose to use a cloud-hosted Big Data solution in order to reduce the requirements and running costs of on-premises infrastructure.

The data warehouse or commodity data store model is typically suited to the following scenarios:

- Exposing both the source data and the results of queries executed over this data in the familiar row and column format, using a range of data types for the columns that includes both primitive types (including timestamps) and complex types such as arrays, maps, and structures.
- Storing schemas (or, to be precise, metadata) for tables that are populated by the queries you execute.
Creating indexes for tables, and partitioning tables based on a clustered index so that each has a separate metadata definition and can be handled separately. Creating views based on tables, and creating functions for use in both tables and queries. Creating data repositories that are robust and reasonably low cost to maintain, which is especially useful if you need to store and manage huge volumes of data. Consuming data directly from HDInsight in business applications, through interactive analytical tools such as Excel, or in corporate reporting platforms such as SQL Server Reporting Services. Storing large quantities of data is a robust yet cost-effective way, for use now or in the future. Architecture Overview Figure 4 shows an overview of the architecture for this model. The source data files may be obtained from external sources or as internal data generated by your business processes and applications. You might decide to use this model instead of using a locally installed data warehouse based on a relational database model. However, there are some limitations when using a Big Data solution such as HDInsight as a data warehouse.
Figure 4: High-level architecture for the data warehouse or commodity data store model

This model is also suitable for use as a data store where you do not need to implement the typical data warehouse capabilities. For example, you may just want to minimize storage cost when saving large tabular-format data files for use in the future; large text files such as email archives, or data that you must keep for legal or regulatory reasons but do not need to process; or large quantities of binary data such as images or documents. In this case you simply load the data into the Windows Azure blob storage associated with the cluster, without creating Hive tables for it.

Big Data storage is robust when a cluster is running because the data is replicated at least three times across the nodes of the cluster. When you use HDInsight, the source data in Windows Azure storage is replicated three times as well, including geo-replication unless you specifically disable this option.

You might also consider storing partly processed data where you have performed some translation or summary of the data, but it is still in a relatively raw form that you want to keep in case it is useful in the future. For example, you might use Microsoft StreamInsight to allocate incoming positional data from a fleet of vehicles into separate categories or areas, add some reference key to each item, and then store the results ready for processing at a later date. Stream data typically generates very large files, and may arrive in rapid bursts, so using a solution such as HDInsight helps to minimize the load on your existing data management systems. When you need to process the archived data, you create an HDInsight cluster based on the Windows Azure blob storage container holding that data. When you finish processing the data you can tear down the cluster without losing the original archived data.
See the section Retaining Data and Metadata for a Cluster in Chapter 7, and Scenario 3 in Chapter 10 of this guide, for information about preserving or recreating metadata such as Hive table definitions when you tear down and then recreate an HDInsight cluster.

Data Sources for the Data Warehouse or Commodity Data Store Model

Data sources for this model are typically data collected from internal and external business processes, but may also include reference data and datasets obtained from other sources that can be matched on a key to existing data in your data store so that it can be used to augment the results of analysis and reporting processes. Some examples are:

- Data generated by internal business processes, websites, and applications.
- Reference data and data definitions used by business processes.
- Datasets obtained from Windows Azure Data Market and other commercial data providers.
If you adopt this model simply as a commodity data store rather than a data warehouse, you might also load data from other sources such as social media data, log files, and sensors; or streaming data filtered or processed through SQL Server StreamInsight.

Output Targets for the Data Warehouse or Commodity Data Store Model

The main intention of this model is to provide the equivalent of an internal data warehouse system based on the traditional relational database model, and expose it as Hive tables. You can use these tables in a variety of ways, such as:

- To combine the datasets for analysis, and to generate reports and business information.
- To generate ancillary information such as related items or recommendation lists for use in applications and websites.
- For external access through web applications, web services, and other services.
- To power information systems such as SharePoint Server through web parts and the Business Data Connector (BDC).
If you adopt this model simply as a commodity data store rather than a data warehouse, you might use the data you store as an input for any of the models described in this chapter. The data in an HDInsight data warehouse can be analyzed and visualized using any tools that can consume Hive tables. Typical examples are:

- SQL Server Reporting Services
- SQL Server Analysis Services
- Interactive analytical tools such as Excel, PowerPivot, Power View, and GeoFlow
- Custom or third-party analysis and visualization tools
You will see more details of the reporting tools in Chapter 5, Consuming and Visualizing the Results, of this guide. The section Corporate Data Model Level Integration later in this chapter discusses SQL Server Analysis Services. You can also download a case study that describes using SQL Server Analysis Services with Hive.

Considerations for the Data Warehouse or Commodity Data Store Model

There are some important points to consider when choosing the data warehouse or commodity data store model for querying and visualizing data with HDInsight. This model is typically used when you want to:

- Create a central point for analysis and reporting of Big Data by multiple users and tools.
- Store multiple datasets for use by internal applications and tools.
- Host your data in the cloud to benefit from reliability and elasticity, minimize cost, and reduce administration overhead.
- Store both externally collected data and data generated by internal tools and processes.
- Refresh the data at scheduled intervals or on demand.

You can use Hive in HDInsight to:

- Define tables that have the familiar row and column format, with a range of data types for the columns that includes both primitive types (including timestamps) and complex types such as arrays, maps, and structures.
- Load data from the file system into tables, save data to the file system from tables, and populate tables from the results of running a query.
- Create indexes for tables, and partition tables based on a clustered index so that each has a separate metadata definition and can be handled separately.
- Rename, alter, and drop tables, and modify columns in a table as required.
- Create views based on tables, and create functions for use in both tables and queries.

For more details of how to work with Hive tables, see Hive Data Definition Language on the Apache Hive website.
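As an illustration of these capabilities, the following HiveQL sketch defines a partitioned external table over delimited files in blob storage, and a view that summarizes it. The table name, column names, storage account, and path are hypothetical examples, not part of this guide's sample code.

```sql
-- Hypothetical example: an external table over tab-delimited log data,
-- partitioned by year and month so each partition can be managed separately.
CREATE EXTERNAL TABLE page_views (
  view_time TIMESTAMP,
  user_id STRING,
  page_url STRING,
  referrer_url STRING
)
PARTITIONED BY (year INT, month INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'asv://mycontainer@myaccount.blob.core.windows.net/pageviews';

-- Register a partition for data already uploaded to blob storage.
ALTER TABLE page_views ADD PARTITION (year = 2013, month = 5);

-- A view that exposes a summarized form of the raw data to reporting tools.
CREATE VIEW monthly_page_counts AS
SELECT year, month, page_url, COUNT(*) AS views
FROM page_views
GROUP BY year, month, page_url;
```

Because the table is external, dropping it removes only the metadata; the underlying files remain in blob storage, which is what makes the tear-down-and-recreate approach described later possible.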
The main limitation of Hive tables compared to a relational database is that you cannot create constraints, such as foreign key relationships, that are automatically managed. Incoming data may be processed by any type of query, not just Hive, to cleanse and validate the data before converting it to table format. You can store the Hive queries and views within HDInsight so that they can be used in much the same way as stored procedures in a relational database to extract data on demand. However, keep in mind that the core Hadoop engine is designed for batch processing by distributing data and queries across all the nodes in the cluster. To minimize response times you will probably need to pre-process the data where possible using queries within HDInsight, and store these intermediate results in order to reduce the time-consuming overhead of complex queries.

You can use the HDInsight ODBC connector in SQL Server to create linked servers. This allows you to write Transact-SQL queries that join tables in a SQL Server database to tables stored in an HDInsight data warehouse.

If you want to be able to delete and restore the cluster, as described in Chapter 7 of this guide, you must create the Hive tables as external table definitions. See the section Managing Hive Table Data Location and Lifetime in Chapter 4 of this guide for more information about creating external tables in Hive.
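As a rough sketch of the linked server approach, the following Transact-SQL fragment assumes you have already configured a system DSN (here called HiveDSN) for the ODBC driver for Hive. The linked server name, local table, and Hive table names are hypothetical.

```sql
-- Hypothetical sketch: create a linked server over the Hive ODBC DSN.
EXEC master.dbo.sp_addlinkedserver
  @server = N'HDINSIGHT',
  @srvproduct = N'Hive',
  @provider = N'MSDASQL',   -- the OLE DB provider for ODBC
  @datasrc = N'HiveDSN';

-- Join a local SQL Server table to a Hive table through the linked server.
SELECT c.CustomerName, h.views
FROM dbo.Customers AS c
JOIN OPENQUERY(HDINSIGHT,
  'SELECT user_id, COUNT(*) AS views FROM page_views GROUP BY user_id') AS h
  ON c.CustomerId = h.user_id;
```

OPENQUERY sends the inner query to Hive for execution, so aggregation happens on the cluster and only the summarized rows cross the link; joining raw Hive tables directly would pull far more data across the connection.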
For an example of a scenario that uses the basic data warehouse or commodity data store architectural style, see Scenario 3, HDInsight as a Data Warehouse, in Chapter 10 of this guide.

Integrating HDInsight with Enterprise Data Warehouses and BI

You saw in Chapter 1 that Microsoft offers a set of applications and services to support end-to-end solutions for working with data. The Microsoft data platform includes all of the elements that make up an enterprise-level BI system for organizations of any size. Organizations that already use an enterprise BI solution for business analytics and reporting can extend their analytical capabilities by using HDInsight to add new sources of data to their decision-making processes.

This model is typically suited to the following scenarios:

- You have an existing enterprise data warehouse and BI system that you want to augment with data from outside your organization.
- You want to explore new ways to combine data to provide better insight into history and to predict future trends.
- You want to give users more opportunities for self-service reporting and analysis that combines managed business data and Big Data from other sources.
The information in this section will help you to understand how you can integrate HDInsight with an enterprise BI system. However, a complete discussion of the factors related to the design and implementation of a data warehouse and BI solution is beyond the scope of this guide. Appendix B, An Overview of Enterprise Business Intelligence, provides a high-level description of the key features commonly found in an enterprise BI infrastructure. You will find this useful if you are not already familiar with such systems.
Architecture Overview

Enterprise BI is a topic in itself, and there are several factors that require special consideration when integrating a Big Data solution such as HDInsight with an enterprise BI system. If you are not already familiar with enterprise BI systems, it is worthwhile exploring Appendix B, An Overview of Enterprise Business Intelligence, of this guide. It will help you to understand the main factors that make an enterprise BI system different from transactional data management scenarios.

Figure 5 shows an overview of a typical enterprise data warehouse and BI solution. Data flows from business applications and other external sources through an ETL process and into a data warehouse. Corporate data models are then used to provide shared analytical structures such as online analytical processing (OLAP) cubes or data mining models, which can be consumed by business users through analytical tools and reports. Some BI implementations also enable analysis and reporting directly from the data warehouse, enabling advanced users such as business analysts and data scientists to create their own personal analytical data models and reports.
Figure 5: Overview of a typical enterprise data warehouse and BI implementation

There are many technologies and techniques that you can use to combine HDInsight and existing BI technologies, but as a general guide there are three primary ways in which HDInsight can be used with an enterprise BI solution:

1. Report level integration. Data from HDInsight is used in reporting and analytical tools to augment data from corporate BI sources, enabling the creation of reports that include data from corporate BI sources as well as HDInsight, and also enabling individual users to combine data from both solutions into consolidated analyses. This level of integration is typically used for creating mashups, exploring datasets to discover possible queries that can find hidden information, and for generating one-off reports and visualizations.

2. Corporate data model level integration. HDInsight is used to process data that is not present in the corporate data warehouse, and the results of this processing are then added to corporate data models where they can be combined with data from the data warehouse and used in multiple corporate reports and analysis tools. This level of integration is typically used for exposing the data in specific formats to information systems, and for use in reporting and visualization tools.

3. Data warehouse level integration. HDInsight is used to prepare data for inclusion in the corporate data warehouse. The HDInsight data that has been loaded is then available throughout the entire enterprise BI solution. This level of integration is typically used to create standalone tables on the same database hardware as the enterprise data warehouse, which provides a single source of enterprise data for analysis, or to incorporate the data into a dimensional schema and populate dimension and fact tables for full integration into the BI solution.
Figure 6 shows an overview of these three levels of integration.
Figure 6: Three levels of integration for HDInsight with an enterprise BI system

The following sections describe these three integration levels in more detail to help you understand the implications of your choice. They also contain guidelines for implementing each one. However, keep in mind that you don't need to make just a single choice for all of your processes; you can use a different approach for each dataset that you extract from HDInsight, depending on the scenario and the requirements for that dataset.

Choose a suitable integration level for each dataset or data querying operation you carry out. You aren't limited to using the same one for every task that interacts with your Big Data solution.

Report Level Integration

Using HDInsight in a standalone, non-integrated way, as described earlier in this chapter, can unlock information from currently unanalyzed data sources. However, much greater business value can be obtained by integrating the results from HDInsight queries with data and BI activities that are already present in the organization. In contrast to the rigidly defined reports created by BI developers, a growing trend is a self-service approach. In this approach the data warehouse provides some datasets based on data models and queries defined there, but the user selects the datasets and builds a custom report. The ability to combine multiple data sources in a personal data model enables a more flexible approach to data exploration that goes beyond the constraints of a formally managed corporate data warehouse. Users can augment reports and analysis of data from the corporate BI solution with additional data from HDInsight to create a mashup solution that brings data from both sources into a single, consolidated report.

You can use the following techniques to integrate output from HDInsight with enterprise BI data at the report level.
- Use the Data Explorer add-in to download the output files generated by HDInsight and open them in Excel, or import them into a database for reporting.
- Create Hive tables in HDInsight and consume them directly from Excel (including using PowerPivot or GeoFlow) or from SQL Server Reporting Services (SSRS) by using the ODBC driver for Hive.
- Use Sqoop to transfer the results from HDInsight into a relational database for reporting. For example, copy the output generated by HDInsight into a Windows Azure SQL Database table and use Windows Azure SQL Reporting Services to create a report from the data.
- Use SQL Server Integration Services (SSIS) to transfer and, if required, transform HDInsight results to a database or file location for reporting. If the results are exposed as Hive tables you can use an ODBC data source in an SSIS data flow to consume them. Alternatively, you can create an SSIS control flow that downloads the output files generated by HDInsight and uses them as a source for a data flow.
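As a rough sketch of the Sqoop technique, the following command exports the files that a query wrote to a folder in the cluster's storage into an existing SQL Database table. The server, database, table, and folder names are hypothetical, and the target table must already exist with a matching schema.

```
sqoop export \
  --connect "jdbc:sqlserver://myserver.database.windows.net:1433;database=ReportDB" \
  --username reportuser --password <password> \
  --table PageViewSummary \
  --export-dir /hive/warehouse/monthly_page_counts \
  --input-fields-terminated-by '\t'
```

The --export-dir value points at the folder containing the query output files, and the field terminator must match the delimiter used when the results were written.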
Microsoft SQL Server Connector for Apache Hadoop (SQL Server-Hadoop Connector) is a Sqoop-based connector for data transfer between SQL Server 2008 R2 and Hadoop. You can download this from the Microsoft Download Center.
Corporate Data Model Level Integration

Integration at the report level makes it possible for advanced users to combine data from HDInsight with existing corporate data sources to create data mashups. However, there may be some scenarios where you need to deliver combined information to a wider audience, or to users who do not have the time or ability to create their own complex data analysis solutions. Alternatively, you might want to take advantage of the functionality available in corporate data modeling platforms such as SQL Server Analysis Services (SSAS) to add value to the Big Data insights gained from HDInsight. By integrating data from HDInsight into corporate data models you can accomplish both of these aims, and use the data as the basis for enterprise reporting and analytics.
Integrating the output from HDInsight with your corporate data models allows you to use tools such as SQL Server Analysis Services to analyze the data and present it in a format that is easy to use in reports, or for performing deeper analysis.
You must choose between the multidimensional and tabular data modes when you install SQL Server Analysis Services, though you can install two instances if you need both modes. When installed in tabular mode, SSAS supports the creation of data models that include data from multiple diverse sources, including ODBC-based data sources such as Hive tables.

You can use the following techniques to integrate HDInsight results into a corporate data model:

- Create Hive tables in HDInsight and consume them directly from a SQL Server Analysis Services (SSAS) tabular model by using the ODBC driver for Hive. SSAS in tabular mode supports the creation of data models from multiple data sources and includes an OLE DB provider for ODBC, which can be used as a wrapper around the ODBC driver for Hive.
- Create Hive tables in HDInsight and then create a linked server in the instance of the SQL Server database source used by an SSAS multidimensional data model, so that the Hive tables can be queried through the linked server and imported into the data model. SSAS in multidimensional mode can only use a single OLE DB data source, and the OLE DB provider for ODBC is not supported.
- Use Sqoop or SSIS to copy the data from HDInsight to a SQL Server database engine instance that can then be used as a source for an SSAS tabular or multidimensional data model.
When installed in multidimensional mode, SSAS data models cannot be based on ODBC sources due to some restrictions in the designers for multidimensional database objects. To use Hive tables as a source for a multidimensional SSAS model, you must either extract the data from Hive into a suitable source for the multidimensional model (such as a SQL Server database), or use the ODBC driver for Hive to define a linked server in a SQL Server instance that provides pass-through access to the Hive tables, and then use the SQL Server instance as the data source for the multidimensional model. You can download a case study that describes how to create a multidimensional SSAS model that uses a linked server in a SQL Server instance to access data in Hive tables.
Data Warehouse Level Integration

Integration at the corporate data model level enables you to include data from HDInsight in a data model that is used for analysis and reporting by multiple users, without changing the schema of the enterprise data warehouse. However, in some cases you might want to use HDInsight as a component of the overall ETL solution that is used to populate the data warehouse; in effect using a source of Big Data queried through HDInsight just like any other business data source in the BI solution, and consolidating the data from all sources to an enterprise dimensional model.

Full integration with a data warehouse at the data level allows HDInsight to become just another data source that is consolidated into the system, and which can be used in exactly the same way as data from other sources. In addition, as with other data sources, it's likely that the data import process from HDInsight into the database tables will occur on a schedule, depending on the time taken to execute the queries and perform ETL tasks prior to loading it into the database tables, so that the data is as up to date as possible.

You can use the following techniques to integrate data from HDInsight into an enterprise data warehouse:

- Use Sqoop to copy data from HDInsight directly into database tables. These might be tables in a SQL Server data warehouse that you do not want to integrate into the dimensional model of the data warehouse, or they may be tables in a staging database where the data can be validated, cleansed, and conformed to the dimensional model of the data warehouse before being loaded into the fact and dimension tables. Any firewalls located between HDInsight and the target database must be configured to allow the database protocols that Sqoop uses.
- Use PolyBase for SQL Server to copy data from HDInsight directly into database tables. These might be tables in a SQL Server data warehouse that you do not want to integrate into the dimensional model of the data warehouse, or they may be tables in a staging database where the data can be validated, cleansed, and conformed to the dimensional model of the data warehouse before being loaded into the fact and dimension tables. Any firewalls located between HDInsight and the target database must be configured to allow the database protocols that SQL Server uses.
- Create an SSIS package that reads the output file from HDInsight, or uses the ODBC driver for Hive to extract data from HDInsight, and then validates, cleanses, and transforms the data before loading it into the fact and dimension tables in the data warehouse.
PolyBase is a component of SQL Server Parallel Data Warehouse (PDW) and is available only on PDW appliances. For more information see the PolyBase page on the SQL Server website. When reading data from HDInsight you must open port 1000 in the HDInsight management portal. When reading data from SQL Server using a connector you must enable the appropriate firewall rules. For more information see Configure the Windows Firewall to Allow SQL Server Access.

Data Sources for the HDInsight Integration with Enterprise BI Model

The input data for HDInsight can be almost anything, but for the HDInsight integration with enterprise BI model it typically includes the following:

- Social media data, log files, sensors, and applications that generate data files.
- Datasets obtained from Windows Azure Data Market and other commercial data providers.
- Streaming data filtered or processed through SQL Server StreamInsight.

Output Targets for the HDInsight Integration with Enterprise BI Model

The results from your HDInsight queries can be visualized using any of the wide range of tools that are available for analyzing data, combining it with other datasets, and generating reports. Typical examples for the HDInsight integration with enterprise BI model are:

- SQL Server Reporting Services.
- SharePoint Server or another information management system.
- Business performance dashboards such as PerformancePoint Services in SharePoint Server.
- Interactive analytical tools such as Excel, PowerPivot, Power View, and GeoFlow.
- Custom or third-party analysis and visualization tools.
You will see more details of these tools in Chapter 5, Consuming and Visualizing the Results, of this guide.

Considerations for the HDInsight Integration with Enterprise BI Model

There are some important points to consider when choosing the HDInsight integration with enterprise BI model for querying and visualizing data with HDInsight. This model is typically used when you want to:

- Integrate external data sources with your enterprise data warehouse.
- Augment the data in your data warehouse with external data.
- Update the data at scheduled intervals or on demand.

ETL processes in a data warehouse usually execute on a scheduled basis to add new data to the warehouse. If you intend to integrate HDInsight into your data warehouse so that it updates the information stored there, you must consider how you will automate and schedule the tasks of executing the HDInsight query and importing the results.

You must ensure that data imported from HDInsight contains valid values, especially where there are typically multiple common possibilities (such as in street addresses and city names). You may need to use a data cleansing process such as Data Quality Services to force such values to the correct leading value.

Most data warehouse implementations use slowly changing dimensions to manage the history of values that change over time. Different versions of the same dimension member have the same alternate key but unique surrogate keys, and so you must ensure that data imported into the data warehouse tables uses the correct surrogate key value. This means that you must either:

- Use some complex logic to match the business key in the source data (which will typically be the alternate key) with the correct surrogate key in the data model when you join the tables. If you simply join on the alternate key, some loss of data accuracy may occur because the alternate key is not guaranteed to be unique.
- Load the data into the data warehouse and conform it to the dimensional data model, including setting the correct surrogate key values.

For more information about leading values and surrogate keys in an enterprise BI system see Appendix B, An Overview of Enterprise Business Intelligence.
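To make the surrogate key matching concrete, the following Python sketch shows one way to resolve an alternate key and an event date against a type 2 slowly changing dimension, where each version of a member has its own surrogate key and a validity date range. The table layout and values are hypothetical examples, not data from this guide.

```python
from datetime import date

# Hypothetical type 2 dimension rows: each version of a product has its own
# surrogate key and a validity window (end_date is None for the current row).
dim_product = [
    {"surrogate_key": 101, "alternate_key": "BK-1001",
     "start_date": date(2012, 1, 1), "end_date": date(2012, 12, 31)},
    {"surrogate_key": 205, "alternate_key": "BK-1001",
     "start_date": date(2013, 1, 1), "end_date": None},
]

def lookup_surrogate_key(alternate_key, event_date, dimension):
    """Return the surrogate key whose validity window covers event_date."""
    for row in dimension:
        if (row["alternate_key"] == alternate_key
                and row["start_date"] <= event_date
                and (row["end_date"] is None or event_date <= row["end_date"])):
            return row["surrogate_key"]
    raise KeyError(f"No dimension member for {alternate_key} on {event_date}")

# A fact row imported from HDInsight carries only the alternate key, so the
# correct version of the member depends on when the event occurred.
print(lookup_surrogate_key("BK-1001", date(2012, 6, 1), dim_product))   # 101
print(lookup_surrogate_key("BK-1001", date(2013, 5, 15), dim_product))  # 205
```

In a real warehouse this matching is usually done in set-based SQL or in the ETL tool, but the logic is the same: joining on the alternate key alone would match both versions of the product, while adding the date range picks the single correct surrogate key.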
One of the trickiest parts of full data warehouse integration is matching rows imported from a Big Data solution to the correct members in the data warehouse dimension tables. You must use a combination of the alternate key and the point in time to which the imported row relates to look up the correct surrogate key. This key can differ based on the date that changes were made to the original entity. For example, a product may have more than one surrogate key over its lifetime, and the imported data must match the correct version of this key.

The following table summarizes the primary considerations for deciding whether to adopt one of the three integration approaches described in this chapter. Each approach has its own benefits and challenges.

Integration Level: None
Typical scenarios: No enterprise BI solution currently exists. The organization wants to evaluate HDInsight and Big Data analysis without affecting current BI and business operations. Business analysts in the organization want to explore external or unmanaged data sources not included in the managed enterprise BI solution.
Considerations: Open-ended exploration of unmanaged data could produce a lot of business information, but without rigorous validation and cleansing the data might not be completely accurate. Continued experimental analysis may become a distraction from the day-to-day running of the business, particularly if the data being analyzed is not related to core business data.

Integration Level: Report Level
Typical scenarios: A small number of individuals with advanced self-service reporting and data analysis skills need to augment corporate data that is available from the enterprise BI solution with Big Data from other sources. A single report combining corporate data and some external data is required for one-time analysis.
Considerations: It can be difficult to find common data values on which to join data from multiple sources to gain meaningful comparisons. Integrated data in a report can be difficult to share and reuse, though the ability to share PowerPivot workbooks in SharePoint Server may provide a solution for small to medium sized groups of users.

Integration Level: Corporate Data Model Level
Typical scenarios: A wide audience of business users must consume multiple reports that rely on data from both the corporate data warehouse and HDInsight. These users may not have the skills necessary to create their own reports and data models. Data from HDInsight is required for specific business logic in a corporate data model that is used for reports and dashboards.
Considerations: It can be difficult to find common data values on which to join data from multiple sources. Integrating data from HDInsight into tabular SSAS models can be accomplished easily through the ODBC driver for Hive. Integration with multidimensional models is more challenging.

Integration Level: Data Warehouse Level
Typical scenarios: The organization wants a complete, managed BI platform for reporting and analysis that includes business application data sources as well as Big Data sources. The desired level of integration between data from HDInsight and existing business data, and the required tolerance for data integrity and accuracy across all data sources, necessitates a formal dimensional model with conformed dimensions.
Considerations: Data warehouses typically have demanding data validation requirements. Loading data into a data warehouse that includes slowly changing dimensions with surrogate keys can require complex ETL processes.
For an example of a scenario that uses the enterprise data warehouse and BI integration architectural style, see Scenario 4, Enterprise BI Integration, in Chapter 11 of this guide.

Summary

In this chapter you have seen the general use cases for using HDInsight as part of your business information and data management systems. There are, of course, many specific use cases that fall within those identified in this chapter. However, all are subcases of the models you have seen in this chapter, which are:

- As a data analysis, experimentation, and visualization tool for handling data that you cannot process using existing systems.
- As a data transfer, data cleansing, or ETL mechanism to extract and transform data before you load it into your existing databases or data visualization tools.
- As a basic data warehouse or commodity storage mechanism to store both the source data and the results of queries executed over this data, or as a data repository that is robust and reasonably low cost to maintain.
- Integration with an enterprise data warehouse and BI system at different levels, depending on the way that you intend to use the data obtained from HDInsight.
Each model has different advantages and suits different scenarios. There is no single approach that is best, and you must decide based on your own circumstances, the existing tools you use, and the ultimate aims you have for your Big Data implementation. There's nothing to prevent you from using it in all of the ways you have seen in this chapter!

You have also seen how analytics and data visualization are vitally important parts of most organizations' IT systems. Organizations typically require their databases to power the daily processes of taking orders, generating invoices, paying employees, and a myriad of other tasks. However, accurate reporting information and analysis of the data the organization holds and collects from external sources is becoming equally vital to maintain competitiveness, minimize costs, and successfully exploit business opportunities.

More Information

For more information about HDInsight, see the Windows Azure HDInsight web page.

For more details of how HDInsight uses Windows Azure blob storage, see Using Windows Azure Blob Storage with HDInsight on the Windows Azure HDInsight website.

For more information about Microsoft SQL Server data warehouse solutions see Data Warehousing on the SQL Server website.

The document Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS) contains a wealth of useful information about using SSIS with HDInsight.
Chapter 3: Collecting and Loading Data
This chapter explores how you can get data into your Big Data solution. It focuses primarily on Windows Azure HDInsight, but many of the techniques you will see described here are equally relevant to Big Data solutions built on other Big Data frameworks and implementations. The chapter starts with an exploration of several different but typical data ingestion patterns that are generally applicable to any Big Data solution. These patterns include ways to handle streaming data and techniques for automating the ingestion process. The chapter then describes how you might implement these patterns when using Windows Azure HDInsight, including details of how you can apply compression to the source data before you load it. Subsequent chapters of this guide explore techniques for executing queries after you have loaded the data, and consuming the results of your queries.

Challenges
The first stage in using a Big Data solution to analyze your data is to load that data into the cluster storage. In some cases, such as when the data source is an internal business application or database, extracting the data into a text file that can be consumed by your solution is relatively straightforward. In the case of external data obtained from sources such as governments and commercial data providers, the data is often available for download in text format. However, in many cases you may need to extract data through a web service or other API. As you saw in the section Data Storage in Windows Azure HDInsight in Chapter 2, HDInsight for Windows Azure uses an HDFS-compatible storage implementation that stores the data in Windows Azure blob storage as an Azure Storage Vault (ASV), instead of on each server in the cluster. The data for an HDInsight cluster is transferred from blob storage to each data node in the cluster when you execute a query. Chapter 2 explains why Windows Azure HDInsight uses blob storage for the cluster data.
One advantage is that you can upload data to HDInsight before you define and initialize your cluster, or when the cluster is not running, which helps to minimize the runtime costs of the cluster. Generally, you submit source data in the form of a text file, often delimited but sometimes completely unstructured. Source files for data processing in HDInsight are uploaded to folders in Windows Azure blob storage, from where HDInsight map/reduce jobs can load and process them. Usually this is the ASV container associated with the cluster, although you can upload the files first and then associate a new cluster with the blob container. There are several challenges that you face when loading the data into cluster storage. First, unless you are simply experimenting with small datasets, you must consider how you will handle very large volumes of data. While small volumes of data can be copied into Windows Azure blob storage interactively by using the HDInsight interactive console, Visual Studio, or other tools, you will need to build a more robust solution capable of handling large files when you move beyond the experimentation stage. You may also need to load data from a range of different data sources such as websites, RSS feeds, custom applications and APIs, relational databases, and more. It's vital to ensure that you can submit this data efficiently and accurately to cluster storage, including performing any preprocessing that may be required to convert the data into a suitable form. For example, you may want to split the data into smaller files for easier selection and uploading, such as creating separate files for each month in the data, which can reduce the processing and file handling required for some queries.
However, using many small files can hurt performance, and so you may want to combine them into larger files to make better use of the splitting mechanisms that Hadoop uses when reading, sorting, and writing data, which can improve performance in some cases. You may also need to format individual parts of the data by, for example, combining fields in an address, removing duplicates, converting numeric values to their text representation, or changing date strings to standard numerical date values. The operations that you need to carry out might depend on the original data source and format, the type of query that you intend to perform, and the format you require for the results. You may even want to perform some automated data validation and cleansing by using a technology such as SQL Server Data Quality Services before submitting the data to cluster storage. For example, you might need to convert different versions of the same value into a single leading value (such as changing NY and Big Apple into New York). However, keep in mind that many of these data preparation tasks may not be practical, or even possible, when you have very large volumes of data. They are more likely to be possible when you stream data from your data sources, or extract it in small blocks on a regular basis. Where you have large volumes of data to process you will probably perform these preprocessing tasks within your Big Data solution as the initial steps in a series of transformations and queries. Be careful when removing information from the data; if possible keep a copy of the original files. You may subsequently find the fields you removed are useful as you refine queries, or if you use the data for a completely different analytical task. You must also consider security and efficiency, both of the upload process and the storage itself. You can use SSL when uploading data to protect it on the wire.
If you need to encrypt the data during the upload process, and perhaps also while in storage, you must consider how, when, and where you will decrypt it, and how you will encrypt the results if this is also a requirement. In addition to encrypting the data, you may also need to compress it to minimize storage use, and to minimize data transfer time and cost. Finally, you must consider how the input data will arrive for processing. It may be in the form of files that are ready to be uploaded (after any pre-processing that is required), but in some cases you may need to handle data that arrives as a stream. This can impose challenges because Hadoop is a batch processing job engine, which means that you must be able to convert and buffer the incoming data so that it can be processed in batches.
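The stream-to-batch conversion just described can be illustrated with a simple micro-batching sketch. This is not specific to HDInsight or StreamInsight; the file names and batch size are arbitrary. A stream of incoming lines is rolled into fixed-size files, each of which could then be uploaded to cluster storage for batch processing:

```shell
# Micro-batch sketch: roll a stream of incoming lines into files of at
# most 5 lines each, ready for batch upload. Names are illustrative.
mkdir -p batches
printf '%s\n' e1 e2 e3 e4 e5 e6 e7 | split -l 5 - batches/batch_
ls batches
```

A real solution would typically rotate files on a time window rather than a line count, and would upload each file as it is completed.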
This chapter addresses all of the challenges described here. It covers:
Common patterns for loading data into a Big Data solution such as HDInsight.
Tools and technologies for HDInsight data ingestion.
Considerations for uploading data to HDInsight, such as compression and the use of custom upload tools.
Common Data Load Patterns for Big Data Solutions
This section describes some common use cases for obtaining source data and uploading it to your cluster storage. The patterns are:
Interactive data ingestion. This approach is typically used for small volumes of data and when experimenting with datasets.
Automated batch upload. If you decide to integrate a Big Data query process with your existing systems you will typically create an automated mechanism for uploading the data into the cluster storage.
Real-time data stream capture. If you need to process data that arrives as a stream you must implement a mechanism that loads this data into cluster storage.
Loading relational data. Sometimes you might need to load data into a Big Data solution from a relational database. There are several tools you can use to achieve this.
Loading web server log files. Big Data solutions are often used to analyze server log files. You must decide how you will transfer these files into cluster storage for processing.
The following sections of this chapter describe each of these patterns in more detail. A suggested architectural implementation is described for each pattern. Practical implementations of the first three patterns in this list, based on collecting and loading data from Twitter, are described in Appendix A of this guide. For information about how you can upload data directly into Hive external tables, see the section Processing Data with Hive in Chapter 4.

Interactive Data Ingestion
An initial exploration of data with a Big Data solution often involves experimenting with a relatively small volume of data. In this case there is no requirement for a complex, automated data ingestion solution, and the tasks of obtaining the source data, preparing it for processing, and uploading it to cluster storage can be performed interactively. There are several ways you can upload data to your HDInsight cluster. However, creating automated processes to do this is probably only worthwhile when you know you will repeat the operation on a regular basis. If the source data is not already available as an appropriately formatted file, you can use Excel to extract a relatively small volume of tabular data from a data source, reformat it as required, and save it as a delimited text file. Excel supports a range of data sources, including relational databases, XML documents, OData feeds, and the Windows Azure Data Market. You can also use Excel to import a table of data from any website, including an RSS feed. In addition to the standard Excel data import capabilities, you can use add-ins such as the Data Explorer to import and transform data from a wide range of sources. After you have used Excel to prepare the data and save it as a text file, you can upload it using the hadoop dfs -copyFromLocal [source] [destination] command at the Hadoop command line, or upload it to an HDInsight cluster by using the fs.put() command in the interactive console.
These approaches are useful if you are just experimenting or working on a proof of concept. If you need to upload larger files you can use a command line tool such as AzCopy to upload data to ASV storage, a UI-based tool such as Windows Azure Storage Explorer, a script that uses the Windows Azure PowerShell Cmdlets, or Server Explorer in Visual Studio. This approach for HDInsight is shown in Figure 1.
Figure 1 Interactive Data Ingestion

Automated Batch Upload
When a Big Data solution grows beyond the initial exploration stages it is common to automate data ingestion so that data can be extracted from sources and loaded into the cluster on a scheduled basis without human intervention. SQL Server Integration Services (SSIS) provides a platform for building automated Extract, Transform, and Load (ETL) workflows that perform the necessary tasks to extract source data, apply transformations, and upload the resulting data to the cluster. An SSIS solution consists of one or more packages, each containing a control flow to perform a sequence of tasks. Tasks in a control flow can include calls to web services, FTP operations, file system tasks, command line tasks, and others. In particular, a control flow usually includes one or more data flow tasks that encapsulate an in-memory, buffer-based pipeline of data from a source to a destination, with transformations applied to the data as it flows through the pipeline. You can create an SSIS package that extracts data from a source and uploads it to the cluster, and then automate execution of the package using the SQL Server Agent service. The package usually consists of a data flow to extract the data from its source through an appropriate source adapter, and then apply any required transformations to the data. To get the data into the appropriate destination folder you can use one of the following options:
Procure or implement a custom destination adapter that programmatically writes the data to the cluster, and add it to the data flow.
Use a task in the data flow in the SSIS package to download the data to a local file, and then use a command line task in the control flow to automate a file upload utility that uploads the file.
As an alternative you can use scripts, tools, or utilities to perform the upload, perhaps driven by a scheduler such as Windows Scheduled Tasks. Some options for this approach are:
Use the Hadoop .NET HDFS File Access library that is installed in the C:\Apps\dist\contrib\WinLibHdfs folder of HDInsight.
Use a third party utility or library that can upload files to Windows Azure blob storage.
Use the Windows Azure blob storage API. For more details, see How to use the Windows Azure Blob Storage Service in .NET.
Create a custom utility using the .NET SDK for Hadoop. For more details, see the section Using the WebHDFSClient in the Microsoft .NET SDK for Hadoop later in this chapter.
Figure 2 shows an automated batch upload architecture for HDInsight.
Figure 2 Automated batch upload with SSIS

For more information about creating ETL solutions with SSIS, see SQL Server Integration Services in SQL Server Books Online. There is also a set of sample components available to simplify creating SSIS packages for uploading data to Windows Azure blob storage. For more information, see Azure Blob Storage Components for SSIS Sample.
If you already use SQL Server Integration Services as part of your BI system, you should consider creating packages that automate the uploading of data to HDInsight for regular analytical processes.
Real-Time Data Stream Capture In some scenarios the data that must be analyzed is available only as a real-time stream of events; for example, from a streaming web feed or electronic sensors. Microsoft StreamInsight is a complex event processing (CEP) engine with a framework API for building applications that consume and process event streams. Big Data solutions are optimized for batch processing of large volumes of data, but you can use StreamInsight to capture events that are redirected to a real-time visualization solution and are also captured for submission to the Big Data solution storage for historic analysis. StreamInsight is available as an on-premises solution, and is under development as a Windows Azure service. Regardless of the implementation you decide to use, a StreamInsight application consists of a standing query that uses LINQ syntax to extract events from a data stream. Events are implemented as classes or structs, and the properties defined for the event class provide the data values for visualization and analysis. You can create a StreamInsight application by using the Observable/Observer pattern, or by implementing input and output adapters from base classes in the StreamInsight framework.
Big Data solutions cannot handle streaming data sources; they work by querying across distributed files stored on disk. StreamInsight is one solution for converting incoming stream data into files that can be included in a batch process such as an HDInsight query, although you can create your own mechanism for capturing and storing data from a stream. When using StreamInsight to capture events for analysis in a Big Data solution you can implement code to write the captured events to a local file for batch upload to the cluster storage, or you can implement code that writes the event data directly to the cluster storage. Your code can append each new event to an existing file, create a new file for each individual event, or periodically create a new file based on temporal windows in the event stream.
Figure 3 shows a real-time data stream capture architecture for HDInsight.
Figure 3 Real-time data stream capture with StreamInsight

For an example of using StreamInsight, see Appendix A of this guide. For more information about developing StreamInsight applications, see Microsoft StreamInsight on MSDN.

Loading Relational Data
Source data for analysis is often obtained from a relational database. To support this common scenario you can use Sqoop to extract the data you require from a table, view, or query in the source database and save the results as a file in your cluster storage. In HDInsight this approach makes it easy to transfer data from a relational database when your infrastructure supports direct connectivity from the HDInsight cluster to the database server, such as when your database is hosted in Windows Azure SQL Database or a virtual machine in Windows Azure. Approaches for loading relational data into HDInsight are shown in Figure 4.
Figure 4 Loading relational data into HDInsight

It's quite common to use HDInsight to query data extracted from a relational database. For example, you may use it to search for hidden information in data exported from your corporate database or data warehouse as part of a sandboxed experiment, without absorbing resources from the database or risking damage to the existing data.

Loading Web Server Log Files
Analysis of web server log data is another common use case for Big Data solutions, and requires that log files be uploaded to the cluster storage. Flume is an open source project for an agent-based framework to copy log files to a central location, and is often used to load web server logs to HDFS for processing in Hadoop. Flume is not included in HDInsight, but can be downloaded from the Flume project web site and installed on Windows. As an alternative to using Flume, you can use SSIS to implement an automated batch upload solution, as described earlier. Options for uploading web server log files in HDInsight are shown in Figure 5.
Figure 5 Loading web server logs into HDInsight

Uploading Data to HDInsight
The preceding sections of this chapter describe the general patterns for uploading data to the cluster storage of a Big Data solution. The interactive data ingestion approach described earlier in this chapter is fine for experimentation and carrying out one-off analysis of datasets. However, when you have a large number of files to upload, or you want to implement a repeatable solution, using the interactive console is not practical. Instead, you may want to take advantage of third party tools, or create your own custom mechanisms to upload data. In addition, in some situations you may need to encrypt the data in storage, or automatically compress it as part of an automated upload process. HDInsight for Windows Azure uses Windows Azure blob storage to host the data. In this section you will see some of the practical considerations and implementation guidance for automating data upload to Windows Azure blob storage. You will find more details about compressing the data later in this chapter.

Tools and Technologies for HDInsight Data Ingestion
You can create custom applications or scripts to extract source data, or use off-the-shelf tools and platforms. Common tools on the Microsoft platform for extracting data from a source include Excel, SQL Server Integration Services (SSIS), and StreamInsight. To upload the data to HDInsight you can use a variety of file upload tools such as AzCopy (which can be used as a standalone command line tool or as a library from your own code), or you can build a custom solution. The Hadoop ecosystem includes tools that are specifically designed to simplify the loading of particular kinds of data into cluster storage. Sqoop provides a mechanism for transferring data between a relational database and cluster storage, and Flume provides a framework for copying web server log files to an HDFS location in cluster storage.
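As an illustration of the Sqoop mechanism mentioned above, a typical import command has the following general shape. The server, credentials, table, and target directory shown here are all placeholders, and the exact arguments available depend on the Sqoop version and JDBC driver you use:

```
sqoop import \
  --connect "jdbc:sqlserver://yourserver.database.windows.net;database=YourDb" \
  --username your-user --password your-password \
  --table Orders \
  --target-dir /data/orders \
  --fields-terminated-by '\t'
```

The result is a set of delimited text files in the target directory of cluster storage, ready to be queried by Hive, Pig, or map/reduce jobs.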
For a useful list of third party tools, see How to Upload Data to HDInsight on the Windows Azure website. This page also demonstrates how to use the Windows Azure Storage Explorer. When experimenting with HDInsight, remember that you can access Windows Azure storage from Server Explorer after you install the Windows Azure SDK for Visual Studio. You can also use the Windows Azure PowerShell Cmdlets in your scripts to upload data to Windows Azure blob storage.
Consider the following points when you are planning your data upload strategy:
Many of the HDInsight tools (such as Hive and Pig) use the complete contents of a folder as a data source, rather than an individual file, so you can split your data across multiple files in the same folder and still be able to process it in a single query. However, as a general rule it is usually more efficient to consolidate your data into larger files before uploading them if you have a very large number of small files, or to append new data to existing files. The default block size that HDInsight uses to distribute data for processing in parallel across multiple cluster nodes is 256 MB, although this can be changed by setting the dfs.block.size property. In some cases (such as when processing image files) it may not be possible to combine small files.
If possible, choose a utility that can upload data in parallel using multiple threads to reduce upload time. Some utilities may be able to upload multiple blocks or small files in parallel and then combine them into larger files after they have been uploaded. The optimum number of threads is typically between 8 and 16, though you should choose a number based on the number of logical processors and the bandwidth available. An excessive number of concurrent operations will have an adverse effect on overall system performance due to thread pool demands and frequent context switching.
You can often improve the efficiency and reduce the runtime of jobs that use large volumes of data by compressing the data you upload. You can do this before you upload the data, or within a custom utility as part of the upload process.
Consider using SSL for the connection to your storage account to protect the data on the wire. The URI scheme for ASVS provides SSL encrypted access, which is recommended.
If you are handling sensitive data that must be encrypted you will need to write custom serializer and deserializer classes and install these in the cluster for use as the SerDe parameter in Hive statements, or create custom map/reduce components that can manage the serialization and deserialization.
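The file-consolidation advice above can be sketched very simply: small delimited files that share a format and contain no header rows can be concatenated into one larger file before upload. All file names here are illustrative:

```shell
# Consolidate many small delimited files into one larger file before
# upload. Assumes a shared format and no header rows in the files.
mkdir -p daily
printf 'row1\nrow2\n' > daily/day1.txt
printf 'row3\n' > daily/day2.txt
cat daily/*.txt > month.txt
wc -l < month.txt
```

The combined file can then be uploaded in a single operation, giving HDInsight fewer, larger files to split and distribute.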
Combining small files into larger ones, or appending data to existing files, can help to reduce the work HDInsight needs to do when managing the files and moving them between the cluster nodes. However, this may not be practical for some types of files.

Implementing a Custom Data Upload Utility
You may need to create your own custom upload utility if you have special requirements that cannot be met by third-party tools, such as uploading from sources other than local files. If so, consider the following points:
Implement your solution as a library that can be called programmatically from scripts or from custom components in an SSIS data flow or a StreamInsight application. This will ensure that the solution provides flexibility for use in future Big Data projects as well as the current one.
Create a command line utility that uses your library so that it can be used interactively or called from an SSIS control flow.
Include a robust connection retry strategy to maximize reliability. The storage client provided in the Windows Azure SDK uses a built-in retry mechanism. Alternatively, you can use a library such as the Enterprise Library Transient Fault Handling Application Block to implement a comprehensive configurable approach if you prefer.
If you need to upload very large volumes of data you must be aware of Windows Azure blob store limitations such as blob, container, and storage size and the maximum data writing speed.
Investigate the capabilities of existing libraries such as the Windows Azure PowerShell Cmdlets, the Windows Azure storage client library, and the .NET SDK for Hadoop.
An example solution named BlobManager is included in the Hands-on Labs you can download for this guide. You can use it as a starting point for your own custom utilities. Download the labs from http://wag.codeplex.com/releases/view/103405. The source code is in the Lab 4 - Enterprise BI Integration\Labfiles\BlobUploader subfolder.
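The retry-strategy point above can be sketched as a simple shell wrapper that re-runs a failing command with an increasing delay between attempts. This is only an illustration; in practice you would rely on the retry policies built into the Windows Azure storage client or the Transient Fault Handling Application Block:

```shell
# Minimal connection-retry sketch for an upload step (illustrative).
upload_with_retry() {
  attempts=0
  max_attempts=3
  until "$@"; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge "$max_attempts" ]; then
      return 1
    fi
    sleep "$attempts"   # simple increasing back-off between attempts
  done
}

# "false" stands in for an upload command that keeps failing.
upload_with_retry false || echo "upload failed after retries"
```

A production utility would also distinguish transient errors (timeouts, throttling) from permanent ones (authentication failures), and only retry the former.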
For information about the limitations of Windows Azure blob storage, see Understanding Block Blobs and Page Blobs on MSDN.

Using the WebHDFSClient in the Microsoft .NET SDK for Hadoop
The Microsoft .NET SDK for Hadoop provides the WebHDFSClient library, which enables you to create custom solutions for uploading and manipulating files in the HDFS file system. By using this library you can implement managed code that integrates the processes of obtaining data from a source and uploading it to the HDFS file system of a Hadoop cluster, which is implemented as a Windows Azure blob storage container in HDInsight. Figure 6 shows the relevant parts of the .NET SDK for Hadoop.
Figure 6 The WebHDFSClient is part of the .NET SDK for Hadoop

To use the WebHDFSClient class you must use NuGet to import the Microsoft.Hadoop.WebClient package in the Package Manager PowerShell console, as shown here.

PowerShell
install-package Microsoft.Hadoop.WebClient

With the package and all its dependencies imported, you can create a custom application that automates HDFS tasks in the cluster. For example, the following code sample shows how you can use the WebHDFSClient class to upload a data file to the cluster.

C#
namespace MyWebHDFSApp
{
  using System;
  using Microsoft.Hadoop.WebHDFS;
  using Microsoft.Hadoop.WebHDFS.Adapters;

  public class Program
  {
    private const string HadoopUser = "your-user-name";
    private const string AsvAccount = "your-asv-account-name";
    private const string AsvKey = "your-storage-key";
    private const string AsvContainer = "your_container";

    static void Main(string[] args)
    {
      var localFile = @"C:\data\data.txt";

      // Connect to the ASV (blob storage) container used by the cluster.
      var storageAdapter = new BlobStorageAdapter(AsvAccount, AsvKey, AsvContainer, true);
      var hdfsClient = new WebHDFSClient(HadoopUser, storageAdapter);

      // Create a folder in cluster storage and upload the local file to it.
      Console.WriteLine("Creating folder and uploading data...");
      hdfsClient.CreateDirectory("/mydata").Wait();
      hdfsClient.CreateFile(localFile, "/mydata/data.txt").Wait();

      Console.WriteLine("\nPress any key to continue.");
      Console.ReadKey();
    }
  }
}

Compressing the Source Data
One technique that can, in many cases, improve query performance and reduce execution time for large source files is to compress the source data before, or as, you upload it to your cluster storage. This reduces the time it takes for each node in the cluster to load the data from storage into memory by reducing I/O and network usage. However, it also increases the processing overhead on each node, and so compression cannot be guaranteed to reduce execution time. Compressing the source data before, or as, you load it into HDInsight can reduce query execution times and improve query efficiency. If you intend to run the query regularly you should test both with and without compression to see which provides the best performance, and retest over time if the volume of data you are querying grows. However, before you experiment with compressing the source data for a job, apply compression to the mapper and reducer outputs. This is easy to do and it may provide a larger gain in performance; it might be all you need to do in order to achieve acceptable performance. You must experiment with and without compression, and perhaps combine the tests with other changes to the configuration, such as the number of mappers or the total number of cluster nodes, to determine the optimum solution. Compression typically provides the best performance gains for jobs that are not CPU intensive. For information about maximizing runtime performance of queries, see the section Tuning the Solution in Chapter 4. For information about how you can set configuration values in a query, see the section Configuring HDInsight Clusters and Jobs in Chapter 7.
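As an illustration of enabling compression within the process, the Hadoop 1.x-era property names current at the time of writing can be set for a Hive session as shown below. These property names are assumptions that may differ between Hadoop and Hive versions, so check the documentation for your cluster:

```
SET hive.exec.compress.output=true;
SET mapred.compress.map.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```

The first setting compresses the final job output, the second compresses the intermediate mapper output that is shuffled between nodes, and the third selects the codec to use.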
Compression Libraries Available in HDInsight
At the time of writing, HDInsight contains three compression codecs, though you can install others if you wish. The following table shows the class name of each codec, the standard file extension for files compressed with the codec, and whether the codec supports split file compression and decompression.

Format    Codec                                         Extension   Splittable
DEFLATE   org.apache.hadoop.io.compress.DefaultCodec   .deflate    No
GZip      org.apache.hadoop.io.compress.GzipCodec      .gz         No
BZip2     org.apache.hadoop.io.compress.BZip2Codec     .bz2        Yes

A codec that supports splittable compression and decompression allows HDInsight to decompress the data in parallel across multiple mapper and node instances, which typically provides better performance. However, splittable codecs are less efficient at runtime, so there is a trade-off in efficiency between each type. There is also a difference in the size reduction (compression rate) that each codec can achieve. For the same data, BZip2 tends to produce a smaller file than GZip but takes longer to perform the decompression.

Considerations for Compressing the Source Data
Consider the following points when you are deciding whether to compress the source data:
Compression may not produce any improvement in performance with small files. However, with very large files (for example, files over 100 GB) compression is likely to provide a dramatic improvement. The gains in performance also depend on the contents of the file and the level of compression that was achieved.
When optimizing a job, enable compression within the process using the configuration settings to compress the output of the mappers and the reducers before experimenting with compression of the source data. Compression within the job stages often provides a more substantial gain in performance compared to compressing the source data.
Consider using a splittable algorithm such as BZip2 for very large files so that they can be decompressed in parallel by multiple tasks. Use the default file extension for the files if possible. This allows HDInsight to detect the file type and automatically apply the correct decompression algorithm. If you use a different file extension you must set the io.compression.codec property for the job or for the cluster to indicate the codec used.
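The file-extension point above can be seen with the standard gzip tool: compressing a file produces the default .gz extension, which allows the correct decompression codec to be inferred from the file name. The sample file here is illustrative:

```shell
# Compress a sample file with GZip, keeping the default .gz extension so
# that the decompression codec can be detected from the file name.
printf 'col1,col2\nval1,val2\n' > sample.txt
gzip sample.txt            # replaces sample.txt with sample.txt.gz
gunzip -c sample.txt.gz    # round-trip check
```

The same .gz file, once uploaded, can be read directly by HDInsight jobs without any explicit codec configuration.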
For more information about choosing a compression codec and compressing data, see the Microsoft White Paper Compression in Hadoop. For more information about using the compression codecs in Hadoop, see the documentation for the CompressionCodecFactory and CompressionCodec classes on the Apache website.

Techniques for Compressing Source Data
To compress your data, you can usually use the tools provided by the codec supplier. For example, the downloadable libraries for both GZip and BZip2 include tools that run on Windows and can help you apply compression. For more details see the distribution sources for GZip and BZip2 on SourceForge. You can also use the classes in the .NET Framework to perform GZip and DEFLATE compression on your source files, perhaps by writing command line utilities that are executed as part of an automated upload and processing sequence. For more details see the System.IO.Compression.GZipStream and System.IO.Compression.DeflateStream reference sections on MSDN. Another alternative is to use the tools included with Hadoop. For example, you can create a query job that is configured to write output in compressed form using one of the built-in codecs, and then execute the job against existing uncompressed data in storage so that it selects all or some part of the source data and writes it back to storage in compressed form. For an example of using Hive to do this, see the Microsoft White Paper Compression in Hadoop.

Summary
This chapter described the typical patterns and techniques you can use to load data into a Big Data solution, focusing particularly on loading a cluster in Windows Azure HDInsight, and the tools available to do this. It described interactive and automated approaches, real-time data stream capture, and some options for loading relational data and server log files.
The chapter also explored some techniques for creating your own custom utilities to upload data, using third party tools and libraries, and additional considerations such as compressing the source data. In the next chapter you will see how you can execute queries and transformations on the data you have loaded into your cluster.

More Information
For more information about HDInsight, see the Windows Azure HDInsight web page. For a guide to uploading data to HDInsight, and some of the tools available to help, see How to Upload Data to HDInsight on the Windows Azure HDInsight website. For more details of how HDInsight uses Windows Azure blob storage, see Using Windows Azure Blob Storage with HDInsight on the Windows Azure HDInsight website. For more information about the Microsoft .NET SDK for Hadoop, see the CodePlex project site.
Chapter 4: Performing Queries and Transformations
After you have loaded your data into HDInsight you can begin generating the query results for analysis. This chapter describes the techniques and tools that you can use in Windows Azure HDInsight to create queries and transformations, and gives a broad overview of how they work and how you can use them. It will help you to understand the options you have, and to choose the most appropriate ones. The chapter also explores the two subsequent solution development stages: evaluating the results and tuning the solution.

Challenges
Big Data frameworks offer a huge range of tools that you can use with the Hadoop core engine, and choosing the most appropriate ones can be difficult. Matching compatible versions that will work well together is particularly challenging. However, Windows Azure HDInsight simplifies the process because all of the tools it includes are guaranteed to be compatible and work correctly together. That doesn't mean you can't incorporate other tools and frameworks in your solution, but doing so may involve considerable effort installing them and integrating them with your process.

A second common challenge is deciding which of the query tools available in HDInsight are the best choices for your specific tasks. At the time of writing, early in the evolution of HDInsight, the most popular query tool was Hive; around 80% of queries are written in HiveQL. No doubt this is partly due to familiarity with SQL-like languages. However, there are good reasons, as you will discover in this chapter, to consider using Pig or writing your own map and reduce components in some circumstances.

Of course, creating the query is only one part of an HDInsight solution. You must consider what you want to do with the data you extract, and decide on the appropriate format for this data. How will you analyze it? What tools will you use to consume it?
Do you need to load it into another database system for further processing, or save it to Windows Azure blob storage for use in another HDInsight query? The answers to these questions will help you to select the appropriate output format, and also contribute to the decision on which HDInsight query tool to use.

The challenges don't end with simply writing and running a query. It's vitally important to check that the results are realistic, valid, and useful before you invest a lot of time and effort (and cost) in developing and extending your solution. As you saw in Chapter 1, a common use of Big Data techniques is simply to experiment with data to see if it can offer insights into previously undiscovered information. As with any investigational or experimental process, you need to be convinced that each stage is producing results that are both valid (otherwise you gain nothing from the answers) and useful (in order to justify the cost and effort). HDInsight is a great tool for experimenting with data to find hidden insights, but you must be sure that the results you get from it are accurate and valid, as well as useful. After you are convinced of the validity of the process and the results, you can consider incorporating the process into your organization's BI systems.

Finally, when you are convinced that your queries are producing results that are both valid and useful, you must decide whether you want to repeat the process on a regular basis. At this point you should consider whether you need to fine-tune the queries to maximize performance, to minimize load on the cluster, or simply to reduce the time they take to run. You may even decide to fine-tune your queries during the experimentation stage if they are taking a long time to run, even with a subset of data, and you need the results more quickly.

This chapter addresses all of the challenges described here. It covers:
- Processing the data.
- Evaluating the results.
- Tuning the solution.
Processing the Data
As you saw in Chapter 1, all HDInsight processing consists of a map job, in which multiple cluster nodes process their own subset of the data, and a reduce job, in which the results of the map jobs are consolidated to create a single output. You can implement map/reduce code to perform the specific processing required for your data, or you can use a query interface such as Pig or Hive to write queries that are translated into map/reduce jobs and executed.

Hive is the most commonly used query tool in HDInsight. The SQL-like syntax of its queries is familiar, and the table format of the output is ideal for use in most scenarios. However, for complex querying processes you might need to execute a Pig script or map/reduce components first, and then layer a Hive table on top of the results.

Although Hive was the most commonly used tool in HDInsight when this guide was being written, many HDInsight processing solutions are actually incremental in nature; they consist of multiple queries, each operating on the output of the previous one. These queries may use different query tools. For example, you might first use a custom map/reduce job to summarize a large volume of unstructured data, and then create a Pig script to restructure and group the data values produced by the initial map/reduce job. Finally, you might create Hive tables based on the output of the Pig script so that client applications such as Excel can easily consume the results. Alternatively, you may create a query that consists just of a series of cascading custom map/reduce jobs.
In this section you will see descriptions of these three HDInsight query techniques in order of increasing complexity, together with simple examples of using them. The three approaches are:
- Using Hive by writing queries in HiveQL.
- Using Pig by writing queries in Pig Latin.
- Writing custom map and reduce components, including using the streaming interface and writing components in .NET languages.
Typically, you will use the simplest of these approaches that can provide the results you require. It may be that you can achieve these results just by using Hive, but for more complex scenarios you may need to use Pig, or even write your own map and reduce components. You may also decide, after experimenting with Hive or Pig, that custom map and reduce components can provide better performance by optimizing the processing. You will see more on this topic later in the chapter. You'll also see information in the following sections about:
- How to choose the appropriate query type for different scenarios.
- How you can create and use a user-defined function (UDF) in a query.
- How you can unify and stabilize query jobs by using HCatalog.
Processing Data with Hive
Fundamentally, all HDInsight processing is performed by running map/reduce batch jobs. Hive is an abstraction layer over map/reduce that provides a query language called HiveQL, which is syntactically very similar to SQL and supports the ability to create tables of data that can be accessed remotely through an ODBC connection. In effect, Hive enables you to create an interface layer over map/reduce that can be used in a similar way to a traditional relational database. Business users can use familiar tools such as Microsoft Excel and SQL Server Reporting Services to consume data from HDInsight in a similar way as they would from a database system such as SQL Server. Installing the ODBC driver for Hive on a client computer enables users to connect to an HDInsight cluster and submit HiveQL queries that return data to an Excel worksheet, or to any other client that can consume results through ODBC.

Hive uses tables to impose a schema on data, and to provide a query interface for client applications. The key difference between Hive tables and those in traditional database systems, such as SQL Server, is that Hive adopts a "schema on read" approach that enables you to be flexible about the specific columns and data types that you want to project onto your data. You can create multiple tables with different schemas from the same underlying data, depending on how you want to use that data. You can also create views and indexes over Hive tables, and partition tables to improve query performance.

Beyond its more usual role as a querying mechanism, Hive can be used to create a simple data warehouse containing table definitions applied to data that you have already processed into the appropriate format. Windows Azure storage is relatively inexpensive, and so this is a good way to create a commodity storage system when you have huge volumes of data.
Hive is a good choice for data processing when:
- The source data has some identifiable structure, and can easily be mapped to a tabular schema.
- The processing you need to perform can be expressed effectively as HiveQL queries.
- You want to create a layer of tables through which business users can easily query source data, and data generated by previously executed map/reduce jobs or Pig scripts.
- You want to experiment with different schemas for the table format of the output.
You can use the Hive command line tool on the HDInsight cluster to work with Hive tables, or you can use the Hive interface in the HDInsight interactive console. You can also use the Hive ODBC driver to connect to Hive from any ODBC-capable client application, and build an automated solution that includes Hive queries by using the .NET SDK for HDInsight and a range of Hadoop-related tools such as Oozie and WebHCat. For more information about automating an HDInsight solution, see Chapter 6 of this guide.

Creating Tables with Hive
You create tables by using the HiveQL CREATE TABLE statement, which in its simplest form looks similar to the equivalent statement in Transact-SQL. You specify the schema in the form of a series of column names and types, and the type of delimiter in the data that Hive will use to identify each column value. You can also specify the format for the files the table data will be stored in if you do not want to use the default format (where data files are delimited by an ASCII code 1 (octal \001) character, equivalent to Ctrl + A). For example, the following code sample creates a table named mytable and specifies that the data files for the table should be tab-delimited.

HiveQL
CREATE TABLE mytable (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Hive supports a sufficiently wide range of data types to suit almost any requirement. The primitive data types you can use for columns in a Hive table are TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, and DECIMAL (though the last three of these are not available in all versions of Hive). In addition to these primitive types you can define columns as ARRAY, MAP, STRUCT, and UNIONTYPE. For more information about Hive data types, see Hive Data Types in the Apache Hive language manual.

Managing Hive Table Data Location and Lifetime
Hive tables are simply metadata definitions imposed on data in underlying files.
By default, Hive stores table data in the /user/hive/warehouse/table_name path in storage (the default path is defined in the configuration property hive.metastore.warehouse.dir), so the previous code sample will create the table metadata definition and an empty folder at /user/hive/warehouse/mytable. When you delete the table by executing the DROP TABLE statement, Hive will delete the metadata definition from the Hive database, and it will also remove the /user/hive/warehouse/mytable folder and its contents.

However, you can specify an alternative path for a table by including the LOCATION clause in the CREATE TABLE statement. The ability to specify a non-default location for the table data is useful when you want to enable other applications or users to access the files outside of Hive. This allows data to be loaded into a Hive table simply by copying data files of the appropriate format into the folder. When the table is queried, the schema defined in its metadata is automatically applied to the data in the files. An additional benefit of specifying the location is that this makes it easy to create a table for data that already exists in the folder (perhaps the output from a previously executed map/reduce job or Pig script). After creating the table, the existing data in the folder can be retrieved immediately with a HiveQL query.

However, one consideration for using an existing location is that, when the table is deleted, the folder it references will also be deleted, even if it already contained data files when the table was created. If you want to manage the lifetime of the folder containing the data files separately from the lifetime of the table, you must use the EXTERNAL keyword in the CREATE TABLE statement to indicate that the folder will be managed externally from Hive, as shown in the following code sample.
HiveQL
CREATE EXTERNAL TABLE mytable (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/mydata/mytable';
This ability to manage the lifetime of the table data separately from the metadata definition of the table means that you can create several tables and views over the same data, but each can have a different schema. For example, you may want to include fewer columns in one table definition to reduce the network load when you transfer the data to a specific analysis tool, but have all of the columns available for another tool. As a general guide you should:
- Use INTERNAL tables (the default, commonly referred to as managed tables) when you want Hive to manage the lifetime of the table, or when the data in the table is temporary; for example, when you are running experimental or one-off queries over the source data.
- Use INTERNAL tables and also specify the LOCATION for the data files when you want to access the data files from outside of Hive; for example, if you want to upload the data for the table directly into the Azure Storage Vault (ASV).
- Use EXTERNAL tables when you want to manage the lifetime of the data, when data is used by processes other than Hive, or if the data files must be preserved when the table is dropped. However, notice that you cannot use EXTERNAL tables when you implicitly create the table by executing a SELECT query against an existing table.
The important point to remember is that a Hive table is simply a metadata schema that is imposed on the existing data in underlying files. Depending on how you declare the table, you can have Hive manage the lifetime of the data or you can elect to manage it yourself. Therefore it's important to understand how the LOCATION and EXTERNAL parameters in a Hive query affect the way that the data is stored, and how its lifetime is linked to that of the table definition.

Loading Data into Hive Tables
In addition to simply uploading data into the location specified for a Hive table, you can use the HiveQL LOAD statement to load data from an existing file into a Hive table. Unless the LOCAL keyword is included, the LOAD statement moves the file from its current location to the folder associated with the table. An alternative is to use a CREATE TABLE statement that includes a SELECT statement to query an existing table. The results of the SELECT statement are used to create data files for the new table. When you use this technique, the new table must be an internal table. Similarly, you can use an INSERT statement to insert the results of a SELECT query into an existing table. You can use the OVERWRITE keyword with both the LOAD and INSERT statements to replace any existing data with the new data being inserted. As a general guide you should:
- Use the LOAD statement when you need to create a table from the results of a map/reduce job or a Pig script. These scripts generate log and status files alongside the output file, and using the LOAD method enables you to easily add the output data to a table without having to deal with the additional files that you do not want to include in the table.
- Use the INSERT statement when you want to load data from an existing table into a different table. A common use of this approach is to upload source data into a staging table that matches the format of the source data (for example, tab-delimited text).
Then, after verifying the staged data, compress and load it into a table for analysis, which may be in a different format such as SEQUENCEFILE.
- Use a SELECT query in a CREATE TABLE statement to generate the table dynamically when you just want simplicity and flexibility. You do not need to know the column details to create the table, and you do not need to change the statement when the source data or the SELECT statement changes. You cannot, however, create an EXTERNAL table this way, and so you cannot control the data lifetime of the new table separately from the metadata definition.
To compress data as you insert it from one table to another, you must set some Hive parameters to specify that the results of the query should be compressed, and specify the compression algorithm to be used. For example, the following HiveQL statements load compressed data from a staging table into another table.

HiveQL
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.intermediate=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
FROM stagingtable s INSERT INTO TABLE mytable SELECT s.col1, s.col2;
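The GzipCodec specified above writes output in standard gzip format, the same format discussed earlier for compressing source data before upload. As a simple illustration of that format (not part of the guide's toolset; the file name and contents here are hypothetical), the following Python sketch writes and reads back a gzip-compressed, tab-delimited data file:

```python
import gzip
import os
import tempfile

# Hypothetical tab-delimited data, written as a gzip-compressed file.
path = os.path.join(tempfile.mkdtemp(), "mytable.txt.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("ValueA\t1\nValueB\t2\n")

# Reading the file back transparently decompresses the content.
with gzip.open(path, "rt", encoding="utf-8") as f:
    content = f.read()

print(content.splitlines())  # ['ValueA\t1', 'ValueB\t2']
```

Files produced this way can be consumed directly by Hadoop's built-in GZip codec, although (as noted earlier in this guide) gzip files are not splittable across mappers.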
For more information about creating tables with Hive, see Hive Data Definition Language on the Apache Hive site. For a more detailed description of using Hive, see the Hive Tutorial.

Querying Tables with HiveQL
After you have created tables and loaded data files into the appropriate locations, you can query the data by executing HiveQL SELECT statements against the tables. As with all data processing on HDInsight, HiveQL queries are implicitly executed as map/reduce jobs to generate the required results. HiveQL SELECT statements are similar to SQL, and support common operations such as JOIN, UNION, and GROUP BY. For example, you could use the following code to query the mytable table described earlier.

HiveQL
SELECT col1, SUM(col2) AS total
FROM mytable GROUP BY col1;

When designing an overall data processing solution with HDInsight, you may choose to perform complex processing logic in custom map/reduce components or Pig scripts, and then create a layer of Hive tables over the results of the earlier processing, which can be queried by business users who are familiar with basic SQL syntax. However, you can use Hive for all processing, in which case some queries may require logic that is not possible to define in standard HiveQL functions. In addition to common SQL semantics, HiveQL supports the inclusion of custom map/reduce scripts embedded in a query through the MAP and REDUCE clauses, as well as custom user-defined functions (UDFs) that are implemented in Java, or that call Java functions available in the existing installed libraries. This extensibility enables you to use HiveQL to perform complex transformations on data as it is queried. UDFs are discussed in more detail later in this chapter.
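Conceptually, Hive compiles a query such as the SELECT ... GROUP BY statement shown above into a map/reduce job: the map phase emits key/value pairs, a shuffle phase groups them by key, and the reduce phase aggregates each group. The following Python sketch illustrates that flow (illustrative only; the sample rows are hypothetical, and this is not how Hive is actually implemented):

```python
from collections import defaultdict

# Hypothetical rows matching the mytable schema (col1 STRING, col2 INT).
rows = [("a", 1), ("b", 5), ("a", 2)]

# Map phase: emit a (key, value) pair for each input row.
mapped = [(col1, col2) for col1, col2 in rows]

# Shuffle phase: collect the intermediate pairs by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate each group, like SUM(col2) ... GROUP BY col1.
totals = {key: sum(values) for key, values in grouped.items()}
print(totals)  # {'a': 3, 'b': 5}
```

Keeping this underlying model in mind helps to explain why even simple HiveQL queries incur the startup and shuffle costs of a map/reduce job.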
Hive is easy to use, and yet is surprisingly powerful when you take advantage of its ability to use functions in Java code libraries, user-defined functions, and custom map and reduce components. To help you decide on the right approach, consider the following guidelines:
- If the source data must be extensively transformed using complex logic before being consumed by business users, consider using custom map/reduce components or Pig scripts to perform most of the processing, and create a layer of Hive tables over the results to make them easily accessible from client applications.
- If the source data is already in an appropriate structure for querying, and only a few specific but complex transforms are required by an experienced analyst with the necessary programming skills, consider using map/reduce scripts embedded in HiveQL queries to generate the required results.
- If queries will be created mostly by business users, but some complex logic is still regularly required to generate specific values or aggregations, consider encapsulating that logic in custom UDFs, because these will be simpler for business users to include in their HiveQL queries than a custom map/reduce script.
For more information about selecting data from Hive tables, see Language Manual - Select on the Apache Hive website.

Maximizing Query Performance in Hive
There are several techniques you can use to maximize query performance when using Hive. These techniques fall into two distinct areas:
- The design of the tables. For example, partitioning the data files so that you minimize the amount of data that a query must handle, and using indexes to reduce the time taken for lookups required by a query.
- The design of the queries that you execute against the tables.
The following sections of this chapter explore these topics in more detail. Also see the section Tuning the Solution later in this chapter, which describes a range of techniques for optimizing query performance through configuration.

Improving Query Performance by Partitioning the Data
Advanced options when creating a table include the ability to partition, skew, and cluster the data across multiple files and folders:
- You can use the PARTITIONED BY clause to create a subfolder for each distinct value in a specified column (for example, to store a file of daily data for each date in a separate folder). Partitioning can improve query performance because Hive will only scan the relevant partitions in a filtered query.
- You can use the SKEWED BY clause to create separate files for each row where a specified column value is in a list of specified values. Rows with values not listed are stored together in a separate single file.
- You can use the CLUSTERED BY clause to distribute data across a specified number of subfolders (described as buckets) based on hashes of the values of specified columns.

Partitioning your Hive tables is an easy way to ensure maximum query performance, especially if the data can be distributed evenly across the partitions (or buckets). It is particularly useful when you need to join tables, and the joining column is one of the partition columns. When you partition a table, the partitioning columns are not included in the main table schema part of the CREATE TABLE statement; instead, they must be included in a separate PARTITIONED BY clause. The partitioning columns can still, however, be referenced in SELECT queries. For example, the following HiveQL statement creates a table in which the data is partitioned by a string value named partcol1.
HiveQL
CREATE EXTERNAL TABLE mytable (col1 STRING, col2 INT)
PARTITIONED BY (partcol1 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/mydata/mytable';

When data is loaded into the table, subfolders are created for each partition column value. For example, you could load the following data into the table.

col1     col2  partcol1
ValueA1  1     A
ValueA2  2     A
ValueB1  2     B
ValueB2  3     B

After this data has been loaded into the table, the /mydata/mytable folder will contain a subfolder named partcol1=A and a subfolder named partcol1=B, and each subfolder will contain the data files for the values in the corresponding partitions. When you need to load data into a partitioned table you must include the partitioning column values. If you are loading a single partition at a time, and you know the partitioning value, you can specify explicit partitioning values as shown in the following HiveQL INSERT statement.

HiveQL
FROM staging_table s
INSERT INTO TABLE mytable PARTITION(partcol1='A')
SELECT s.col1, s.col2
WHERE s.col3 = 'A';

Alternatively, you can use dynamic partition allocation so that Hive creates new partitions as required by the values being inserted. To use this approach you must enable the non-strict option for the dynamic partition mode, as shown in the following code sample.

HiveQL
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
FROM staging_table s
INSERT INTO TABLE mytable PARTITION(partcol1)
SELECT s.col1, s.col2, s.col3;

Improving Query Performance with Indexes
If you intend to use HDInsight as a Hive-based data warehouse for Big Data analysis and reporting, you can improve the performance of HiveQL queries by implementing indexes in the Hive tables. Hive supports a pluggable architecture for index handlers, enabling developers to create custom indexing solutions.
By default, a single reference implementation of an index handler named CompactIndexHandler is included in HDInsight, and you can use this to index commonly queried columns in your Hive tables. The CompactIndexHandler reference implementation creates an index that stores the HDFS block address for each indexed value, and it improves the performance of queries when a table contains multiple adjacent rows with the same indexed value. To create an index you use the HiveQL CREATE INDEX statement and specify the index handler to be used, as shown in the following code sample.

HiveQL
CREATE INDEX idx_myindex ON TABLE mytable(col1)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

The DEFERRED REBUILD clause in the previous code causes the index to be created, but does not populate it with values. To build the index you use the ALTER INDEX statement, as shown in the following code sample. You should rebuild the index after each data load.

HiveQL
ALTER INDEX idx_myindex ON mytable REBUILD;

After the index has been built you can view the index files in the /hive/warehouse folder, where a folder is created for each index. By default, indexes are partitioned using the same partitioning columns as the tables on which they are based, and the folder structure for the index reflects this partitioning. For an example of using partitioned tables and indexes, see Scenario 4: Enterprise BI Integration in Chapter 11 of this guide.

Improving Performance by Optimizing the Query
Hive will automatically attempt to generate the most efficient code for the map and reduce components that are executed when you run the query. However, there are some ways that you can help Hive to generate the most performant code:
- Avoid using a GROUP BY statement in your queries if possible. This statement forces Hadoop to use a single instance of the reducer, which can cause a bottleneck in the query processing.
Instead, if the target for the results is a visualization and analysis tool (such as Microsoft Excel), use the features of that tool to group and sort the results. Alternatively, run a second query that just groups the results of the original query, so that it operates on a smaller dataset; you can then accurately determine the time the group operation takes and decide whether to include it or not.
- If you must use GROUP BY, consider using the mapred.reduce.tasks configuration setting to increase the number of reducers, which can offset the performance impact of the grouping operation.
- If you need to join tables in the query, consider whether you can use a MAP JOIN to minimize searches across different instances of the mapper. For more details, see the section Resolving Performance Limitations by Optimizing the Query later in this chapter.
- During development, prefix the query with the EXPLAIN keyword. Hive will generate and display the execution plan for the query, which shows the abstract syntax tree, the stage dependency graph, and details of each map/reduce stage in the query. This can help you to discover where optimization may improve performance.
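The MAP JOIN technique mentioned above works by loading the smaller table into memory on every mapper, so each node can join its own share of the larger table without a reduce phase. The idea can be sketched in Python as follows (illustrative only; the table names and contents are hypothetical):

```python
# Small dimension table, held entirely in memory on each mapper.
products = {"p1": "Widget", "p2": "Gadget"}

# Rows of the large table are streamed through the mapper one at a time.
sales = [("p1", 10), ("p2", 5), ("p1", 3)]

# Each streamed row is joined against the in-memory table; no reducer needed.
joined = [(pid, products[pid], qty) for pid, qty in sales if pid in products]
print(joined)  # [('p1', 'Widget', 10), ('p2', 'Gadget', 5), ('p1', 'Widget', 3)]
```

Because the join completes entirely in the map phase, the expensive shuffle and reduce steps are avoided; the trade-off is that the small table must fit in each node's memory.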
Processing Data with Pig
Pig is a query interface that provides a workflow semantic for processing data in HDInsight. Pig provides a powerful abstraction layer over map/reduce, and enables you to perform complex processing of your source data to generate output that is useful for analysis and reporting. As well as using it to call custom map/reduce code, you can use Pig statements to execute implicit map/reduce jobs that are generated by the Pig interpreter on your behalf.

Pig statements are expressed in a language named Pig Latin, and generally involve defining relations that contain data, either loaded from a source file or as the result of a Pig Latin expression on an existing relation. Relations can be thought of as result sets, and can be based on a schema (which you define in the Pig Latin statement used to create the relation) or can be completely unstructured. Pig Latin syntax has some similarities to SQL, and encapsulates many functions and expressions that make it easy to create a sequence of complex map/reduce jobs with just a few lines of simple code. However, the syntax of Pig Latin can be complex for non-programmers to master.

Pig Latin is not as familiar or as easy to use as HiveQL, but Pig can achieve some tasks that are difficult, or even impossible, when using Hive. Pig Latin is great for creating relations and manipulating sets, and for working with unstructured source data. You can always create a Hive table over the results of a Pig Latin query if you want table-format output.

You can run Pig Latin statements interactively in the interactive console or in a command line Pig shell named Grunt. You can also combine a sequence of Pig Latin statements into a script that can be executed as a single job, and use user-defined functions that you previously uploaded to HDInsight.
The Pig Latin statements are used to generate map/reduce jobs by the Pig interpreter, but the jobs are not actually generated and executed until you call either a DUMP statement (which is used to display a relation in the console, and is useful when interactively testing and debugging Pig Latin code) or a STORE statement (which is used to store a relation as a text file in a specified folder). You can build an automated solution that includes Pig queries by using the .NET SDK for HDInsight, and with a range of Hadoop-related tools such as Oozie and WebHCat. For more information about automating queries, see Chapter 6 of this guide.

Pig scripts generally save their results as text files in storage, where they can easily be viewed interactively; for example, in the HDInsight interactive JavaScript console. However, the results can be difficult to consume from client applications or processes unless you copy the output files and import them into client tools such as Microsoft Excel.

Pig is a good choice when you need to:
- Restructure source data by defining columns, grouping values, or converting columns to rows.
- Use a workflow-based approach to process data as a sequence of operations, which is often a logical way to approach many map/reduce tasks.
Executing a Pig Script
As an example of using Pig, suppose you have a tab-delimited text file containing source data similar to the following example.

Data
Value1	1
Value2	3
Value3	2
Value1	4
Value3	6
Value1	2
Value2	8
Value2	5

You could process the data in the source file with the following simple Pig Latin script.

Pig Latin
A = LOAD '/mydata/sourcedata.txt' USING PigStorage('\t') AS (col1, col2:long);
B = GROUP A BY col1;
C = FOREACH B GENERATE group, SUM(A.col2) as total;
D = ORDER C BY total;
STORE D INTO '/mydata/results';

This script loads the tab-delimited data into a relation named A, imposing a schema that consists of two columns: col1, which uses the default byte array data type, and col2, which is a long integer. The script then creates a relation named B in which the rows in A are grouped by col1, and then creates a relation named C in which the col2 value is aggregated for each group in B. After the data has been aggregated, the script creates a relation named D in which the data is sorted based on the total that has been generated. D is then stored as a file in the /mydata/results folder, which contains the following text.

Data
Value1	7
Value3	8
Value2	16

For more information about Pig Latin syntax, see Pig Latin Reference Manual 2 on the Apache Pig website. For a more detailed description of using Pig, see the Pig Tutorial.

Maximizing Query Performance in Pig
Pig supports a range of optimization rules. Some are mandatory and cannot be turned off, but you can include certain information in your code to adjust the behavior and fine-tune the query performance. Some of the ways that you can help Pig to generate the most performant code are:
- Reduce the amount of data in the query pipeline by filtering columns or splitting projection operations. Consider how you can minimize the volume of output from the map phase, such as removing null values and values that are not required from the output.
- Specify the most appropriate data types in your query.
For example, if you do not specify a data type for a numeric value, Pig will use the Double type. You may be able to reduce the size of the dataset by using an Integer or Long data type instead.
The Pig optimizer will attempt to combine operations into fewer statements, but it's good practice to do this yourself as you write the query. For example, avoid generating a set of values in one statement and passing the set to the next statement if you can generate the required values within the first statement. Also avoid creating a set of partial strings from a column in one statement and then passing the set to a FOREACH statement if you can calculate the value inside the FOREACH loop instead.
When joining datasets using a default join (not a replicated, merge, or skewed join), put the dataset containing the most items last in the JOIN statement. The data in the last dataset listed in a JOIN statement is streamed instead of being loaded into memory, which can reduce record spills to disk.
Use the LIMIT statement to reduce the number of items to be processed if you do not need all of the results. The query optimizer automatically applies the LIMIT operation at the earliest possible stage in the processing.
Use the DISTINCT statement in preference to GROUP BY and GENERATE where possible because it is much more efficient when extracting unique values from a dataset.
Consider using the multi-query feature of Pig to split the data into branches that can be processed in parallel as non-blocking tasks. This can reduce the amount of data that must be stored, and also allows you to generate multiple outputs from a single job. For more information, see Multi-query Performance on the Apache Pig website.
Consider specifying the number of reducers for a task using a PARALLEL clause. The default is one, and you may be able to improve performance by specifying a larger number.
A full discussion of optimization techniques is outside the scope of this guide, but useful lists are available in the Pig Cookbook and in the section Performance and Efficiency on the Apache Pig website. Also see the section Tuning the Solution later in this chapter, which describes a range of techniques for optimizing query performance through configuration.
Writing Map/Reduce Code
Map/reduce code consists of two functions: a mapper and a reducer. The mapper is run in parallel on multiple cluster nodes, each node applying it to its own subset of the data. The output from the mapper function on each node is then passed to the reducer function, which collates and summarizes the results of the mapper function. You can implement map/reduce code in Java (which is the native language for map/reduce jobs in all Hadoop distributions) or in a number of other supported languages, including JavaScript, Python, C#, and F#, through Hadoop Streaming. You can also build an automated solution that includes map/reduce jobs by using the .NET SDK for HDInsight, and with a range of Hadoop-related tools such as Oozie and WebHCat.
You don't need to know Java to use HDInsight. The high-level abstractions such as Hive and Pig will create the appropriate mapper and reducer components for you, or you can write map and reduce components in other languages that you are more familiar with.
In most HDInsight processing scenarios it is simpler and more efficient to use a higher-level abstraction such as Pig or Hive to generate the required map/reduce code on your behalf. However, you might want to consider creating your own map/reduce code when:
You want to process data that is completely unstructured by parsing it and using custom logic to obtain structured information from it.
You want to perform complex tasks that are difficult (or impossible) to express in Pig or Hive without resorting to creating a UDF; for example, using an external geocoding service to convert latitude and longitude coordinates or IP addresses in the source data to geographical location names.
You already have an algorithm that you want to apply to the data, or you want to reuse some of the same code (such as the same mapper or reducer component) in different queries.
You want to reuse your existing .NET, Python, or JavaScript code in map/reduce components. This requires you to use the Hadoop Streaming interface (described later in this chapter).
You want to be able to refine and exert full control over the query execution process, such as using a combiner in the map phase to reduce the size of the map process output.
For more information about using languages other than Java to write map/reduce components, and using the Hadoop Streaming interface, see the section Using Hadoop Streaming and .NET Languages later in this chapter. For more information about automating map/reduce queries, see Chapter 6 of this guide.
Creating Map and Reduce Components
The following code sample shows a commonly referenced JavaScript map/reduce example that counts the words in a source file that consists of unstructured text data.
JavaScript
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next());
  }
  context.write(key, sum);
};
The map function in this script splits the contents of the text input value into an array of strings (using anything that is not an alphabetic character as a delimiter). Each string in the array is then used as a key with an associated value of 1. Each key generated by the map function is passed to the reduce function along with its values, which are added together to determine the total number of times the key string was included in the source data. In this way, the two functions work together to create a list of the distinct strings in the source data with the total count of each string, as shown here.
Data
Aardvark 2
About 7
Above 12
Action 3
...
For more information about writing map/reduce code, see MapReduce Tutorial on the Apache Hadoop website.
Using Hadoop Streaming and .NET Languages
The Hadoop core within HDInsight supports a technology called Hadoop Streaming that allows you to interact with the map/reduce process and run your own code outside of the Hadoop core. This code replaces the map/reduce functionality that would usually be implemented using Java components.
You can use streaming when executing a custom map/reduce job or when executing jobs in Hive and Pig. Figure 1 shows an overview of the way that streaming works.
Figure 1
High level overview of Hadoop Streaming
When using Hadoop Streaming, each node passes the data for the map part of the process to the specified code or component as a stream, and accepts the results from the code as a stream, instead of internally invoking a mapper component written in Java. In the same way, the node(s) that execute the reduce process pass the data as a stream to the specified code or component, and accept the results from the code as a stream, instead of internally invoking a Java reducer component.
Streaming has the advantage of decoupling the map/reduce functions from the Hadoop core, allowing almost any type of component to be used to implement the mapper and the reducer. The only requirement is that the components must be able to read from and write to the standard input and output (stdin and stdout).
The main disadvantages of the streaming interface are its effect on performance, and the fact that the data must be in text form. The additional movement of the data over the streaming interface can slow overall query execution time quite considerably, so this approach is typically used only when there is no realistic alternative; at the time of writing, this means you may decide to avoid using non-Java languages to create map/reduce code unless there is no alternative. Streaming tends to be used mostly to enable the creation of map and reduce components in languages other than Java. This includes using .NET languages such as C# and F# with HDInsight.
For more details see Hadoop Streaming on the Apache website. For information about writing HDInsight map/reduce jobs in languages other than Java, see Hadoop on Azure C# Streaming Sample Tutorial and Hadoop Streaming Alternatives.
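To make the stdin/stdout contract concrete, the following is an illustrative Python word-count pair in the Hadoop Streaming style (the function names and file name in the comment are hypothetical, not part of any Hadoop API). The mapper emits one tab-separated key/value line per word; because Hadoop sorts the mapper output by key before the reduce phase, the reducer can total each key in a single pass over adjacent lines.

```python
def run_mapper(lines):
    """Map phase: read raw text lines (stdin in a real streaming job)
    and emit one tab-separated 'word<TAB>1' line per word (stdout)."""
    for line in lines:
        for token in line.split():
            word = ''.join(ch for ch in token.lower() if ch.isalpha())
            if word:
                yield word + '\t1'

def run_reducer(sorted_lines):
    """Reduce phase: equal keys arrive on adjacent lines (Hadoop sorts
    the mapper output by key), so one pass suffices to sum each key."""
    current_key, total = None, 0
    for line in sorted_lines:
        key, value = line.split('\t')
        if key != current_key and current_key is not None:
            yield current_key + '\t' + str(total)
            total = 0
        current_key = key
        total += int(value)
    if current_key is not None:
        yield current_key + '\t' + str(total)

# In a real streaming job these would be wired to stdin/stdout, e.g.
# (hypothetical file name):
#   hadoop jar hadoop-streaming.jar -mapper "python wordcount.py map" \
#       -reducer "python wordcount.py reduce" ...
```

The shell-style pipeline in the comment is a sketch of how such a script is typically invoked; only the stdin/stdout text protocol is dictated by Hadoop Streaming itself.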
Using Hadoop Streaming in HDInsight If you decide that you must write your map/reduce code using a .NET language such as C#, you can simplify the task by taking advantage of the .NET SDK for Hadoop. This contains a set of helper classes for creating map/reduce components, and simplifies the work required to connect all the required components with HDInsight when executing a job. There is also a set of classes in the MapReduce library that you can use to abstract many parts of the process. Figure 2 shows the relevant parts of the .NET SDK for Hadoop.
Figure 2
Using the .NET SDK for Hadoop to create non-Java map/reduce code
For information about using the SDK see Using the Hadoop .NET SDK with the HDInsight Service.
Maximizing Map/Reduce Code Performance
Many techniques for maximizing the runtime performance of map/reduce components are the same as for any application you create. These techniques include good-practice use of logical operators, comparison statements, and execution flow to create efficient code. Some of the other ways that you can maximize performance are:
Use Hadoop Streaming only when necessary, such as when writing map/reduce components using shell scripts or in languages other than Java. Hadoop Streaming typically reduces the overall performance of the job, though this may be an acceptable tradeoff if you need to use a language other than Java.
Minimize storage I/O, memory use, and network bandwidth load by controlling the volume of data that is generated at each stage during the map and reduce operations. For example, consider generating the minimum required volume of output from each mapper, such as removing null values or values that are not required from the output.
Where possible, create a combiner component that can be run on every map node to perform a preliminary aggregation of the results from that map process. You may be able to use the same reducer component as both a combiner and in the reduce phase of the job.
Choose the most appropriate data types in your code. Avoid Text types where possible, and use the most compact numeric types that suit your data (for example, use integer types instead of long types where appropriate). If you are using Sequence Files for your jobs, consider using the built-in Hadoop data types instead of the language-specific types. Hadoop types such as IntWritable and LongWritable typically provide better performance with Sequence Files, and the numeric types VIntWritable and VLongWritable automatically persist values using the smallest number of bytes.
Try to avoid creating new instances of objects or types (including the basic data types) within a loop or repeated section of code. Instead, create the instance outside of the repeated section and reuse it in all iterations of that section.
Avoid string manipulation and number formatting where possible because these operations are relatively slow in Hadoop. Use the StringBuffer class instead of string concatenation.
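The combiner suggestion above can be illustrated with a plain-Python simulation (this is not Hadoop code, and all names are hypothetical). Pre-aggregating on each map node shrinks the number of intermediate pairs that must cross the network during the shuffle, and the same aggregation logic can serve as both the combiner and the reducer:

```python
from collections import Counter

def map_phase(records):
    # Word-count style mapper output: one (key, 1) pair per word.
    return [(word, 1) for record in records for word in record.split()]

def combine(pairs):
    # Combiner: pre-aggregate on the map node before the shuffle.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())

def reduce_phase(pairs):
    # Reducer: the same aggregation logic, applied to combined output.
    return combine(pairs)

# Two simulated map nodes, each with its own input split.
node1 = map_phase(["big data big cluster"])
node2 = map_phase(["big data"])
# Without a combiner, 6 pairs would cross the network; with one, only 5.
shuffled = combine(node1) + combine(node2)
result = reduce_phase(shuffled)
```

The saving here is tiny, but on real data with many repeated keys the reduction in shuffled pairs can be substantial.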
Also see the section Tuning the Solution later in this chapter, which describes a range of techniques for optimizing query performance through configuration.
Choosing a Query Mechanism
The previous sections described how Hive, Pig, and custom map and reduce components can be used. The following summarizes the advantages and considerations for each one, and will help you to choose the approach most appropriate for your requirements.
Hive using HiveQL
Advantages: It uses a familiar SQL-like syntax. It can be used to produce persistent tables of data that can easily be partitioned and indexed. Multiple external tables and views can be created over the same data. It supports a simple data warehouse implementation.
Considerations: It requires the source data to have at least some identifiable structure. Even though it is highly optimized, it might not produce the most performant map/reduce code. It might not be able to carry out some types of complex processing tasks.
Pig using Pig Latin
Advantages: An excellent solution for manipulating data as sets, and for restructuring data by defining columns, grouping values, or converting columns to rows. It can use a workflow-based approach as a sequence of operations on data. It allows you to fine-tune the query to maximize performance or minimize server and network load.
Considerations: Pig Latin is less familiar and more difficult to use than HiveQL. The default output is usually a text file, rather than the table-format file generated by Hive, and so is more difficult to use with visualization tools such as Excel. It is not generally suited to ETL operations unless you layer a Hive table over the output.
Custom map/reduce components
Advantages: They provide full control over the map and reduce phases and execution. They allow queries to be optimized to achieve maximum performance from the cluster, or to minimize the load on the servers and the network. Map and reduce components can be written in a range of widely known languages with which developers are likely to be familiar.
Considerations: This approach is more difficult than using Pig or Hive because you must create your own map and reduce components. Processes that require the joining of sets of data are more difficult to achieve. Even though there are test frameworks available, debugging the code is more complex than for a normal application because it runs as a batch job under the control of the Hadoop job scheduler.
The following list shows some of the more general suggestions that will help you make the appropriate choice depending on the task requirements.
Table or dataset joins, or manipulating nested data: Pig and Hive.
Ad-hoc data analysis: Pig, Hadoop Streaming.
SQL-like data analysis and data warehousing: Hive.
Working with binary data or Sequence Files: Java map/reduce components.
Working with existing Java or map/reduce libraries: Java map/reduce components.
Maximum performance for large or recurring jobs: Well-designed Java map/reduce components.
Using scripting or non-Java languages: Hadoop Streaming.
User Defined Functions
Developers often find that they reuse the same code in several locations, and the typical way to optimize this is to create a user-defined function (UDF) that can be imported into other projects when required. Often a series of UDFs that accomplish related tasks are packaged together in a library so that the library can be imported into a project, and code can take advantage of any of the UDFs it contains. You can create individual UDFs and UDF libraries for use with HDInsight queries and transformations. Typically these are written in Java and compiled, although you can write UDFs in Python to use with Pig (but not with Hive or custom map/reduce code) if you prefer.
The compiled UDF is uploaded to HDInsight and can be referenced and used in a Hive or Pig script, or in your map/reduce code; however, UDFs are most commonly used in Hive and Pig scripts. Python can be used to create UDFs, but these can only be used in Pig, and the process for creating and using them differs from that for UDFs written in Java. UDFs can be used not only to centralize code for reuse, but also to perform tasks that are difficult (or even impossible) in the Hive and Pig scripting languages. For example, a UDF could perform complex validation of values, concatenation of column values based on complex conditions or formats, aggregation of rows, replacement of specific values with nulls to prevent errors when processing bad records, and much more.
Creating and Using UDFs in Hive Scripts
In Hive, UDFs are often referred to as plugins. You can create three different types of Hive plugin. A standard UDF is used to perform operations on a single row of data, such as transforming a single input value into a single output value. A table generating function (UDTF) is used to perform operations on a single row, but can produce multiple output rows. An aggregating function (UDAF) is used to perform operations on multiple rows and it outputs a single row. Each type of UDF extends a specific Hive class in Java. For example, a standard UDF must extend the built-in Hive class UDF or GenericUDF and accept text string values (which can be column names or specific text strings). As a simple example, you can create a UDF class named YourUdfName (Java package and class names cannot contain hyphens) as shown in the following code.
Java
package yourpackage.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
public final class YourUdfName extends UDF {
  public Text evaluate(final Text s) {
    // implementation here
    // return the result
  }
}
The body of the UDF, the evaluate function, accepts one or more string values. Hive passes the values of columns in the dataset to these parameters at runtime, and the UDF generates a result. This might be a text string that is returned within the dataset, or a Boolean value if the UDF performs a comparison test against the values in the parameters. The arguments must be types that Hive can serialize. After you create and compile the UDF, you upload it to HDInsight using the interactive console or a script with the following command.
Interactive console command
add jar /path/your-udf-name.jar;
Next you must register it using a command such as the following.
Hive script
CREATE TEMPORARY FUNCTION local-function-name AS 'package-name.function-name';
You can then use the UDF in your Hive query or transformation. For example, if the UDF returns a text string value you can use it as shown in the following code to replace the value in the specified column of the dataset with the value generated by the UDF.
Hive script
SELECT your-udf(column-name) FROM your-data
If the UDF performs a simple task such as reversing the characters in a string, the result would be a dataset where the value in the specified column of every row has its contents reversed. Note that the registration only makes the UDF available for the current session, and you will need to re-register it each time you connect to Hive.
For more information about creating and using standard UDFs in Hive, see HivePlugins on the Apache website. For more information about creating different types of Hive UDF, see User Defined Functions in Hive, Three Little Hive UDFs: Part 1, Three Little Hive UDFs: Part 2, and Three Little Hive UDFs: Part 3 on the Oracle website.
Creating and Using UDFs in Pig Scripts
In Pig you can also create different types of UDF.
The most common is an evaluation function that extends the class EvalFunc. The function must accept an input of type Tuple, and return the required result (which can be null). The following outline shows a simple example.
Java
package yourudfpackage;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class YourUdfName extends EvalFunc<String> {
  public String exec(Tuple input) {
    // implementation here
    // return the result
  }
}
As with a Hive UDF, after you create and compile the UDF you upload it to HDInsight using the interactive console or a script with the following command.
Interactive console command
add jar /path/your-udf-name.jar;
Alternatively, you can just copy your compiled UDF into the pig/bin folder on the cluster through a Remote Desktop connection. The REGISTER command at the start of a Pig script will then make the UDF available. You can now use the UDF in your Pig queries and transformations. For example, if the UDF returns the lower-cased equivalent of the input string, you can use it as shown in this query to generate a list of lower-cased equivalents of the text strings in the first column of the input data.
Pig script
REGISTER 'your-udf-name.jar';
A = LOAD 'your-data' AS (column1: chararray, column2: int);
B = FOREACH A GENERATE your-udf-name.function-name(column1);
DUMP B;
A second type of UDF in Pig is a filter function that you can use to filter data. A filter function must extend the class FilterFunc, accept one or more values as Tuples, and return a Boolean value. The UDF can then be used to filter rows based on values in a specified column of the dataset. For example, if a UDF named IsShortString returns true for any input value less than five characters in length, you could use the following script to remove any rows where the first column contains a value less than five characters long.
Pig script
REGISTER 'your-udf-name.jar';
A = LOAD 'your-data' AS (column1: chararray, column2: int);
B = FILTER A BY NOT IsShortString(column1);
DUMP B;
For more information about creating and using UDFs in Pig, see the Pig UDF Manual on the Apache Pig website.
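As noted earlier, Pig UDFs can also be written in Python and executed through Pig's Jython support. The following is a minimal sketch of a Python function equivalent to the hypothetical IsShortString filter; the pig_util import and the outputSchema annotation follow the pattern used for Pig Python UDFs, and the fallback decorator exists only so the function can also run (and be tested) outside Pig.

```python
# When run inside Pig, pig_util provides the outputSchema decorator that
# tells Pig the return type. Outside Pig (e.g. in unit tests) fall back
# to a no-op decorator so the module still imports cleanly.
try:
    from pig_util import outputSchema
except ImportError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

@outputSchema('is_short:boolean')
def is_short_string(value):
    """Return True when the input chararray is less than five characters
    long (nulls are treated as short)."""
    if value is None:
        return True
    return len(value) < 5
```

Under Pig, a script along the lines of REGISTER 'myudfs.py' USING jython AS myudfs; followed by B = FILTER A BY NOT myudfs.is_short_string(column1); would apply it (the file and alias names here are hypothetical).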
Unifying and Stabilizing Query Jobs with HCatalog So far in this chapter you have seen how to use Hive, Pig, and custom map/reduce code to process data in an HDInsight cluster. In each case you use code to project a schema onto data that is stored in a particular location, and then apply the required logic to filter, transform, summarize, or otherwise process the data to generate the required results. The code must load the source data from wherever it is stored, and convert it from its current format to the required schema. This means that each script must include assumptions about the location and format of the source data. These assumptions create dependencies that can cause your scripts to break if an administrator chooses to change the location, format, or schema of the source data. Additionally, each processing interface (Hive, Pig, or custom map/reduce) requires its own definition of the source data, and so complex data processes that involve multiple steps in different interfaces require consistent definitions of the data to be maintained across multiple scripts. HCatalog provides a tabular abstraction layer that helps unify the way that data is interpreted across processing interfaces, and provides a consistent way for data to be loaded and stored, regardless of the specific processing interface being used. This makes it easier to create complex, multi-step data processing solutions that enable you to operate on the same data by using Hive, Pig, or custom map/reduce code without having to handle storage details in each script.
HCatalog doesn't change the way scripts and queries work. It just abstracts the details of the data file location and the schema so that your code becomes less fragile because it has fewer dependencies, and the solution is also much easier to administer.
An overview of the use of HCatalog as a storage abstraction layer is shown in Figure 3.
Figure 3 Unifying different processing mechanisms with HCatalog
Understanding the Benefits of Using HCatalog To understand the benefits of using HCatalog, consider the typical scenario shown in Figure 4.
Figure 4
An example solution that benefits from using HCatalog
In the example, Hive scripts create the metadata definition for two tables that have different schemas. The table named mydata is created over some source data uploaded as a file to HDInsight, and this defines a Hive location for the table data (step 1 in Figure 4). Next, a Pig script reads the data defined in the mydata table, summarizes it, and stores it back in the second table named mysummary (steps 2 and 3). However, in reality, the Pig script does not access the Hive tables (which are just metadata definitions). It must access the source data file in storage, and write the summarized result back into storage, as shown by the dotted arrow in Figure 4. In Hive, the path or location of these two files is (by default) denoted by the name used when the tables were created. For example, the following HiveQL shows the definition of the two Hive tables in this scenario, and the code that loads a data file into the mydata table.
HiveQL
CREATE TABLE mydata (col1 STRING, col2 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA INPATH '/mydata/data.txt' INTO TABLE mydata;
CREATE TABLE mysummary (col1 STRING, col2 BIGINT);
Hive scripts can now access the data simply by using the table names mydata and mysummary. Users can use HiveQL to query the Hive table without needing to know anything about the underlying file location or the data format of that file. However, the Pig script that will group and aggregate the data, and store the results in the mysummary table, must know both the location and the data format of the files. Without HCatalog, the script must specify the full path to the mydata table source file, and be aware of the source schema in order to apply an appropriate schema (which must be defined in the script). In addition, after the processing is complete, the Pig script must specify the location associated with the mysummary table when storing the result back in storage, as shown in the following code sample.
Pig Latin
A = LOAD '/mydata/data.txt' USING PigStorage('\t') AS (col1, col2:long);
...
STORE X INTO '/mysummary/data.txt';
The file locations, the source format, and the schema are hard-coded in the script, creating some potentially problematic dependencies. For example, if an administrator moves the data files, or changes the format by adding a column, the Pig script will fail. Using HCatalog removes these dependencies by enabling Pig to use the Hive metadata for the tables. To use HCatalog with Pig you must specify the useHCatalog parameter when starting the Grunt client interface tool, and the path to the HCatalog installation files must be registered as an environment variable named HCAT_HOME. For example, you could use the following command line statements to launch the Grunt interface.
Command line to start Pig
SET HCAT_HOME=C:\apps\dist\hcatalog-0.4.1
pig -useHCatalog
With the HCatalog support loaded you can now use the HCatLoader and HCatStorer objects in the Pig script, enabling you to access the data through the Hive metadata instead of requiring direct access to the data file storage.
Pig Latin
A = LOAD 'mydata' USING org.apache.hcatalog.pig.HCatLoader();
...
STORE X INTO 'mysummary' USING org.apache.hcatalog.pig.HCatStorer();
HCatalog is built on top of the Hive metastore and uses the Hive data definition language (DDL) to define tables that can be read from and written to by Pig and map/reduce code through the HCatLoader and HCatStorer classes. The script stores the summarized data in the location denoted for the mysummary table defined in Hive, and so it can be queried using HiveQL as shown in the following example.
HiveQL
SELECT * FROM mysummary;
Deciding Whether to Adopt HCatalog
The following factors will help you decide whether to incorporate HCatalog in your HDInsight solution:
It makes it easy to abstract the data storage location, format, and schema from the code used to process it.
It minimizes fragile dependencies between scripts in complex data processing solutions where the same data is processed by multiple tasks.
It is easy to incorporate into solutions that include Pig scripts, requiring very little extra code.
Additional effort is required to use HCatalog in custom map/reduce components because you must create your own custom load and store functions.
If you use only Hive scripts and queries, or you are creating a one-shot solution for experimentation purposes and do not intend to use it again, HCatalog is unlikely to provide any benefit.
Evaluating the Results
After you have used HDInsight to process the source data, you can use the data for analysis and reporting, which forms the foundation for business decision making. However, before making critical business decisions based on the results you must carefully evaluate them to ensure that they are:
Meaningful. The results, when combined and analyzed, make sense, and the data values relate to one another in a meaningful way.
Accurate. The results are correct, with no errors introduced by calculations or transformations applied during processing.
Useful. The results are applicable to the business decision they will support, and provide relevant metrics that help inform the decision making process.
You will often need to employ the services of a business user who intimately understands the business context for the data to act as a data steward and sanity-check the results to determine whether or not they fall within expected parameters. Depending on the complexity of the processing, you might also need to select a number of data inputs for spot-checking and trace them through the process to ensure that they produce the expected outcome.
When you are planning to use HDInsight to perform predictive analysis, it can be useful to evaluate the process against known values. For example, if your goal is to use demographic and historical sales data to determine the likely revenue for a proposed retail store, you can validate the processing model by using appropriate source data to predict revenue for an existing store and comparing the resulting prediction to the actual revenue value. If the prediction varies significantly from the actual revenue, it seems unlikely that the results for the proposed store will be reliable.
It may not be possible to validate all of the source data for a query, especially if it is collected from external sources such as social media sites. A good way to measure the validity of the result is to perform the same query and analysis to predict an already known outcome, such as sales trends over an existing sales region, and then compare the result with reality. If there is a reasonable match, you can generally assume that your hypothesis is valid.
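The existing-store validation idea can be reduced to a simple check. The following Python sketch is purely illustrative; the tolerance value and the revenue figures are invented for the example, not taken from the guide.

```python
def within_tolerance(predicted, actual, tolerance=0.1):
    """Return True when the prediction is within `tolerance` (as a
    fractional relative error) of the known actual value."""
    if actual == 0:
        return predicted == 0
    return abs(predicted - actual) / abs(actual) <= tolerance

# Validate the model against an existing store before trusting its
# prediction for the proposed store (figures are illustrative).
known_actual_revenue = 1_200_000
model_prediction = 1_150_000
model_is_credible = within_tolerance(model_prediction, known_actual_revenue)
```

What counts as an acceptable tolerance is a business decision for the data steward, not a technical one.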
Viewing Results
To evaluate the results of your HDInsight data processing solution, you must be able to view them. The way that you do this depends on the kind of result produced by your queries. For example:
If the result is a dataset containing rows with multiple values, such as the output from a Hive query, you will probably need to import the data into a suitable visualization tool such as Microsoft Excel in order to check the validity of the result. Chapter 5 of this guide explains the many options available for visualizing and analyzing datasets such as this.
If the result is a single value, such as a count of some specific criterion in the source data, it is easy to view by using the #cat command in the interactive console, or by opening the result file generated by your queries in a text editor.
If the result is a set of key/value pairs, you can create a simple chart within the interactive console by loading the file data and using the JavaScript graph library to visualize it. The following code shows how you can create a pie chart using this approach.
Interactive JavaScript
file = fs.read("./results/data")
data = parse(file.data, "col1, col2:long")
graph.pie(data);
These commands read the contents of the files in the ./results/data folder into a variable, parse the variable into two columns (the first using the default chararray data type, the second containing a long integer), and display the data as a pie chart in the console; as shown in Figure 5.
Figure 5 Viewing Data in the Interactive Console
Tuning the Solution After you have built an HDInsight solution for Big Data processing, you can monitor its performance and start to find ways in which you can tune the solution to reduce query processing time or hardware resource utilization. Irrespective of the way that you construct a query, the language and tools you use, and the way that you submit the query to HDInsight, the result is execution within the Hadoop core of two Java components: the mapper and the reducer. Figure 6 shows the stages of the overall process that occur on a data node in the cluster when it executes both the map and reduce phase of the query (not all servers will execute the reduce phase).
Figure 6
The stages of the map and reduce phase on a data node in an HDInsight cluster

The stages shown in Figure 6 are as follows:
1. The Hadoop engine on each data node server in the cluster loads the data from its local store into memory and splits it into segments.
2. The mapper component executes on each data node server. It scans the data segments and generates output that it stores in another memory buffer.
3. As the map process on each data node server completes, the Hadoop engine sorts and partitions these intermediate results and writes them to storage.
4. The name node server determines which nodes will execute the reduce algorithm. On each of the selected servers the Hadoop engine loads the intermediate results from storage into a memory buffer, combines them with the intermediate results from other servers in the cluster that are not executing the reduce phase, shuffles and sorts them, and stores them in another memory buffer.
5. The reducer component on the selected data node servers executes the reduce algorithm, and then writes the results to an output file.
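The stages above can be sketched in miniature. The following Python word-count example is a stand-in for the Java mapper and reducer that Hadoop actually runs; the function names and sample data are illustrative only, and the real shuffle is distributed across the network rather than a local sort:

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: scan each input segment and emit intermediate key/value pairs.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    # Hadoop sorts and partitions the intermediate results by key
    # before handing them to the reducer.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reducer(grouped):
    # Reduce phase: aggregate all values emitted for each key.
    for key, group in grouped:
        yield (key, sum(v for _, v in group))

data = ["big data big", "data node"]
result = dict(reducer(shuffle_and_sort(mapper(data))))
print(result)  # {'big': 2, 'data': 2, 'node': 1}
```

The sort between the two phases is why memory and storage I/O figure so prominently in the tuning discussion that follows.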
It's clear from this that the performance of the cluster when executing a query depends on several factors:
The efficiency of the storage system and its connectivity to the server. This affects the rate at which data can be read from storage into memory, and results written back to storage.
Memory availability. The process requires a number of in-memory buffers to hold intermediate results. If there is not enough physical memory available, rows are spilled to disk for temporary storage while the query executes.
The speed of the CPU. The rate at which the mapper and reducer components, and the Hadoop engine, can execute affects the total processing time for the query.
Network bandwidth and latency between the data node servers in the cluster. During the reduce phase the output from the mappers must be moved to the server(s) that are executing the reducer component.
These four factors can be summarized as storage I/O, memory, computation, and network bandwidth. Any of these could cause a bottleneck in the execution of a query, though typically storage I/O is the most common. However, you cannot assume this is the case. If you intend to execute a query on a repeatable basis, and the performance is causing concern, you should attempt to tune the performance of the query by taking advantage of the comprehensive logging and monitoring capabilities of Hadoop and HDInsight.
Keep in mind that Hadoop is not designed to be a high performance query engine. Its strength is in its scalability and the consequent ability to handle huge datasets. Unlike traditional query mechanisms, such as relational database systems, the design of Hadoop allows almost infinite scale out with near linear gain in performance as nodes are added.
Typically, tuning a query is an iterative process:
1. Use the log files generated by HDInsight to discover how long each stage of the process takes, and identify the part of the process that is most likely to cause a bottleneck.
2. Take steps to remove or minimize the impact of the bottleneck you find. You should attempt to rectify only a single bottleneck at a time; otherwise you may not be able to measure the effects of the specific change you make.
3. Run the query again and use the log files to determine if the change you made has removed or reduced the impact of the bottleneck. If not, try making a different change and repeat the process.
4. If the changes you made have removed or minimized the bottleneck, look in the log files for any other areas that are limiting the efficiency of the process. If you identify another bottleneck, repeat the entire process to remove or minimize this one.
5. Continue until you achieve acceptable query processing performance, or until you have minimized all of the bottlenecks that you can identify.
There is no point in optimizing or performance tuning a query if it's a one-off that you don't intend to execute again, or if the current performance is acceptable. Optimization is a trade-off between the business value of a gain in performance and the cost and time it takes to carry out the tuning. You will probably find that a small amount of effort brings large initial gains (the low hanging fruit), but additional effort gives increasingly less performance gain with each iteration.

Locating Bottlenecks in Query Execution

Performance limitations can arise in any of the four main areas of storage I/O, memory, computation, and network bandwidth. In addition, each of these areas comes into play at several stages of the query process, as you saw in Figure 6. For example, it's possible that insufficient memory in the server could cause rows to be spilled to disk rather than all held in memory during the map phase. However, the moderate reduction of storage I/O delay provided by using faster disks with a larger onboard cache will probably provide much less performance gain than adding more memory to the server so that rows do not need to be spilled. It's also possible that the existing hardware can execute the query much more quickly if you can optimize the map and reduce components so that, for example, fewer intermediate rows need to be buffered during the map phase, which would remove the requirement to spill rows to disk with the existing memory capacity of the servers. You'll see more information about optimizing query design later in this chapter. After you identify a bottleneck, and before you start modifying configuration properties or extending the hardware by adding more nodes, think about how you might be able to optimize the query to remove or minimize the bottleneck. For example, if you write the query in a high-level language such as HiveQL you may be able to boost performance by changing the query or by splitting it into two separate queries.
Accessing Performance Information

Most of the performance information you will use to monitor query performance is available from two portals that are part of the Hadoop system. The Job Tracker portal is accessed on port 50030 of your cluster's head node. The Health and Status Monitor is accessed on port 50070 of your cluster's head node. The easiest way to use them is to open a Remote Desktop connection to the cluster from the main Windows Azure HDInsight portal. The link named Hadoop MapReduce Status located on the desktop of the remote virtual server provides access to the Job Tracker portal, while the link named Hadoop Name Node Status opens the Health and Status Monitor portal. To view the counters for a job in the Job Tracker portal, scroll to the Completed Jobs section of the Map/Reduce Administration page and select the job ID. This opens the History Viewer page for that job, which displays a wealth of information. It includes the number of mapper and reducer tasks, and full details of the job counters, input and output bytes, file system reads and writes, and details of the map/reduce process. For any job you select in the History Viewer page you can view all the configuration settings that job used by selecting the link to the job configuration XML file. In addition to the combined information from all of the cluster nodes for a job, you can view history details for a job on each individual cluster node. In the Map/Reduce Administration page of the Job Tracker select the link in the Nodes column of the table that contains the Cluster Summary, and then select the node you are interested in. This opens the local Job Tracker portal on that cluster node. For more information about the configuration files in HDInsight, and how you set configuration property values, see the section Configuring HDInsight Clusters and Jobs in Chapter 7 of this guide.
Identifying Storage I/O Limitations

Storage I/O performance on data node servers is often the most likely candidate for limiting query performance, due simply to the physical nature of the storage mechanism. Data must be loaded from and written to storage in each map phase, together with any spilled rows, and this is usually where problems arise when there is a huge volume of data to process. Data is also loaded from and written to storage during each reduce phase, although this is usually a much smaller volume of data. Storage I/O can also be an issue on the head node of a cluster if cluster configuration and management data must be repeatedly read from disk, and updates written to disk. This may occur when there is a shortage of RAM on the head node, or there are a great many data file segments to manage. Windows Azure HDInsight uses solid-state drives (SSDs) on the head node to maximize cluster management performance. To detect when storage I/O performance might slow query execution, examine the values of the ASV_BYTES_READ and HDFS_BYTES_READ counters in the job counters log for the map phase (in the History Viewer page for the job in the Job Tracker portal). These counters show the number of bytes loaded into memory at the start of the map phase, and indicate the amount of data that was loaded by the cluster at the start of query execution. The values of the FILE_BYTES_WRITTEN counter for the map phase and the FILE_BYTES_READ counter for the reduce phase indicate the amount of data that was written to disk at the end of the map phase and loaded from disk at the start of the reduce phase. The values of the HDFS_BYTES_WRITTEN counters in the job counters log for the reduce phase indicate the amount of data that was written to disk after the query completed. To minimize the effect of storage I/O limitations, consider optimizing the query, compressing the data, and adjusting the memory configuration of the cluster servers.
See the later sections of this chapter for more details.

Identifying Memory Limitations

When you configure an HDInsight cluster, the amount of RAM available to Hadoop on each server is set to a default value that you do not need to explicitly specify. However, a shortage of RAM can slow query execution by causing records to be spilled to disk when the memory buffers are full. The additional storage I/O operations are likely to severely affect query execution times. In addition, the head node must have sufficient RAM to hold the lookup tables and other data associated with managing the cluster and the query processes on each server. To detect when a memory shortage might slow query execution, examine the values of the counters FILE_BYTES_READ, FILE_BYTES_WRITTEN, and Spilled Records in the job counters log (in the History Viewer page for the job in the Job Tracker portal). The value of the FILE_BYTES_WRITTEN counter for the map phase should be approximately equal to the value of FILE_BYTES_READ during the reduce phase, indicating that there was no overflow of the memory buffer. It's less common for spilled records to occur from a reduce task, but it may happen if the input to the reduce phase is very large. The same job counters as you use for the map phase provide values for the reduce phase as well. The Windows performance counters for memory usage on each server provide a useful indication of memory limitations as the query executes. You may also see OutOfMemory errors generated as the query runs, and in the results log, when there is an extreme shortage of memory. To minimize the effect of memory limitations, consider adjusting the configuration of the cluster. See the later sections of this chapter for more details.

Identifying Computation Limitations

Modern CPUs are so much faster than peripheral equipment such as disk storage and (to a lesser extent) memory that Hadoop query execution speed is unlikely to be limited by CPU overload.
It's more likely that CPUs on the cluster servers are underutilized, though this obviously depends on the actual CPU in use. It's possible to configure the number of map and reduce task instances that Hadoop will execute for a query by setting the mapred.map.tasks and mapred.reduce.tasks configuration properties. If there are no other bottlenecks, such as storage I/O or memory, that are limiting the performance of the query, you may be able to reduce the execution time by specifying additional instances of map and reduce tasks within the cluster. The Map/Reduce Administration page of the Job Tracker portal shows the total number of slots available for instances of map and reduce tasks. The History Viewer page for each job shows the actual number of instances of each task. Note that, if you are using Hive to execute the query and that query contains an ORDER BY clause, only one reducer task instance will be executed by default. Underutilization of the CPU can also occur when you use more than one reducer task instance and the distribution of data between them is uneven. This occurs when some map tasks create much larger intermediate result sets than others. To identify this condition, view the job counters for the job in the History Viewer page on each individual cluster node, and compare the values of the Reduce input records counters. To minimize the effect of computation limitations, consider adjusting the configuration of the cluster and increasing the number of data nodes in the cluster. See the later sections of this chapter for more details.

Identifying Network Bandwidth Limitations

Hadoop must pass data and information between nodes as the query executes. For example, the head node passes instructions to each data node to manage and synchronize the data movement operations that occur on each server, and to initiate execution of the map and reduce phases.
In addition, the intermediate data from the map phase on each server must pass to the selected server(s) that will execute the reduce phase. You can determine the number of input records for the reduce phase on each server in the cluster by examining the value of the Reduce input records counter for each job (in the History Viewer page for the job in the Job Tracker portal). To minimize the effect of network bandwidth limitations, consider compressing the data and optimizing the query. See the following sections of this chapter for more details.

Overview of Tuning Techniques

The following table summarizes the major performance issues that you might detect for a query, and provides a pointer to the type of solution you can apply to resolve these issues. The subsequent sections of this chapter provide more information on how you can apply the techniques shown in the table.

Bottleneck or issue: Massive I/O caused by a large volume of source data.
Tuning techniques: Compress the source data.

Bottleneck or issue: Massive I/O caused by a large number of spilled records during the map phase.
Tuning techniques: Refine the query to minimize the map output, or allocate additional memory for the buffer used for sorting records in the map phase.

Bottleneck or issue: Massive I/O or network traffic caused by a large volume of data generated by the map phase.
Tuning techniques: Refine the query to minimize the map output, compress the map output, or implement a combiner that reduces the volume of the map output phase.

Bottleneck or issue: Massive I/O and network traffic caused by a large volume of data output from the final reduce phase.
Tuning techniques: Compress the job output or reduce the replication factor for the cluster (the number of times data is replicated). Additionally, for a Hive query, compress the intermediate output from the map phase.

Bottleneck or issue: Insufficient running tasks caused by improper configuration.
Tuning techniques: Increase the number of map/reduce tasks and slots to take full advantage of the available slots on each data node.

Bottleneck or issue: Insufficient memory causing the job to hang.
Tuning techniques: Adjust the memory allocation for the tasks.

Bottleneck or issue: Uneven distribution of load on the reducers causing one or more to run much longer than the rest.
Tuning techniques: Refine the query to reduce the skewing of the results from the map phase on each machine to provide more even data distribution. For a Hive query, enable the automatic Map-Join feature.

Bottleneck or issue: A Hive query containing an ORDER BY clause forces use of a single reducer.
Tuning techniques: Avoid global sorting by removing the ORDER BY clause and then sort the data using another query, or write the query using Pig Latin (which is much more efficient for global sorting).
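Several of these diagnoses depend on comparing job counter values from the History Viewer page. As a rough illustration, the following Python sketch applies the memory-shortage heuristic described earlier (map-phase FILE_BYTES_WRITTEN roughly equal to reduce-phase FILE_BYTES_READ indicates no buffer overflow); the function name, sample values, and the 10% tolerance are invented for illustration and are not part of HDInsight:

```python
def memory_pressure_indicators(counters):
    # If the map phase wrote noticeably more to local files than the reduce
    # phase read back, records were spilled to disk more than once, which
    # suggests the map-side sort buffer was too small.
    map_written = counters["map"]["FILE_BYTES_WRITTEN"]
    reduce_read = counters["reduce"]["FILE_BYTES_READ"]
    extra_spill_bytes = max(0, map_written - reduce_read)
    return {
        "extra_spill_bytes": extra_spill_bytes,
        "likely_memory_bound": extra_spill_bytes > 0.1 * reduce_read,
    }

# Hypothetical values as they might appear in the job counters log.
sample = {
    "map": {"FILE_BYTES_WRITTEN": 1_500_000_000},
    "reduce": {"FILE_BYTES_READ": 1_000_000_000},
}
result = memory_pressure_indicators(sample)
print(result)  # flags this job as a candidate for more sort-buffer memory
```

In practice you would read these values from the Job Tracker portal rather than hard-coding them, and confirm the diagnosis with the Spilled Records counter before changing any configuration.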
Resolving Performance Limitations by Optimizing the Query

Performance tuning a query is often as much about the way that the query is written as it is about the configuration and capabilities of the cluster software and hardware. It's possible, if the query is shown to be overloading parts of the hardware, that you can split it into two or more queries and persist the intermediate results between them to minimize memory usage. You may even be able to store summarized or filtered intermediate results in a form that allows them to be reused by more than one query, giving an overall gain in performance over several queries even if one takes a little longer due to storage I/O limitations. If you are using custom map and reduce components, you might consider implementing a combiner class as part of the reducer component. The combiner aggregates the output from the map phase before it passes to the reducer, which means that the reducer will operate more efficiently because the dataset is smaller. You can also use the reducer as a combiner. This means that you effectively perform the reduce phase on each data node after the map operation completes, before Hadoop collects the output from each data node and submits it to the reducer on the selected data node(s) for the final reduce phase. The values of the Combine input records and Combine output records job counters (in the History Viewer page for the job in the Job Tracker portal) show the number of records passed into and out of the combiner, which indicates how efficient the combiner is in reducing the size of the dataset passed to the reducer function. For information about creating a combiner, see the Hadoop Map/Reduce Tutorial in the Apache online documentation. In most HiveQL queries the map job is performed by multiple mappers, and the reduce job is performed by a single reducer.
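The record-count reduction that a combiner provides can be illustrated with a small Python sketch. This is illustrative only: in a real Hadoop job the combiner is a Java class (often the reducer class itself) configured on the job, not a standalone function.

```python
from collections import defaultdict

def combine(intermediate):
    # Running the reduce logic locally on each data node (the combiner role)
    # collapses the map output before it crosses the network.
    totals = defaultdict(int)
    for key, value in intermediate:
        totals[key] += value
    return list(totals.items())

# Map output on one data node: one record per occurrence of each key.
map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)]
combined = combine(map_output)
print(len(map_output), "records in,", len(combined), "records out")
# 5 records in, 2 records out
```

The ratio of records in to records out here corresponds to the Combine input records and Combine output records job counters mentioned above; the closer the output count is to the number of distinct keys, the more work the combiner is saving the final reduce phase.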
However, you can control the number of reducers by setting the mapred.reduce.tasks property to the number of reducers you want to use, or by setting the hive.exec.reducers.bytes.per.reducer property to control the maximum number of bytes that can be processed by a single reducer. If you are using a HiveQL query that contains an ORDER BY clause, consider how you might be able to remove this clause to allow Hadoop to execute more than one reducer instance. You may be able to sort the results using a separate HDInsight query, or export the data and then sort it afterwards. HiveQL also supports joins between tables, but these can be very slow for large datasets. Instead, if one of the tables is relatively small (such as a lookup table), you can use a MAPJOIN operation to perform the join in each mapper instance so that the join no longer takes place in the reduce phase. You can also enable the Auto Map-Join feature by setting the value of the Hadoop configuration property hive.auto.convert.join to true, and specifying the size of the largest table that will be treated as a small table in the hive.smalltable.filesize property. You can set these properties dynamically for a specific query as shown in the following code sample.

HiveQL
SET hive.auto.convert.join=true;
SET hive.smalltable.filesize=25000000;
SELECT b.col2, s.col2 FROM bigtable b JOIN smalltable s ON b.col1 = s.col1;

Alternatively, you can edit these properties in the configuration file hive-site.xml in the apps\dist\hive-[version]\conf\ folder. For more information about the MAPJOIN operator see Hive Joins in the Apache documentation. For information about editing configuration settings, see the section Configuring HDInsight Clusters and Jobs earlier in this chapter. The format of the files you create for use in your queries can also have an impact on the performance of jobs. By using Sequence Files instead of text files you can often improve the efficiency and reduce the execution time for jobs. Sequence Files support the built-in Hadoop data types such as VIntWritable and VLongWritable that minimize memory usage and provide good write performance. Sequence Files are also splittable, which better supports parallel processing on multiple map nodes, and they work well with block-level compression between the stages of a map/reduce job. If you have a chain of jobs you can typically output the data from the first job and any intermediate jobs in Sequence File format, and then generate the final output for storage in an appropriate format such as a text file.

Resolving Performance Limitations by Compressing the Data

Bottlenecks caused by storage I/O limitations can be minimized in two ways. You can optimize the query to reduce the amount of data being written to and read from disk in the map and reduce phases, as described in the previous section of this chapter. However, if this is not possible, consider compressing the data to reduce the number of bytes that must be written and read. Data compression will increase CPU load as the data is decompressed and compressed at each stage, but this is typically not an issue because storage I/O limitations are more likely than computation limitations on modern servers.
You can compress the source data, the map output data, and the job output data (the output from the reduce phase). You can compress data that is used by a query by applying a codec. HDInsight provides three different codecs: DEFLATE, GZip, and Bzip2. The largest gains in performance are likely to come from compressing the source data (which also reduces the storage space used) and the map output data, because these are the biggest datasets. Compressing the map output data can also reduce the load on the network if investigation of the job counters and Windows performance counters indicates that a network bandwidth limitation is slowing query execution at the start of the reduce phase. For more information about the codecs provided with Hadoop, and how you can compress the source data in HDInsight, see the section Compressing the Source Data in Chapter 3. To enable compression for data within the query process you can use the following configuration properties:
Set the mapred.compress.map.output property to true to enable compression of the data output to disk at the end of the map phase, and set the mapred.map.output.compression.codec property to the name of the codec you want to use. For very complex HiveQL operations, which can generate multiple map operations on each server, set the hive.exec.compress.intermediate property to true so that data passed between each of these map operations is compressed.
Set the mapred.output.compress property to true to enable compression of the data output from the reduce stage (the job output data), and set the mapred.output.compression.codec property to the name of the codec you want to use. The default compression type is RECORD, which bases the compression algorithm on individual records. For a higher compression rate, but with an increased impact on performance, set the mapred.output.compression.type property to BLOCK so that the compression algorithm uses entire blocks of data.
You can set these properties dynamically in a query. For example, the following HiveQL query compresses the intermediate query results and, when applied to large volumes of data (typically over 1.5 GB), can improve query performance significantly.

HiveQL
SET mapred.compress.map.output=true;
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SELECT col1, SUM(col2) total FROM mytable GROUP BY col1;

Alternatively, you can add the mapred.compress.map.output, mapred.output.compress, mapred.map.output.compression.codec, and mapred.output.compression.type properties to the conf\mapred-site.xml configuration file; and add the hive.exec.compress.intermediate property to the hive-site.xml configuration file in the apps\dist\hive-[version]\conf\ folder.
You can change the compression settings when you submit a query by using a SET statement or by including -D parameters in the Hadoop command line. For information about changing configuration settings, see the section Configuring HDInsight Clusters and Jobs in Chapter 7 of this guide.
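The byte savings these settings aim for can be demonstrated with Python's zlib module, which implements the DEFLATE algorithm underlying the DEFLATE and GZip codecs. The repetitive sample payload below is invented, and real map output will compress by a different ratio:

```python
import zlib

# A repetitive text payload stands in for intermediate map output;
# log-style and delimited data of this kind usually compresses well.
payload = ("2013-05-01,click,productA\n" * 10_000).encode("utf-8")

compressed = zlib.compress(payload)
ratio = len(compressed) / len(payload)
print(f"{len(payload)} bytes -> {len(compressed)} bytes ({ratio:.2%})")

# The round trip is lossless, so queries see exactly the same records.
assert zlib.decompress(compressed) == payload
```

Every byte saved here is a byte that does not have to be written to disk at the end of the map phase, read back at the start of the reduce phase, or moved across the network, at the cost of the CPU time spent compressing and decompressing.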
Resolving Performance Limitations by Adjusting the Cluster Configuration

When executing a query, the number of instances of the map and reduce tasks significantly affects performance. You can change these settings to maximize the performance of a query, but it's vital to determine when this is appropriate. Specifying too many can reduce performance just as much as specifying too few. You should use the values in the job counters and other logs to decide if the number should be changed, and to measure the effects of the changes you make. The two Hadoop properties mapred.map.tasks and mapred.reduce.tasks specify the number of map and reduce task instances. In most cases you should set these dynamically at runtime, though you can also add them to the conf\mapred-site.xml configuration file. You can also change the number of map and reduce slots available in the cluster, which gives Hadoop more freedom to automatically allocate additional map and reduce tasks to a query, by setting the mapred.tasktracker.reduce.tasks.maximum and mapred.tasktracker.map.tasks.maximum property values.
If you need to change the memory configuration settings for the servers in a cluster, perhaps to reduce the number of spilled records, you can set the following configuration properties:
The io.sort.mb property. This property specifies the size in megabytes of the memory buffer used when sorting the output of the map operations. The default in the src\mapred\mapred-default.xml file is 100.
The io.sort.record.percent property. This property specifies the proportion of the memory buffer defined in io.sort.mb that is reserved for storing record boundaries of the map outputs (typically 16 bytes per record). The remaining space is used for the map output records. The default in the src\mapred\mapred-default.xml file is 0.05 (5%).
The io.sort.spill.percent property. This property specifies the fill level of the sort buffer at which Hadoop will spill additional data to disk. It applies to both the map output memory buffer and the record boundaries index. The default in the src\mapred\mapred-default.xml file is 0.80 (80%), and you may gain some additional performance by setting it to 1.
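The interaction between io.sort.mb and io.sort.spill.percent can be modeled with a deliberately simplified Python sketch. This ignores the record-boundary index controlled by io.sort.record.percent and Hadoop's concurrent spill-while-filling behavior, so it only shows the broad effect of the two settings:

```python
def simulate_sort_buffer(record_count, record_size, io_sort_mb=100, spill_percent=0.80):
    # Rough model of the map-side sort buffer: once the buffer fills past
    # io.sort.spill.percent of io.sort.mb, its contents are spilled to disk.
    buffer_bytes = io_sort_mb * 1024 * 1024
    threshold = buffer_bytes * spill_percent
    spills, used = 0, 0
    for _ in range(record_count):
        used += record_size
        if used >= threshold:
            spills += 1
            used = 0
    return spills

# With the defaults, two million 100-byte map output records force spills...
default_spills = simulate_sort_buffer(2_000_000, 100)
# ...while a larger buffer holds the same output without spilling.
bigger_buffer_spills = simulate_sort_buffer(2_000_000, 100, io_sort_mb=256)
print(default_spills, bigger_buffer_spills)  # 2 0
```

Each simulated spill corresponds to an extra round of storage I/O during the map phase, which is exactly the cost the Spilled Records counter makes visible in the job counters log.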
Another factor that can affect performance through storage I/O and network load is the replication of the output from a query, which occurs to provide redundancy should a cluster node fail. The default is to create three replicas of the output. If you do not need the output to be replicated you can set the value of the dfs.replication property at runtime, or edit it in the conf\hdfs-site.xml configuration file.

Resolving Performance Limitations by Adding More Servers to the Cluster

As a last resort, if you cannot achieve acceptable performance using any of the methods previously described, you may need to consider adding more servers to the cluster. This will typically require you to destroy and recreate the cluster, and reload the data. Adding servers to a cluster will also increase the runtime cost. If you are using Hadoop as a basic data warehouse this can be problematic because you will have to recreate all of the Hive table definitions and any intermediate datasets used by the cluster. However, in HDInsight you can create a new cluster that takes its data from an existing storage account, which contains the data used by the previous cluster. This will reduce the work involved, although you will still need to recreate the table definitions.
For more information about recreating a cluster using the existing data, see the section Retaining Data and Metadata for a Cluster in Chapter 7 of this guide.
Summary

This chapter described the core tasks when working with HDInsight: creating queries and transformations for data. The number of tools and technologies associated with Hadoop is growing by the day, but only some of these are fully supported in HDInsight. This chapter concentrates on these available tools and technologies, which account for the vast majority of tasks you are likely to need to perform. At the time of writing, Hive accounted for around 80% of all HDInsight jobs, and so this chapter concentrates first on explaining what Hive is and what you can do with it. The chapter includes only a rudimentary description of how you can use Hive, in just enough depth to introduce the concepts, make you aware of the issues to consider when using it, and help you recognize Hive queries as you learn more from the huge reservoir of knowledge and examples available on the web. Next, the chapter covers the other two main query technologies, Pig and custom map/reduce, in enough depth for you to understand their usage and purpose. The chapter also covers subsidiary topics related to Big Data jobs such as using the Hadoop Streaming interface to write map/reduce code in non-Java languages, using user-defined functions (UDFs) within your queries, and unifying your queries by using HCatalog. Finally, the chapter covers the next two related stages of the typical Big Data query cycle: evaluating the results to ensure they are valid and useful, and tuning the solution.

More Information

For more information about HDInsight, see the Windows Azure HDInsight web page. A central point for TechNet articles about HDInsight is HDInsight Services For Windows. For examples of how you can use HDInsight, see the following tutorials on the HDInsight website:
Using Hive with HDInsight
Using Pig with HDInsight
Using MapReduce with HDInsight
For more information about maximizing performance in HDInsight see the following Microsoft White Papers: Hadoop Job Optimization Performance of Hadoop on Windows in Hyper-V Environments
Chapter 5: Consuming and Visualizing the Results

This chapter describes a variety of ways in which you can consume data from HDInsight for analysis and reporting. It explores the typical tools that are part of the Microsoft data platform, and that can integrate directly or indirectly with HDInsight. It also demonstrates how you can generate useful and attractive visualizations of the information you generate using HDInsight.

Challenges

Unlike a traditional relational database system, HDInsight does not automatically expose its data in the standard row and column format. For example, the results of a query may be just an output file stored in Windows Azure blob storage. In a typical Big Data analysis solution the results of custom map/reduce or Pig processing are generated as files in the HDFS file system (which, for HDInsight on Windows Azure, is hosted in Windows Azure blob storage). One of the challenges is to be able to connect to and access this data in order to consume it in an appropriate tool or application. HDInsight can, however, generate the equivalent of a row and column dataset, which is typically a better match for most data analysis tools and for import into another database system. It is common to create Hive tables over data stored in files in order to provide a query interface to that data. You can then use the ODBC driver for Hive to consume the results of an HDInsight query, even when the data was originally generated by a Pig or map/reduce operation during the earlier stages of the HDInsight solution.
When you generate results in HDInsight that you want to consume using other tools, you should consider generating these results in a format that will best suit your target tools. The format should provide an efficient import process that accurately copies the data without affecting the values or formatting in a way that could introduce errors. Typically, you will layer a Hive query over the results of other queries so that the data is available in the traditional row and column format. A second challenge arises when you want to consume the results of a query process in an enterprise database or business intelligence (BI) system, as discussed in Chapter 2 of this guide. In addition to performing the appropriate data cleansing and pre-processing tasks, you must consider how to create a data export process that can successfully push the data into the database or BI system, and how you can automate this process. You may decide to automate some or all of the stages of consuming data that are described in this chapter. Chapter 6 of this guide provides details of how you can create automated solutions using HDInsight and a range of other tools and frameworks.

Options for Consuming Data from HDInsight

The previous chapter of this guide described two rudimentary ways to view the result of an HDInsight query to validate that it produces meaningful and useful results. It demonstrated how you can use the interactive console in the HDInsight management dashboard to view the contents of the data files, and to create simple graphical data visualizations. These techniques are useful during the development of an HDInsight solution, and in some cases might provide a usable solution for simple analysis. However, in most enterprise environments the results of HDInsight data processing are consumed in analytical or reporting tools that enable business users to explore and visualize the insights gained from the Big Data process.
This chapter explores the options for consuming data processing results from HDInsight, and provides guidance about using each option in the context of the architectural patterns for HDInsight solutions described earlier in this guide. The topics discussed in this chapter are:

- Consuming HDInsight data in Microsoft Excel.
- Consuming HDInsight data in SQL Server Reporting Services (SSRS) and Windows Azure SQL Reporting.
- Consuming HDInsight data in SQL Server Analysis Services (SSAS).
- Consuming HDInsight data in a SQL Server database.
- Custom and third-party analysis and visualization tools.
There is a huge range of data analysis and BI tools available from Microsoft and from third parties that you can use with your data warehouse and HDInsight systems. Microsoft products include server-based solutions such as SQL Server Reporting Services and SharePoint Server, cloud-based solutions such as Windows Azure SQL Reporting, and desktop or Office applications such as Excel and Word.
A high level diagram of the options for consuming data from HDInsight that are discussed in this chapter is shown in Figure 1.
Figure 1
Options for consuming data from HDInsight

Figure 1 represents a map that shows the various routes through which data can be delivered from an HDInsight cluster to client applications. The client application destinations shown on the map include Microsoft Excel, SQL Server Reporting Services (SSRS), and Windows Azure SQL Reporting; however, you can also create custom applications or use common enterprise BI data visualization technologies, such as PerformancePoint Services in Microsoft SharePoint Server, to consume data from HDInsight, directly or indirectly, by using the same interfaces.

Consuming HDInsight Data in Microsoft Excel

Microsoft Excel is one of the most widely used software applications in the world, and is commonly used as a tool for interactive data analysis and reporting. It supports a wide range of data import and connectivity options, including:

- Built-in data connectivity to many types of data sources.
- The Data Explorer add-in.
- PowerPivot.
Excel includes a range of analytical tools and visualizations that you can apply to tables of data in one or more worksheets within a workbook. Additionally, all Excel 2013 workbooks encapsulate a data model in which you can define tables of data and relationships between them. These data models make it easier to slice and dice data in PivotTables and PivotCharts, and to create Power View visualizations. Excel is especially useful when you want to add value and insight by augmenting the results of your data analysis with external data. For example, you may perform an analysis of social media sentiment data by geographical region in HDInsight, and consume the results in Excel. This geographically oriented data can be enhanced by subscribing to a demographic dataset in the Windows Azure Data Market, from which socio-economic and population data may provide an insight into why your organization is more popular in some locations than in others. As well as datasets that you can download and use to augment your results, the Windows Azure Data Market includes a number of data services that you can use for data validation (for example, verifying that telephone numbers and postal codes are valid) and for data transformation (for example, looking up the country or region, state, and city for a particular IP address or longitude/latitude value).

Built-In Data Connectivity

Excel includes native support for importing data from a wide range of data sources including web sites, OData feeds, relational databases, and more. These data connectivity options are available on the Data tab of the ribbon, and enable business users to specify data source connection information and select the data to be imported through a wizard interface. The data is imported into a worksheet, and in Excel 2013 it can be added to the workbook data model and used for data visualization in a PivotTable Report, PivotChart, Power View Report, or GeoFlow map.
The built-in data connectivity capabilities of Excel include support for ODBC sources, and are an appropriate option for consuming data from HDInsight when the data is accessible through Hive tables. To use this option the Hive ODBC driver must be installed on the client computer where Excel will be used. There are 32-bit and 64-bit versions of the driver, available for download as the Microsoft ODBC Driver for Hive. You should install the one that matches the version of Windows on the target computer. Installing the 64-bit version also installs the 32-bit version, so you will be able to connect to Hive from both 64-bit and 32-bit applications. You can simplify the process of connecting to HDInsight by using the Data Sources (ODBC) administrative tool to create a data source name (DSN) that encapsulates the ODBC connection information, as shown in Figure 2. Creating a DSN makes it easier for business users with limited experience of configuring data connections to import data from Hive tables defined in HDInsight. If you set up both a 32-bit and a 64-bit DSN using the same name, clients will use the appropriate one automatically.
Figure 2
Creating an ODBC DSN for Hive

Almost all of the techniques for consuming data from HDInsight rely on the ODBC driver that you can obtain and install from the Windows Azure HDInsight management portal. The ODBC driver provides a standardized data exchange mechanism that works with most Microsoft tools and applications, and with tools and applications from many third-party vendors. It also makes it easy to consume the data in your own custom applications.
After you create a DSN, users can specify it in the data connection wizard in Excel and then select a Hive table or view to be imported into the workbook, as shown in Figure 3.
Figure 3
Importing a Hive table into Excel

The Hive tables shown in Figure 3 are used as an example throughout this chapter. These tables contain UK meteorological observation data for 2012, obtained from the UK Met Office Weather Open Data dataset in the Windows Azure Data Market at https://datamarket.azure.com/dataset/datagovuk/metofficeweatheropendata.

Using the built-in data connectivity capabilities of Excel is a good solution when business users need to perform simple analysis and reporting based on the results of an HDInsight query, as long as the data can be encapsulated in Hive tables. Data import can be performed with any client computer on which the ODBC driver for Hive is installed, and is supported in all editions of Excel. In Excel 2013 users can add the imported data to the workbook data model and combine it with data from other sources to create analytical mashups. However, in scenarios where complex data modeling is the primary goal, using an edition of Excel that supports PowerPivot offers greater flexibility and modeling functionality, and makes it easier to create PivotTable Report, PivotChart, Power View, and GeoFlow visualizations. The following table describes specific considerations for using Excel's built-in data connectivity in the HDInsight solution architectures described earlier in this guide.

Standalone experimentation and visualization: Built-in data connectivity in Excel is a suitable choice when the results of the data processing can be encapsulated in a Hive table, or a query with simple joins can be encapsulated in a Hive view, and the volume of data is sufficiently small to support interactive connectivity with tolerable response times.

Data transfer, data cleansing, and ETL: Most ETL scenarios are designed to transform Big Data into a suitable structure and volume for storage in a relational data source for further analysis and querying. While Excel may be used to consume the data from the relational data source after it has been transferred from HDInsight, it is unlikely that an Excel workbook would be the direct target for the ETL process.

Basic data warehouse or commodity storage: When HDInsight is used to create a basic data warehouse containing Hive tables, business users can use the built-in data connectivity in Excel to consume data from those tables for analysis and reporting. However, for complex data models that require multiple related tables and queries with complex joins, PowerPivot may be a better choice.

Integration with Enterprise BI: Importing data from a Hive table and combining it with data from a BI data source (such as a relational data warehouse or corporate data model) is an effective way to accomplish report-level integration with an enterprise BI solution. However, in self-service analysis scenarios advanced users such as business analysts may require a more comprehensive data modeling solution such as that offered by PowerPivot.
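As an example of the Hive view approach, a view that encapsulates a simple join can present Excel users with a single flat dataset to import over ODBC. The table and column names in this sketch are hypothetical assumptions:

```sql
-- Sketch only: a Hive view joining a results table to a lookup table
-- so that the joined, row-and-column result can be imported as a
-- single object. All names are illustrative assumptions.
CREATE VIEW observations_by_region AS
SELECT o.obs_month, r.region_name, o.avg_temp
FROM observations o
JOIN regions r ON o.region_code = r.region_code;
```

Users can then select the view in the Excel data connection wizard exactly as they would a Hive table.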
While Excel can, itself, provide a platform for wide-ranging exploration and visualization of data, the powerful add-in tools such as Data Explorer, PowerPivot, Power View, and GeoFlow can make this task easier and help you to create more meaningful and immersive results.

The Data Explorer Add-In

The Data Explorer add-in enhances Excel by providing a comprehensive interface for querying a wide range of data sources, and exploring the results. Data Explorer includes a data source provider for Windows Azure HDInsight, which enables users to browse the HDFS folders in Windows Azure blob storage that are associated with an HDInsight cluster. Users can then import data from files such as those generated by map/reduce jobs and Pig scripts, and import the underlying data files associated with Hive tables. After the Data Explorer add-in has been installed, the From Other Sources option on the Data Explorer tab in the Excel ribbon can be used to import data from Windows Azure HDInsight. The user specifies the account name and key of the Azure blob store, not the HDInsight cluster itself. This enables organizations to use this technique to consume the results of HDInsight processing even if the cluster has been released, significantly reducing costs if no further HDInsight processing is required. Keeping an HDInsight cluster running when it is not executing queries, just so that you can access the data, incurs charges to your Windows Azure account. If you are not using the cluster, you can close it down but still be able to access the data at any time using Data Explorer in Excel, or any other tool that can access Windows Azure blob storage. After connecting to the Azure blob store the user can select a data file, convert its contents to a table by specifying the appropriate delimiter, and modify the data types of the columns before importing it into a worksheet, as shown in Figure 4.
Figure 4
Importing data from HDInsight with the Data Explorer add-in

The imported data can be added to the workbook data model, or analyzed directly in the worksheet. The following table describes specific considerations for using Data Explorer in the HDInsight solution architectures described earlier in this guide.

Standalone experimentation and visualization: Data Explorer is a good choice when HDInsight data processing techniques such as map/reduce jobs or Pig scripts generate files that contain the results to be analyzed or reported. The HDInsight cluster can be deleted after the processing is complete, leaving the results in the Azure blob storage ready to be consumed by business users in Excel.

Data transfer, data cleansing, and ETL: Most ETL scenarios are designed to transform Big Data into a suitable structure and volume for storage in a relational data source for further analysis and querying. It is unlikely that Data Explorer would be used to consume data files from the Windows Azure blob storage associated with the HDInsight cluster, though it may be used to consume data from the relational data store loaded by the HDInsight ETL process.

Basic data warehouse or commodity storage: When HDInsight is used to implement a basic data warehouse it usually includes a schema of Hive tables that are queried over ODBC connections. While it is technically possible to use Data Explorer to import data from the files on which the Hive tables are based, more complex Hive solutions that include partitioned tables would be difficult to consume this way.

Integration with Enterprise BI: Importing data from a file in Windows Azure blob storage and combining it with data from a BI data source (such as a relational data warehouse or corporate data model) is an effective way to accomplish report-level integration with an enterprise BI solution. However, in self-service analysis scenarios advanced users such as business analysts may require a more comprehensive data modeling solution such as that offered by PowerPivot.
At the time this guide was written the Data Explorer add-in was a preview version. You can download it from the Microsoft Download Center.
PowerPivot

The growing awareness of the value of decisions based on proven data, combined with advances in data analysis tools and techniques, has resulted in an increased demand for versatile analytical data models that support ad-hoc analysis (the self-service approach). PowerPivot is an Excel-based technology in Microsoft Office 2013 Professional Plus that enables advanced users to create complex data models that include hierarchies for drill-up/drill-down aggregations, Data Analysis Expressions (DAX) calculated measures, key performance indicators (KPIs), and other features not available in basic data models. PowerPivot is also available as an add-in for previous releases of Microsoft Excel. PowerPivot uses xVelocity compression technology to support in-memory data models that enable high-performance analysis, even with extremely large volumes of data.
Users can create and edit PowerPivot data models by using the PowerPivot Manager interface, which is accessed from the PowerPivot tab of the ribbon in Excel. Through this interface, users can enhance tables that have been added to the workbook data model by other means, and also import multiple tables from one or more data sources into the data model and define relationships between them. Figure 5 shows a data model in PowerPivot Manager.
Figure 5 PowerPivot Manager
PowerPivot brings many of the capabilities of enterprise BI to Excel, enabling business analysts to create personal data models for sophisticated self-service data analysis. Users can share PowerPivot workbooks through SharePoint Server where they can be viewed interactively in a browser, enabling teams of analysts to collaborate on data analysis and reporting.
The following table describes specific considerations for using PowerPivot in the HDInsight solution architectures described earlier in this guide.
Standalone experimentation and visualization: PowerPivot is a good choice when HDInsight data processing results can be encapsulated in Hive tables and imported directly into the PowerPivot data model over an ODBC connection, or when output files can be imported into the Excel data model using Data Explorer. PowerPivot enables business analysts to enhance the data model and share it across teams through SharePoint Server.

Data transfer, data cleansing, and ETL: Most ETL scenarios are designed to transform Big Data into a suitable structure and volume for storage in a relational data source for further analysis and querying. In this scenario, PowerPivot may be used to consume data from the relational data store loaded by the ETL process.

Basic data warehouse or commodity storage: When HDInsight is used to implement a basic data warehouse it usually includes a schema of Hive tables that are queried over ODBC connections. PowerPivot makes it easy to import multiple Hive tables into a data model, define relationships between them, and refresh the model with new data as the data files in HDInsight are updated.

Integration with Enterprise BI: PowerPivot is a core feature in Microsoft-based BI solutions, and enables self-service analysis that combines enterprise BI data with Big Data processing results from HDInsight at the report level.
Visualizing Data in Excel

After the data has been imported using any of the previously described techniques, business users can employ the full range of Excel analytical and visualization functionality to explore the data and create reports that include color-coded data values, charts, indicators, and interactive slicers and timelines.
PivotTables and PivotCharts

PivotTables and PivotCharts are a common way to aggregate data values across multiple dimensions. For example, having imported a table of weather data, users could use a PivotTable or PivotChart to view temperature aggregated by month and geographical region, as shown in Figure 6.
Figure 6 Analyzing data with a PivotTable and a PivotChart in Excel
PivotTables and PivotCharts are typically most useful for aggregating data across multiple dimensions. If you want to generate interactive graphical reports that allow users to explore relationships, then Power View is a better choice.
Power View

Power View is a data visualization technology that enables interactive, graphical exploration of data in a data model. Power View is available as a component of SQL Server 2012 Reporting Services when integrated with SharePoint Server, but is also available in Excel 2013. By using Power View, users can create interactive visualizations that make it easy to explore relationships and trends in the data. Figure 7 shows how Power View can be used to visualize the weather data in the PowerPivot data model described previously.
Figure 7 Visualizing data with Power View
GeoFlow

GeoFlow is an add-in for Microsoft Excel 2013 Professional Plus that enables you to visualize geospatial and temporal analytical data on a map. With GeoFlow you can display geographically related values on an interactive map, and create a virtual tour that shows the data in 3D. If the data has a temporal dimension you can incorporate time-lapse animation into the tour to illustrate how the values change over time. Figure 8 shows a GeoFlow visualization in Excel 2013.
Figure 8
Visualizing geospatial data with GeoFlow

GeoFlow was a pre-release add-in for Excel at the time this guide was written. You can download the GeoFlow add-in from the Microsoft Office Download Center.

Consuming HDInsight Data in SQL Server Reporting Services

SQL Server 2012 Reporting Services (SSRS) provides a platform for delivering reports that can be viewed interactively in a Web browser, printed, or exported and delivered automatically in a wide range of electronic document formats, including Microsoft Excel, Microsoft Word, Portable Document Format (PDF), image, and XML. Reports consist of data regions such as tables of data values or charts, which are based on datasets that encapsulate queries. The datasets used by a report are in turn based on data sources, which can include relational databases, corporate data models in SQL Server Analysis Services, or other OLE DB or ODBC sources.
If you load the results of an HDInsight query into SQL Server or Windows Azure SQL Database you can use the database reporting tools to produce reports in a variety of formats that are easily accessible to users. SQL Server Reporting Services offers a wide range of options and fine control over the final reports. Windows Azure SQL Reporting is less powerful, but can still meet most users' needs and requires no on-premises resources. Reports can be created by professional report authors, who are usually specialist BI developers, using the Report Designer tool in SQL Server Data Tools. This tool provides a Visual Studio-based interface for creating data sources, datasets, and reports, and for publishing a project of related items to a report server. Figure 9 shows Report Designer being used to create a report based on data in HDInsight, which is accessed through an ODBC data source and a dataset that queries a Hive table.
Figure 9
Report Designer

While many businesses rely on reports created by BI specialists, an increasing number of organizations are empowering business users to create their own self-service reports. To support this scenario business users can use Report Builder, shown in Figure 10. This is a simplified report authoring tool that is installed on demand from a report server. To further simplify self-service reporting you can have BI professionals create and publish shared data sources and datasets that can be easily referenced in Report Builder, reducing the need for business users to configure connections or write queries.
Figure 10 Report Builder
Reports are published to a report server, where they can be accessed interactively through a web interface. When Reporting Services is installed in native mode, the interface is provided in a web application named Report Manager, as shown in Figure 11. You can also install Reporting Services in SharePoint-integrated mode, in which case the reports are accessed in a SharePoint document library.
Figure 11 Viewing a report in Report Manager
Reporting Services provides a powerful platform for delivering reports in an enterprise, and can be used effectively in a Big Data solution. The following table describes specific considerations for using Reporting Services in the HDInsight solution architectures described earlier in this guide.
Standalone experimentation and visualization: For one-time analysis and data exploration, Excel is generally a more appropriate tool than Reporting Services because it requires less in the way of infrastructure configuration and provides a more dynamic user interface for interactive data exploration.

Data transfer, data cleansing, and ETL: Most ETL scenarios are designed to transform Big Data into a suitable structure and volume for storage in a relational data source for further analysis and querying. In this scenario, Reporting Services may be used to consume data from the relational data store loaded by the ETL process.

Basic data warehouse or commodity storage: When HDInsight is used to implement a basic data warehouse it usually includes a schema of Hive tables that are queried over ODBC connections. In this scenario you can use Reporting Services to create ODBC-based data sources and datasets that query Hive tables.

Integration with Enterprise BI: You can use Reporting Services to integrate data from HDInsight with enterprise BI data at the report level by creating reports that display data from multiple data sources. For example, you could use an ODBC data source to connect to HDInsight and query Hive tables, and an OLE DB data source to connect to a SQL Server data warehouse. However, in an enterprise BI scenario that combines corporate data and Big Data in formal reports, better integration can generally be achieved by integrating at the data warehouse or corporate data model level, and by using a single data source in Reporting Services to connect to the integrated data.

Windows Azure SQL Reporting

Windows Azure SQL Reporting is a cloud-based service that offers similar functionality to SSRS, but with some limitations, the most significant of which is that Windows Azure SQL Reporting can only be used to consume data from Windows Azure SQL Database.
Therefore, to include the results of HDInsight data processing in a report, you must transfer the results from HDInsight to Windows Azure SQL Database by using a technology such as Sqoop (which is described later in this chapter). Although Windows Azure SQL Reporting is a cloud-based service that natively generates browser-based reports, you can also use it to create reports in multiple formats for rendering in a browser, in Report Viewer, and in custom applications built with Windows Forms or other technologies. Some other features of SSRS are not supported in Windows Azure SQL Reporting; for example, you cannot schedule report execution, create subscriptions that automatically distribute reports by email, or use customized report extensions. For a comparison of SSRS and SQL Reporting see Guidelines and Limitations for Windows Azure SQL Reporting. You can use Report Designer or Report Builder to author reports for Windows Azure SQL Reporting, and upload them to a Windows Azure-based report server. The following table describes specific considerations for using Windows Azure SQL Reporting in the HDInsight solution architectures described earlier in this guide.
Standalone experimentation and visualization: The requirement to transfer the results from HDInsight into Windows Azure SQL Database adds complexity to the process of visualizing the results, so this approach should only be used if you intend to export the data to Windows Azure SQL Database or you specifically need cloud-based reporting of the output from HDInsight.

Data transfer, data cleansing, and ETL: If the target of the HDInsight-based ETL process is Windows Azure SQL Database, then Windows Azure SQL Reporting may be a suitable choice for reporting the results.

Basic data warehouse or commodity storage: When HDInsight is used to implement a basic data warehouse it usually includes a schema of Hive tables that are queried over ODBC connections. Since these cannot be accessed by Windows Azure SQL Reporting, it is generally unsuitable for this kind of solution.

Integration with Enterprise BI: Most enterprise BI solutions are implemented in on-premises data warehouses and data models. Unless a subset of the BI solution is replicated to Windows Azure SQL Database for cloud-based reporting, Windows Azure SQL Reporting is generally not suitable for enterprise BI integration.

Consuming HDInsight Data in SQL Server Analysis Services

In many organizations, enterprise BI reporting and analytics is a driver for business processes and decision making that involves multiple users in different parts of the business. In these kinds of scenarios, analytical data is often structured into a corporate data model (often referred to colloquially as a cube) and accessed from multiple client applications and reporting tools. Corporate data models help simplify the creation of analytical mashups and reports, and ensure consistent values are calculated for core business measures and key performance indicators across the organization.
Appendix B of this guide provides an overview of enterprise BI systems that might be useful if you are not familiar with the terminology and some special factors inherent in such systems. In the Microsoft data platform, SQL Server Analysis Services (SSAS) 2012 provides a mechanism for creating and publishing corporate data models that can be used as a source for PivotTables in Excel, reports in SQL Server Reporting Services, dashboards in SharePoint Server PerformancePoint Services, and other BI reporting and analysis tools. You can install SSAS in one of two modes: tabular mode or multidimensional mode.

SSAS in Tabular Mode

When installed in tabular mode, SSAS 2012 can be used to host tabular data models that are based on the xVelocity in-memory analytics engine. These models use the same technology and design as PowerPivot data models in Excel, but can be scaled to handle much larger volumes of data and secured using enterprise-level role-based security. You create tabular data models in the Visual Studio-based SQL Server Data Tools development environment, and you can choose to create the data model from scratch or import an existing PowerPivot for Excel workbook. Because tabular models support ODBC data sources you can easily include data from Hive tables in the data model, and you can use HiveQL queries to pre-process the data as it is imported to create the schema that best suits your analytical and reporting goals. You can then use the modeling capabilities of SQL Server Analysis Services to create relationships, hierarchies, and other custom model elements to support the analysis that users need to perform. Figure 12 shows a tabular model for the weather data used earlier in this chapter.
Figure 12
A tabular Analysis Services data model
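When importing Hive data into a tabular model over ODBC, a HiveQL query can pre-shape the data rather than importing a table verbatim. The following sketch aggregates raw observations into the granularity needed by the model; the table and column names are hypothetical assumptions, not the guide's actual sample schema:

```sql
-- Sketch only: a HiveQL query used as the source for a tabular model
-- table, aggregating during import so the model holds only the level
-- of detail needed for analysis. All names are illustrative assumptions.
SELECT region,
       obs_month,
       AVG(temperature) AS avg_temp,
       MAX(wind_speed)  AS max_wind
FROM observations
WHERE obs_year = 2012
GROUP BY region, obs_month;
```

Aggregating at import time like this can also reduce the volume of data held in the in-memory model.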
Tabular mode in SQL Server Analysis Services supports loading data through the ODBC driver, which means that you can connect to the results of a Hive query to analyze the data. However, multidimensional mode does not support ODBC, and so you must either import the data into SQL Server, or configure a linked server to pass through the query and results. The following table describes specific considerations for using tabular SSAS data models in the HDInsight solution architectures described earlier in this guide.
Standalone experimentation and visualization: If the results of HDInsight data processing can be encapsulated in Hive tables, and multiple users must perform consistent analysis and reporting of the results, then an SSAS tabular model is an easy way to create a corporate data model for analysis. However, if the analysis will only be performed by a small group of specialist users, you can probably achieve this by using PowerPivot in Excel.

Data transfer, data cleansing, and ETL: If the target of the HDInsight-based ETL process is a relational database then you might build a tabular data model based on the tables in the database to enable enterprise-level analysis and reporting.

Basic data warehouse or commodity storage: When HDInsight is used to implement a basic data warehouse it usually includes a schema of Hive tables that are queried over ODBC connections. A tabular data model is a good way to create a consistent analytical and reporting schema based on the Hive tables.

Integration with Enterprise BI: If the enterprise BI solution already uses tabular SSAS models, you can add Hive tables to these models and create any necessary relationships to support integrated analysis of corporate BI data and Big Data results from HDInsight. However, if the corporate BI data warehouse is based on a dimensional model that includes surrogate keys and slowly changing dimensions it can be difficult to define relationships between tables in the two data sources, and integration may be better accomplished at the data warehouse level.
SSAS in Multidimensional Mode

As an alternative to tabular mode, SSAS can be installed in multidimensional mode, which is the only mode supported by releases of SSAS prior to SQL Server 2012. Multidimensional mode provides support for a more established online analytical processing (OLAP) approach to cube creation, and includes some features that are not supported, or are difficult to implement, in tabular data models (such as the ability to aggregate semi-additive measures across accounting dimensions, or include international translations in the cube definition). Additionally, if you plan to use SSAS data mining functionality you must install SSAS in multidimensional mode. Multidimensional data models can be built on OLE DB data sources, but due to some restrictions in the way cube elements are implemented you cannot use an ODBC data source for a multidimensional data model. This means that there is no way to directly connect dimensions or measure groups in the data model to Hive tables in an HDInsight cluster. To use HDInsight data as a source for a multidimensional data model in SSAS you must either transfer the data from HDInsight to a relational database system such as SQL Server, or define a linked server in a SQL Server instance that can act as a proxy and pass queries through to Hive tables in HDInsight. The use of linked servers to access Hive tables from SQL Server, and techniques for transferring data from HDInsight to SQL Server, are described in the next section of this chapter. Figure 13 shows a multidimensional data model in SQL Server Data Tools. This data model is based on views in a SQL Server database, which are in turn based on queries against a linked server that references Hive tables in HDInsight.
Figure 13 A multidimensional Analysis Services data model

The following table describes specific considerations for using multidimensional SSAS data models in the HDInsight solution architectures described earlier in this guide.

- Standalone experimentation and visualization: For one-time analysis, or analysis by a small group of users, the requirement to use a relational database such as SQL Server as a proxy or interim host for the HDInsight results means that this approach involves more effort than using a tabular data model or just analyzing the data in Excel.
- Data transfer, data cleansing, and ETL: If the target of the HDInsight-based ETL process is a relational database that can be accessed through an OLE DB connection, you might build a multidimensional data model based on the tables in the database to enable enterprise-level analysis and reporting.
- Basic data warehouse or commodity storage: When HDInsight is used to implement a basic data warehouse it usually includes a schema of Hive tables that are queried over ODBC connections. To include these tables in a multidimensional SSAS data model you would need to use a linked server in a SQL Server instance to access the Hive tables on behalf of the data model, or transfer the data from HDInsight to a relational database. Unless you specifically require advanced OLAP capabilities that can only be provided in a multidimensional data model, a tabular data model is probably a better choice.
- Integration with Enterprise BI: If the enterprise BI solution already uses multidimensional SSAS models that you want to extend to include data from HDInsight, you should integrate the data at the data warehouse level and base the data model on the relational data warehouse.

Consuming HDInsight Data in a SQL Server Database

Many organizations already use Microsoft data platform components to perform data analysis and reporting, and to implement enterprise BI solutions. 
Additionally, many tools and reporting applications make it easier to query and analyze data in a relational database than to consume the data directly from HDInsight. For example, as discussed previously in this chapter, Windows Azure SQL Reporting Services can only consume data from Windows Azure SQL Database, and multidimensional SSAS data models can only be based on relational data sources. After you discover from experimentation a query that provides the data input you require, you will probably want to integrate HDInsight with your existing BI system. This allows you to easily repeat the query on a regular basis and use it to update your knowledge base. The wide-ranging support for working with data in a relational database means that, in many scenarios, the optimal way to perform Big Data analysis is to process the data in HDInsight but consume the results through a relational database system such as SQL Server. You can accomplish this by enabling the relational database to act as a proxy or interface that transparently accesses the data in HDInsight on behalf of its client applications, or by transferring the results of HDInsight data processing to tables in the relational database.

Linked Servers

Linked servers are server-level connection definitions in a SQL Server instance that enable queries in the local SQL Server engine to reference tables in remote servers. You can use the ODBC driver for Hive to create a linked server in a SQL Server instance that references an HDInsight cluster, enabling you to execute Transact-SQL queries that reference Hive tables. To create a linked server you can either use the graphical tools in SQL Server Management Studio or the sp_addlinkedserver system stored procedure, as shown in the following code sample. 
Transact-SQL
EXEC master.dbo.sp_addlinkedserver
    @server = N'HDINSIGHT',
    @srvproduct = N'Hive',
    @provider = N'MSDASQL',
    @datasrc = N'HiveDSN',
    @provstr = N'Provider=MSDASQL.1;Persist Security Info=True;User ID=UserName;Password=P@ssw0rd;'

After you have defined the linked server you can use the Transact-SQL OpenQuery function to execute pass-through queries against the Hive tables in the HDInsight data source, as shown in the following code sample.

Transact-SQL
SELECT * FROM OpenQuery(HDINSIGHT, 'SELECT * FROM Observations');
Using a four-part distributed query against the linked server, instead of the OpenQuery function, is not always possible because the syntax of HiveQL differs from Transact-SQL in several ways. By using this technique you can create views in a SQL Server database that act as pass-through queries against Hive tables, as shown in Figure 14. These views can then be queried by analytical tools that connect to the SQL Server database.
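As an illustration of the pass-through view technique, the following sketch generates the Transact-SQL for such a view. The view name HiveObservations is a hypothetical example; the linked server and Hive table names follow the earlier sample, and you would run the generated script against your own SQL Server database (for example, with sqlcmd or SQL Server Management Studio).

```shell
# Generate the DDL for a hypothetical pass-through view over the HDINSIGHT
# linked server defined earlier. The script is written to a file and shown
# here; in practice you would execute create_view.sql against SQL Server.
cat > create_view.sql <<'SQL'
CREATE VIEW dbo.HiveObservations
AS
SELECT * FROM OpenQuery(HDINSIGHT, 'SELECT * FROM Observations');
SQL
cat create_view.sql
```

Analytical tools can then query dbo.HiveObservations exactly as they would any other view, while SQL Server passes the inner query through to Hive.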
Figure 14 Using HDInsight as a linked server over ODBC

The following table describes specific considerations for using linked servers in the HDInsight solution architectures described earlier in this guide.

- Standalone experimentation and visualization: For one-time analysis, or analysis by a small group of users, the requirement to use a relational database such as SQL Server as a proxy or interim host for the HDInsight results means that this approach involves more effort than using a tabular data model or just analyzing the data in Excel.
- Data transfer, data cleansing, and ETL: Generally, the target of the ETL processes is a relational database, making a linked server to Hive tables unnecessary.
- Basic data warehouse or commodity storage: Depending on the volume of data in the data warehouse, and the frequency of queries against the back-end Hive tables, creating a SQL Server façade for a Hive-based data warehouse might make it easier to support a wide range of client applications. A linked server is a suitable solution for populating data models on a regular basis when they are processed during out-of-hours periods, or for periodically refreshing cached datasets for Reporting Services. However, the performance of pass-through queries over an ODBC connection may not be sufficient to meet your users' expectations for interactive querying and reporting directly in client applications such as Excel.
- Integration with Enterprise BI: If the ratio of Hive tables to data warehouse tables is small, or the Hive tables are queried relatively rarely, a linked server might be a suitable way to integrate data at the data warehouse level. However, if there are many Hive tables, or if the data in the Hive tables must be tightly integrated into a dimensional data warehouse schema, it may be more effective to transfer the data from HDInsight to local tables in the data warehouse. 
PolyBase

PolyBase is a data integration technology in SQL Server 2012 Parallel Data Warehouse (PDW) that enables data in an HDInsight cluster to be queried as native tables in the SQL Server data warehouse. SQL Server PDW is an edition of SQL Server that is only available pre-installed on a data warehouse hardware appliance, and it uses a massively parallel processing (MPP) architecture to implement highly scalable relational data warehouse solutions. PolyBase enables parallel data movement between SQL Server and HDInsight, and supports standard Transact-SQL semantics such as GROUP BY and JOIN clauses that reference large volumes of data in HDInsight. This enables SQL Server PDW to provide an enterprise-scale data warehouse solution that combines relational data in data warehouse tables and data in an HDInsight cluster. For more information about SQL Server PDW and PolyBase, see the page PolyBase on the SQL Server website.

The following table describes specific considerations for using PolyBase in the HDInsight solution architectures described earlier in this guide.

- Standalone experimentation and visualization: For one-time analysis, or analysis by a small group of users, the requirement to use a SQL Server PDW appliance may be cost-prohibitive.
- Data transfer, data cleansing, and ETL: Generally, the target of the ETL process is a relational database, making PolyBase integration with HDInsight unnecessary.
- Basic data warehouse or commodity storage: If the volume of data and number of query requests are extremely high, using a SQL Server PDW appliance as a data warehouse that includes HDInsight data through PolyBase might be the most cost-effective way to achieve the levels of performance and scalability that your data warehousing solution requires. 
- Integration with Enterprise BI: If your enterprise BI solution already uses SQL Server PDW, or the combined scalability and performance requirements for enterprise BI and Big Data analysis are extremely high, using SQL Server PDW with PolyBase might be a suitable solution. However, you should note that PolyBase does not inherently integrate HDInsight data into a dimensional data warehouse schema, so if you need to include Big Data in dimension members that use surrogate keys, or need to support slowly changing dimensions, you may need to do some additional integration work.

Sqoop

Sqoop is a Hadoop technology, included in HDInsight, that is designed to make it easy to transfer data between Hadoop clusters and relational databases. You can use Sqoop to export data from HDInsight data files to SQL Server database tables by specifying the location of the data files to be exported, and a JDBC connection string for the target SQL Server instance. For example, you could run the following command on an HDInsight server to copy the data in the /hive/warehouse/observations folder to the observations table in a Windows Azure SQL Database named mydb located in a server named jkty65.

Sqoop command
sqoop export --connect "jdbc:sqlserver://jkty65.database.windows.net:1433;database=mydb;user=username@jkty65;password=Pa$$w0rd;logintimeout=30;" --table observations --export-dir /hive/warehouse/observations

A clustered index must be defined on the destination table in Windows Azure SQL Database. A key requirement is that network connectivity can be successfully established between the HDInsight cluster where the Sqoop command is executed and the SQL Server instance to which the data will be transferred. 
When used with HDInsight for Windows Azure, this means that the SQL Server instance must be accessible from the Windows Azure service where the cluster is running, which may not be permitted by security policies in organizations where the target SQL Server instance is hosted in an on-premises data center. Sqoop is therefore generally a good solution for transferring data from HDInsight to Windows Azure SQL Database servers, or to instances of SQL Server hosted in virtual machines running in Windows Azure, but it can present connectivity challenges when used with on-premises database servers due to network security restrictions.

The following table describes specific considerations for using Sqoop in the HDInsight solution architectures described earlier in this guide.

- Standalone experimentation and visualization: For one-time analysis, or analysis by a small group of users, using Sqoop is a simple way to transfer the results of data processing to Windows Azure SQL Database or a SQL Server instance for reporting or analysis.
- Data transfer, data cleansing, and ETL: Generally, the target of the ETL processes is a relational database, and Sqoop may be the mechanism that is used to load the transformed data into the target database.
- Basic data warehouse or commodity storage: When using HDInsight as a data warehouse for Big Data analysis the data is generally accessed directly in Hive tables, making transfer to a database using Sqoop unnecessary.
- Integration with Enterprise BI: When you want to integrate the results of HDInsight processing with an enterprise BI solution at the data warehouse level, you can use Sqoop to transfer data from HDInsight to the data warehouse tables, or more commonly to staging tables from which it will be loaded into the data warehouse. 
However, if network connectivity between HDInsight and the target database is not possible you may need to consider an alternative technique to transfer the data, such as SQL Server Integration Services.

SQL Server Integration Services

SQL Server Integration Services (SSIS) provides a flexible platform for building ETL solutions that transfer data between a wide range of data sources and destinations while applying transformation, validation, and data cleansing operations to the data as it passes through a data flow pipeline. SSIS is a good choice for transferring data from HDInsight to SQL Server when network security policies make it impossible to use Sqoop, or when you must perform complex transformations on the data as part of the import process. We use SQL Server Integration Services to ensure that the data we load into our BI system is valid. This includes data cleansing operations such as changing multiple values to a standard leading value so that queries against the data provide the correct results. Appendix B of this guide explains some of the areas where data cleansing is useful, if not vital. To transfer data from HDInsight to SQL Server using SSIS you can create an SSIS package that includes a data flow task. The data flow task minimally consists of a source, which is used to extract the data from the HDInsight cluster, and a destination, which is used to load the data into a SQL Server database. The data flow might also include one or more transformations, which apply specific changes to the data as it flows from the source to the destination. The source used to extract data from HDInsight can be an ODBC source that uses a HiveQL query to retrieve data from Hive tables, or a custom source that programmatically downloads data from files in the Windows Azure blob storage that hosts the HDFS folders used by the HDInsight cluster. 
Figure 15 shows an SSIS data flow that uses an ODBC source to extract data from Hive tables and then applies a transformation to convert the data types of the columns returned by the query, before loading the transformed data into a table in a SQL Server database.
Figure 15 Using SSIS to transfer data from HDInsight to SQL Server

The following table describes specific considerations for using SSIS in the HDInsight solution architectures described earlier in this guide.

- Standalone experimentation and visualization: For one-time analysis, or analysis by a small group of users, using SSIS can be a simple way to transfer the results of data processing to SQL Server for reporting or analysis.
- Data transfer, data cleansing, and ETL: Generally, the target of the ETL processes is a relational database. In some cases the HDInsight ETL process might be used to transform the data into a suitable structure, and then SSIS can be used to complete the ETL process by transferring the transformed data to SQL Server.
- Basic data warehouse or commodity storage: When using HDInsight as a data warehouse for Big Data analysis the data is generally accessed directly in Hive tables, making transfer to a database via SSIS unnecessary.
- Integration with Enterprise BI: SSIS is often used to implement ETL processes in an enterprise BI solution, so it's a natural choice when you need to extend the BI solution to include Big Data processing results from HDInsight. SSIS is particularly useful when you want to integrate data from HDInsight into an enterprise data warehouse, where you will typically use one SSIS package to extract the data from HDInsight into a staging table, and perhaps a different SSIS package to load the staged data in synchronization with data from other corporate sources.

Custom and Third Party Analysis and Visualization Tools

There are many third party tools for consuming data and generating analytics and visualizations that are not covered in this chapter. Useful lists of analysis and visualization tools can be found at Datavisualization.ch (see A Carefully Selected List of Recommended Tools) and at ComputerWorld (see 30+ free tools for data visualization and analysis). 
Using the ODBC driver for Hive allows us to collect the data we want to visualize and then display it using any of the myriad graphics libraries or charting frameworks that are available. We can also use Linq to query the data and build powerful and interactive custom analytical UIs and tools. You can also build your own custom UI. You might choose to use one of the many graphics libraries, such as D3.js, that are available for use in data visualization tools and web applications. The .NET Framework also contains many classes that make it easier to create attractive graphical visualizations such as charts in your custom applications. To access the data required to build your UI content you can connect to the output generated by HDInsight using the ODBC driver directly. However, developers working with the .NET Framework often use the Language Integrated Query (Linq) technology to work with data. The Microsoft .NET SDK for Hadoop includes the HiveConnection class, which enables you to write code that uses a Linq query to extract data from Hive. You can use this class to build custom applications that consume data from Hive tables in an HDInsight cluster. Figure 16 illustrates this feature of the SDK.
Figure 16 Using LinqToHive to build custom visualizations from HDInsight data

For information about using LinqToHive see Microsoft .NET SDK For Hadoop on CodePlex.

Summary

This chapter described a variety of technologies and techniques that you can use to consume data from HDInsight and use it to perform analysis or reporting. It described the many options available using Microsoft Excel, SQL Server Reporting Services and SQL Azure Reporting, SQL Server Analysis Services, and loading the data into a SQL Server database to make it available to a range of other tools. This chapter concludes the exploration of the main loading, querying, and analysis techniques for use with HDInsight. In the next chapter you will see how you can combine many of these techniques, and explore other tools and frameworks that help you to build complete end-to-end solutions around Windows Azure HDInsight.

More Information

For more information about HDInsight, see the Windows Azure HDInsight web page. For more information about the tools and add-ins for Excel see the Microsoft Excel website. For more information about the Microsoft .NET SDK for Hadoop, see the CodePlex project site at http://hadoopsdk.codeplex.com/.
Chapter 6: Debugging, Testing, and Automating HDInsight Solutions

Now that you've seen how HDInsight works and how you can use it in a range of different ways, this chapter explores two topics that will help you to get the most from HDInsight. As with all development tasks you are likely to need to debug and test your code, especially if you are writing custom map/reduce components. In addition, after you find an HDInsight solution that provides useful information, you will want to integrate the process with your existing business systems by automating some or all of the stages, or by incorporating a workflow.

Challenges

HDInsight solutions are not usually code-heavy. Often you will use high-level tools such as Pig and Hive to create scripts that are executed by Hadoop. Techniques for debugging and testing these scripts are generally quite rudimentary. However, if you write complex map/reduce code and you are familiar with working with Java you may want to apply more comprehensive testing to your solution. For example, you may need to resolve issues with parallel execution of map and reduce components in a multi-node cluster, or view runtime information and intermediate values. Many scenarios for using a Big Data solution such as HDInsight will focus on exploring data, perhaps from newly discovered sources, and then iteratively refining the queries and transformations used to find insights within that data. After you discover questions that provide useful and valid information from the data, you will probably want to explore how you can automate the process. For example, you may want to make it repeatable without requiring manual interaction every time, and perhaps execute it automatically on a schedule.
If you are integrating HDInsight with your existing business processes you will probably want to automate the solution to avoid the need for repeated manual interaction. This approach can also help to minimize errors, and by setting permissions on the client applications you can also limit access so that only authorized users can execute them.
HDInsight supports several technologies and techniques that you can use instead of the manual interaction methods you have seen described in this guide. Several of these automation technologies are used in the scenario examples later in this guide. In this chapter you will see the most common techniques and tools for job automation, and advice on choosing and implementing the most appropriate ones for your own situation.
The chapter covers the following topics:
- Debugging and testing HDInsight solutions.
- Executing HDInsight jobs remotely using the .NET SDK for Hadoop.
- Creating HDInsight automated workflows using Oozie, SQL Server Integration Services, third-party frameworks, and custom tools.
- Building end-to-end solutions that use HDInsight.
Debugging and Testing HDInsight Solutions

Debugging and testing an HDInsight solution is more difficult than debugging and testing a typical local application. Executable applications that run on the local development machine, and web applications that run on a local development web server such as the one built into Visual Studio, can easily be run in debug mode within the integrated development environment. Developers use this technique to step through the code as it executes, view variable values and call stacks, monitor procedure calls, and much more. None of these functions are available when running code in a remote HDInsight cluster on Windows Azure. However, there are some debugging techniques you can apply. This section contains information that will help you to understand how to go about debugging and testing when developing HDInsight solutions. Subsequent sections of this chapter discuss the following topics in more detail:
- Debugging by writing out significant values or messages during execution. You can add extra statements or instructions to your scripts or components to display the values of variables, export datasets, write messages, or increment counters at significant points during the execution.
- Obtaining debugging information from log files. You can monitor log files and standard error files for evidence that will help you locate failures or problems.
- Using a single-node cluster for testing and debugging. Running the solution in a local or remote single-node cluster can help to isolate issues with parallel execution of mappers and reducers.
Performing runtime debugging and single-stepping through the code in map and reduce components is not possible within the Hadoop environment. If you want to perform this type of debugging, or run unit tests on components, you can create an application that executes the components locally, outside of Hadoop, in a development environment that supports debugging; and use a mocking framework and test runner to perform unit tests.
However, Hadoop jobs may fail for reasons other than an error in the scripts or code. The two primary reasons are timeouts and bad input data. By default, Hadoop will abandon a job if it does not report its status or perform I/O activity every ten minutes. Most jobs will do this automatically, but some processor-intensive tasks may take longer. You can ensure that a map/reduce job written in Java is kept alive by calling methods of the TaskReporter instance such as setStatus and progress (see TaskReporter class on the Apache website), or by incrementing a counter. Another technique that you can apply to all types of jobs is to increase the value of the mapred.task.timeout property. If a job fails due to bad input data that the map and reduce components cannot handle, you can instruct Hadoop to skip bad records. While this may affect the validity of the output, skipping small volumes of the input data may be acceptable. For more information, see the section Skipping Bad Records in the MapReduce Tutorial on the Apache website.

Writing Out Significant Values or Messages during Execution

A traditional debugging technique is to simply output values as a program executes to indicate progress. It allows developers to check that the program is executing as expected, and helps to isolate errors in the code. You can use this technique in HDInsight in several ways:
- If you are using Hive you might be able to split a complex script into separate simpler scripts, and display the intermediate datasets to help locate the source of the error.
- If you are using a Pig script you can:
  - Dump messages and/or intermediate datasets generated by the script to disk before executing the next command. This can indicate where the error occurs, and provide a sample of the data for you to examine.
  - Call methods of the EvalFunc class, which is the base class for most evaluation functions in Pig. 
You can use this approach to generate heartbeat and progress messages that prevent timeouts during execution, and to write to the standard log file. See Class EvalFunc<T> on the Pig website for more information.

If you are using custom map and reduce components you can write debugging messages to the standard output file from the mapper and then aggregate them in the reducer, or generate status messages from within the components. For example:
- You can write messages to the standard error (stderr) log file, including the values of significant variables or execution information that can assist debugging. If you include error handling code in your components, use this approach to write messages to the standard error output. A standard error file is created on each node, so you may need to check all of the nodes to find the problem.
- You can use the setStatus method of the TaskReporter class to update the task status using a string message; the progress method to update the percentage complete for a job; or the incrCounter method to increment a custom or built-in counter value. When you use a counter, the result is aggregated from all of the nodes.
- If the problem arises only occasionally, or on only one node, it may be due to bad input data. If skipping bad records is not appropriate, add code to your mapper class that validates the input data and reports any errors while attempting to parse or manipulate it. Write the details and an extract of the data that caused the problem to the standard error file.

You can instruct Hadoop to store a representative sample of data from executing a job in the log file folder by setting the mapred.task.profile, mapred.task.profile.maps, and mapred.task.profile.reduces configuration properties for a job. See Profiling on the Apache Hadoop website for more information.

You can kill a task while it is operating and view the call stack and other useful information such as deadlocks. To do this, execute the command kill -QUIT [job_id]. 
The job ID can be found in the Job Tracker portal (accessed on port 50030 of your cluster's head node). The debugging information is written to the standard output (stdout) file.
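For jobs that use Hadoop Streaming (the model many non-Java HDInsight components follow), the stderr and counter techniques above look like the following sketch. The record format, counter group, and counter name are illustrative assumptions; the "reporter:counter:group,counter,amount" convention on stderr is standard Hadoop Streaming behavior.

```shell
# A minimal Hadoop Streaming-style mapper, written as a shell function for
# illustration. Valid (comma-delimited) records pass through to stdout;
# anything else is logged to stderr, and a custom counter is incremented
# via the Streaming "reporter:counter:group,counter,amount" convention.
mapper() {
  while IFS= read -r line; do
    case "$line" in
      *,*)                                       # crude validity check
        printf '%s\n' "$line" ;;                 # valid record: emit to stdout
      *)
        printf 'Bad record skipped: %s\n' "$line" >&2
        printf 'reporter:counter:Debug,BadRecords,1\n' >&2 ;;
    esac
  done
}

# Local smoke test; in the cluster, Hadoop supplies stdin and collects stdout.
# Prints the two valid records (12.3,temp and 4.5,pressure) to stdout.
printf '12.3,temp\nnot-a-record\n4.5,pressure\n' | mapper 2> stderr.log
```

Because a separate stderr file is captured for each task attempt, the "Bad record" messages appear in the per-node task logs, while the BadRecords counter is aggregated across all nodes and visible in the Job Tracker portal.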
The Hadoop engine generates a wealth of runtime information as it processes jobs. Before you start modifying scripts or code to find errors, take advantage of this information using the Job Tracker and the Health and Status Monitor portals on the Hadoop cluster.

Obtaining Debugging Information from Log Files

The core Hadoop engine in HDInsight generates a range of information in log files, counters, and status messages that is useful for debugging and testing the performance of your solutions. Much of this information is accessible through the Job Tracker portal (accessed on port 50030 of your cluster's head node in a Remote Desktop session) and the Health and Status Monitor (accessed on port 50070). For more details of using the Job Tracker portal and the Health and Status Monitor, see the section Accessing Performance Information in Chapter 4 of this guide. The following list contains some suggestions to help you obtain debugging information from the log files generated by HDInsight:
- Use the Task page of the Job Tracker portal to view the number of errors. The Task Details page shows the errors, and provides links to the log files and the values of custom and built-in counters.
- View the history, job configuration, syslog, and other log files directly. For more information, see the section Accessing Hadoop Log Files Directly in Chapter 7.
- Change the default settings of the Apache log4j mechanism that generates log files to show additional information, or to change the logging behavior as required. The configuration file named log4j.properties is in the conf\ folder of the HDInsight installation, and you can access it using a Remote Desktop connection.
- Run a debug information script automatically to analyze the contents of the standard error, standard output, job configuration, and syslog files. Set the properties mapred.map.task.debug.script and mapred.reduce.task.debug.script to specify the scripts to run over the output from the map and reduce tasks.
For more information about running debug scripts, see How to Debug Map/Reduce Programs and the Debugging section of the MapReduce Tutorial on the Apache website.

Using a Single-node Local Cluster for Testing and Debugging

Sometimes errors can occur due to the parallel execution of multiple map and reduce components on a multi-node cluster. One way to isolate this problem is to execute the solution on a single-node cluster. You can create a single-node HDInsight cluster in Windows Azure by specifying the advanced option when creating the cluster. Alternatively you can install the single-node development environment on your local computer and execute the solution there. You can download the single-node development version of HDInsight from the Microsoft Download Center. By using a single-node local cluster you can rerun failed jobs and adjust the input data, or use smaller datasets, to help you isolate the problem. Set the keep.failed.task.files configuration property to true to instruct the data node to retain failed task files when a job fails. These files may assist you in locating errors in the code. To rerun the job go to the \taskTracker\task-id\work folder and execute the command % hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml. This runs the failed task in a single Java virtual machine over the same input data. You can also turn on debugging in the Java virtual machine to monitor execution. More details of configuring the parameters of a Java virtual machine can be found on several websites, including Java virtual machine settings on the IBM website.

Executing HDInsight Jobs Remotely

Previous chapters in this guide have described how to execute map/reduce jobs on an HDInsight cluster using custom map/reduce code, Pig, and Hive. While you can execute these jobs by using the interactive console in the HDInsight management portal or in a remote desktop session with the cluster, a common requirement is to initiate and manage jobs remotely. 
Instantiating jobs remotely can be achieved in a range of ways. For example, you can:
- Use the WebHCatalog API in the .NET SDK for Hadoop to interact with HCatalog (which manages metadata and can execute jobs). This is a good option when you want to:
  - Execute jobs in the cluster using code running on a client machine.
  - Take advantage of the abstractions provided in the SDK to simplify the code.
  - Use the built-in connection-retry mechanism in the SDK classes to minimize the chance of connectivity issues to the cluster.
  - Build end-to-end applications that load, process, and extract data and generate reports or visualizations automatically.
- Use scripts or batch files that run on the name node of the cluster to execute tasks at the command line. This is a good option when you want to:
  - Quickly create tasks that automate jobs within the cluster.
  - Take advantage of the wide range of tools and features installed with HDInsight.
  - Run jobs on a schedule, perhaps by using a Windows scheduled task.
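As a sketch of the second option, the following generates a batch file that could be run on the cluster's head node by a Windows scheduled task. The HiveQL query, table name, and file names are illustrative assumptions, not part of the guide's examples.

```shell
# Generate a batch file that runs a Hive job from the command line on the
# head node. The query and names are examples only; you would register
# run_daily_summary.cmd with the Windows Task Scheduler on the cluster.
cat > run_daily_summary.cmd <<'EOF'
@echo off
rem Execute a HiveQL query and save the results for downstream reporting.
hive -e "SELECT obs_date, AVG(temperature) FROM observations GROUP BY obs_date;" > daily_summary.txt 2> daily_summary.log
EOF
cat run_daily_summary.cmd
```

Redirecting stderr to a log file gives the scheduled task a record of each run that you can inspect if a job fails overnight.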
You will need to use a Remote Desktop connection to communicate with the cluster and administer the processes when you use scripts and batch files that run on the cluster.

Using WebHCatalog to Execute Jobs Remotely

HCatalog provides an abstraction layer between jobs and file storage, and can be used to share common tabular metadata across map/reduce code, and Pig and Hive commands. HCatalog provides a REST-based API named WebHCatalog for executing commands remotely, and this API is the primary means for initiating data processing jobs from a client application. WebHCatalog is often referred to by its project name, Templeton. For more information about using WebHCatalog see Templeton on the Hortonworks website.

The WebHCatalog API supports GET, PUT, and POST requests to execute commands that create or manipulate data resources on the cluster. You can make these requests directly in a browser by entering the appropriate URL, but more commonly you will create client applications that use the API by making HTTP requests to the cluster. Each request is addressed to the appropriate templeton/v1 endpoint in the HDInsight cluster.

When you are using the default security settings for HDInsight the request must include a user.name parameter that specifies the user name to be used to authenticate the request. If the request is made interactively over a connection with which no authentication credentials are currently associated, you are prompted to provide the password. When using the REST API programmatically you can associate credentials with the connection before making the first request.

In addition to the user.name parameter, specific commands require their own parameters based on the operation being performed. Some commands also require a POST request, which you can generate using a tool such as cURL. For example, the following POST request can be used to store the results of a HiveQL query in a file.
POST Request
curl -d user.name=your-user-name -d execute="select * from hivesampletable;"
  -d statusdir="hivejob.output" -u your-user-name:your-password
  -k "https://your-cluster.azurehdinsight.net:563/templeton/v1/hive"

Requests that start a map/reduce job return a response containing a job ID. You can use this job ID to track the status of long running jobs, as shown in the following GET request.

GET URL
https://your-cluster.azurehdinsight.net:563/templeton/v1/queue/job_id?user.name=your-user-name

You'll need to enter these commands on a single line, and replace your-user-name, your-password, and your-cluster with the values for your HDInsight cluster.

Using WebHCat from the .NET SDK for Hadoop

In previous chapters of this guide you have seen how the .NET SDK for Hadoop provides many features that make working with HDInsight much easier than developing your own custom code. The SDK contains features to help you interact with storage, Oozie workflows, Ambari monitoring, and HCatalog; and for working with map/reduce code and scripts. The SDK also includes the WebHCat client (the WebHCatHttpClient class), which provides a wrapper around the WebHCatalog REST API. By using the WebHCat client you can implement managed code that executes commands on a remote HDInsight cluster much more easily than by writing your own code to construct and execute explicit HTTP requests.

For .NET developers, the SDK for Hadoop is the ultimate toolbox for accomplishing a wide range of tasks. It makes most interactions with HDInsight much easier than developing your own library of functions.

Figure 1 shows an overview of the WebHCat client within the context of the .NET SDK for Hadoop.
Figure 1
The WebHCat client in the .NET SDK for Hadoop

To use the WebHCat client you must use NuGet to import the Microsoft.Hadoop.WebClient package. You can do this by executing the following command in the Package Manager PowerShell console.

PowerShell
install-package Microsoft.Hadoop.WebClient

With the package and all its dependencies imported, you can create a custom application that automates WebHCatalog tasks in the cluster. For example, the following code shows how you can use the WebHCatHttpClient class to load data that you have already uploaded into Windows Azure blob storage for the cluster (perhaps by using the WebHDFSClient class) into a table.

C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Hadoop.WebHCat.Protocol;
namespace MyWebHCatApp
{
  public class Program
  {
    private const string HadoopUser = "your-user-name";
    private const string HadoopUserPassword = "your-password";
    private const string Uri = "https://your-cluster.azurehdinsight.net:563";

    public static void Main(string[] args)
    {
      // call async function to load data
      LoadTable().Wait();

      Console.WriteLine("\nPress any key to continue.");
      Console.ReadKey();
    }

    private static async Task LoadTable()
    {
      // Create the WebHCat client for the cluster
      WebHCatHttpClient webHCat = new WebHCatHttpClient(
        new System.Uri(Uri), HadoopUser, HadoopUserPassword);

      // Load data into an existing table
      Console.WriteLine("Load data into the 'MyData' table...");
      string command = @"LOAD DATA INPATH '/mydata/data.txt' INTO TABLE MyData;";
      var job = await webHCat.CreateHiveJob(command, null, null, "LoadMyData", null);

      // get the Job ID and return, leaving the job running
      var result = await job.Content.ReadAsStringAsync();
      Console.WriteLine("Job ID: " + result);
    }
  }
}

For an example of using the WebHDFSClient class to upload a file to Windows Azure blob storage for a cluster, see the section Using the WebHDFSClient in the Microsoft .NET SDK for Hadoop in Chapter 3 of this guide.

Creating and Using Automated Workflows

Many data processing solutions require the coordinated processing of multiple jobs, often with a conditional workflow. For example, consider a solution in which data files containing web server log entries are uploaded each day, and must be parsed to load the data they contain into a Hive table. A workflow to accomplish this might consist of the following steps:

1. Insert data from the files located in the /uploads folder into the Hive table.
2. Delete the source files, which are no longer required.
This workflow is relatively simple but could be made more complex by adding some conditional tasks; for example:

1. If there are no files in the /uploads folder, go to step 5.
2. Insert data from the files into the Hive table.
3. Delete the source files, which are no longer required.
4. Send an email message to an operator indicating success, and stop.
5. Send an email message to an operator indicating failure.
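To make the branching concrete, here is a minimal local sketch of this conditional workflow as a shell script. A temporary local folder stands in for the /uploads folder in HDFS, and echo statements stand in for the Hive LOAD DATA step and the operator email notifications; it illustrates only the control flow, not real cluster operations.

```shell
# Local sketch of the conditional workflow; a temp folder stands in for /uploads.
UPLOADS="$(mktemp -d)"
touch "${UPLOADS}/log_2013-05-16.txt"   # simulate an uploaded log file

if [ -z "$(ls -A "${UPLOADS}")" ]; then
  # Step 5: no files found, notify the operator of failure
  RESULT="failure"
else
  # Step 2: load the files into the Hive table (LOAD DATA INPATH ... INTO TABLE ...)
  echo "Loading files from ${UPLOADS} into the Hive table"
  # Step 3: delete the source files, which are no longer required
  rm -f "${UPLOADS}"/*
  # Step 4: notify the operator of success
  RESULT="success"
fi
echo "${RESULT}"
```

In a real solution the branch bodies would submit Hive jobs and delete files in HDFS; the point of the sketch is only that the empty-folder test decides which notification path runs.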
Implementing these kinds of workflows is possible in a range of ways. For example, you could:

- Use the Oozie framework that is installed with HDInsight. This is a good option when:
  - You are familiar with the syntax and usage of Oozie.
  - You are prepared to use a Remote Desktop connection to communicate with the cluster to administer the processes.
- Use the Oozie client in the .NET SDK for Hadoop to execute Oozie workflows. This is a good option when:
  - You are familiar with the syntax and usage of Oozie.
  - You want to execute workflows from within a program running on a client computer.
  - You are familiar with .NET and prepared to write programs that use the .NET Framework.
- Use SQL Server Integration Services (SSIS) or a similar integration framework. This is a good option when:
  - You have SQL Server installed and are experienced with writing SSIS workflows.
  - You want to take advantage of the powerful capabilities of SSIS workflows. The process for creating SSIS workflows is described in more detail in Scenario 4, BI Integration, and in Appendix A of this guide.
- Use the Cascading abstraction layer software. This is a good choice when:
  - You want to execute complex data processing workflows written in any language that runs on the Java virtual machine.
  - You have complex multi-level workflows that you need to combine into a single task.
  - You want to control the execution of the map and reduce phases of jobs directly in code.
- Use a third party framework such as Hamake or Azkaban. This is a good option when:
  - You are familiar with these tools.
  - They offer a capability that is not available in other tools.
- Create a custom application or script that executes the tasks as a workflow. This is a good option when:
  - You need a fairly simple workflow that can be expressed using your chosen programming or scripting language.
  - You want to run scripts on a schedule, perhaps driven by Windows Scheduled Tasks.
  - You are prepared to use a Remote Desktop connection to communicate with the cluster to administer the processes.
Creating Workflows with Oozie

Oozie is the most commonly used mechanism for workflow development in Hadoop. It is a tool that enables you to create repeatable, dynamic workflows for tasks to be performed in an HDInsight cluster. The tasks themselves are specified in a control dependency directed acyclic graph (DAG), which is stored in a Hadoop Process Definition Language (hPDL) file named workflow.xml in a folder in the HDFS file system on the cluster. For example, the following hPDL document contains a DAG in which a single step (or action) is defined.

hPDL
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
  <start to="hive-node"/>
  <action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>default</value>
        </property>
        <property>
          <name>oozie.hive.defaults</name>
          <value>hive-default.xml</value>
        </property>
      </configuration>
      <script>script.q</script>
      <param>INPUT_TABLE=HiveSampleTable</param>
      <param>OUTPUT=/results/sampledata</param>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

The action itself is a Hive job that is defined in a HiveQL script file named script.q, and has two parameters named INPUT_TABLE and OUTPUT. The code in script.q is shown in the following example.

HiveQL
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM ${INPUT_TABLE}

The script file is stored in the same folder as the workflow.xml hPDL file, along with a standard file for Hive jobs named hive-default.xml. A configuration file named job.properties is then stored on the local file system of the computer on which the Oozie client tools are installed. This file contains the settings that will be used to execute the job, as shown in the following example.
job.properties
nameNode=asv://my_container@my_asv_account.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/workflowfiles/

To initiate the workflow the following command is executed on the computer where the Oozie client tools are installed.

Command Line
oozie job -oozie http://localhost:11000/oozie/ -config c:\scripts\job.properties run

When Oozie starts the workflow it returns a job ID in the format 0000001-123456789123456-oozie-hdp-W. You can check the status of a job by opening a Remote Desktop connection to the cluster and using a web browser to navigate to http://localhost:11000/oozie/v0/job/job-id?show=log.

If you are familiar with .NET programming techniques, the easiest way to use Oozie is through a client application that uses the .NET SDK for Hadoop. This approach also makes it easy to schedule jobs using Windows utilities or Scheduled Tasks.

Using Oozie from the .NET SDK for Hadoop

The .NET SDK for Hadoop includes an OozieHttpClient class that you can use to implement managed code that initiates and manages Oozie workflows. Figure 2 shows an overview of this class within the context of the .NET SDK for Hadoop.
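As a small illustration of the pattern just described, the status URL can be assembled from the job ID that Oozie returns. The job ID below is the example format shown above, not a real job, and the script only constructs and prints the URL.

```shell
# Compose the Oozie job status URL from a returned job ID.
# The ID below follows the example format, not a real running job.
JOB_ID="0000001-123456789123456-oozie-hdp-W"
STATUS_URL="http://localhost:11000/oozie/v0/job/${JOB_ID}?show=log"
echo "${STATUS_URL}"
```

In a monitoring script you could capture the ID from the output of the oozie job command and poll this URL from within the Remote Desktop session.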
Figure 2
The Oozie client in the .NET SDK for Hadoop

To use the Oozie client you must use NuGet to import the Microsoft.Hadoop.WebClient package. You can do this by executing the following command in the Package Manager PowerShell console.

PowerShell
install-package Microsoft.Hadoop.WebClient

After the package and all its dependencies are installed you can use the OozieHttpClient class to initiate workflow jobs and check job status, as shown in the following code sample.

C#
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Hadoop.WebClient.OozieClient;
using Microsoft.Hadoop.WebClient.OozieClient.Contracts;
using Newtonsoft.Json;
namespace OozieApp
{
  public class Program
  {
    private const string HadoopUser = "your-user-name";
    private const string HadoopUserPassword = "your-password";
    private const string Uri = "https://your-cluster.azurehdinsight.net:563";
    private const string Asv = "asv://your-container@your-account.blob.core.windows.net";

    public static void Main(string[] args)
    {
      // call async function to start workflow
      StartWorkflow().Wait();
      Console.WriteLine("\nPress any key to continue.");
      Console.ReadKey();
    }

    private static async Task StartWorkflow()
    {
      OozieHttpClient client = new OozieHttpClient(
        new System.Uri(Uri), HadoopUser, HadoopUserPassword);

      // Submit a new job
      var prop = new OozieJobProperties(
        HadoopUser, Asv, "jobtrackerhost:9010",
        "/workflowfiles/", string.Empty, string.Empty);

      var newJob = await client.SubmitJob(prop.ToDictionary());
      var content = await newJob.Content.ReadAsStringAsync();
      Console.WriteLine(content);
      Console.WriteLine(newJob.StatusCode);

      // Get the job information
      var serializer = new JsonSerializer();
      dynamic json = serializer.Deserialize(
        new JsonTextReader(new StringReader(content)));
      string id = json.id;
      var jobInfo = await client.GetJobInfo(id);
      var jobStatus = await jobInfo.Content.ReadAsStringAsync();
      Console.WriteLine("Job Status:");
      Console.WriteLine(jobStatus);
    }
  }
}

You will see a demonstration of using an Oozie workflow in Scenario 2, Extract, Transform, Load, of this guide and in the associated Hands-on Lab.

Building End-to-End Solutions on HDInsight

Automation and workflows allow you to replace all of the manual interaction with HDInsight and Hadoop with completely automated end-to-end solutions. For example, you can use the .NET SDK for Hadoop and other tools described in this chapter to create a client program that:

- Uses an SSIS workflow to connect to the URL of a service that publishes updated data you want to analyze.
- Uses the storage client in the Windows Azure SDK to push the new data into Windows Azure blob storage.
- Creates and provisions a new HDInsight cluster based on this blob storage container using the PowerShell Cmdlets for HDInsight.
- Executes Hive scripts to create the tables that will store the results.
- Runs an Oozie workflow that executes Hive, Pig, or custom map/reduce jobs to analyze the data and summarize the results into the Hive tables.
- Uses an SSIS workflow to extract the data into a database and execute a report generator or update a dashboard.
- Deletes the cluster using the PowerShell Cmdlets for HDInsight to minimize ongoing costs.
The chapters later in this guide describe some real-world scenarios for using HDInsight. The appendices at the end of this guide demonstrate several of the automation tasks described in this chapter.

Summary

This chapter described techniques that you can use to debug, test, and automate HDInsight jobs, and define workflows for complex data processing scenarios.

All developers will, at some point, discover errors in their code and need to perform debugging and testing to resolve them. If you are familiar with working in Java you can complement testing by using mock objects and a test runner framework. This is typically most common when writing custom map/reduce code, though as you saw in this chapter there are also ways that you can apply simple debugging techniques to solutions that use map/reduce code, Pig, and Hive.

In addition, by using the tools and techniques described in this chapter, you can implement repeatable, complex data processing solutions that can easily be integrated with business applications or custom management tools. Once you have discovered a question that you can reliably and repeatedly answer using HDInsight, an automated solution can relieve you of the need to manually perform the processes you probably used during development.

More Information

For more information about HDInsight, see the Windows Azure HDInsight web page.

For more information about WebHCatalog, see the documentation at http://people.apache.org/~thejas/templeton_doc_latest/index.html.

For more information about Oozie, see the Apache project site for Oozie at http://oozie.apache.org/.

For more information about the Microsoft .NET SDK for Hadoop, see the CodePlex project site at http://hadoopsdk.codeplex.com/.
Chapter 7: Managing and Monitoring HDInsight

This chapter ties up some loose ends by looking at topics related to managing and monitoring HDInsight clusters. Windows Azure HDInsight makes it very easy to create and set up a cluster. However, behind this simplicity there are many operations and processes that HDInsight performs just to create the cluster, and many more involved in executing and monitoring jobs in the cluster.

Challenges

As with any remote service or application, managing and monitoring its operation may appear to be more difficult than for a locally installed equivalent. However, remote management and monitoring technologies are widely available, and are an accepted part of almost all administration tasks these days, even with services and applications running in a remote datacenter. In many cases the extension of these technologies to cloud-hosted services and applications is almost seamless. As an inherently remote environment, Windows Azure already has many remote administration features built in, incorporated into development tools such as Visual Studio, and available as add-ons from third parties; and there are plenty of tools that allow you to create your own custom administration utilities.

The deceptively simple management portal that you see when you navigate to HDInsight in your web browser hides the fact that creating a cluster automatically provisions a number of virtual machines that run the Hadoop core code, provisions a Windows Azure SQL Database to store the metadata for the cluster, and links the cluster with the Windows Azure blob storage container that holds the data. Many parts of this infrastructure are accessible remotely and allow you to administer the cluster configuration and operation, perform specific management tasks, and monitor the cluster behavior.
The following sections of this chapter explore the common management tasks and options in HDInsight:

- Administering HDInsight clusters and jobs.
- Configuring HDInsight clusters and jobs.
- Retaining data and metadata when recreating a cluster.
- Monitoring HDInsight clusters and jobs.
Administering HDInsight Clusters and Jobs

In addition to the Remote Desktop connection and the management portal, Hadoop includes portal pages for specific task areas of a cluster, and you can access these for your HDInsight cluster. For example, you can open a Remote Desktop connection and use the Map/Reduce Administration page of the job tracker host to view details of the allocation of tasks in your cluster, and you can use the Map/Reduce History Viewer page of the head node host to see information about individual jobs that the cluster executed.

Administering HDInsight involves using a range of different tools and approaches. Some are part of the underlying Hadoop ecosystem while many others are third party tools, SDKs, and (of course) custom scripts. In terms of managing HDInsight, the most common operations you are likely to carry out are:

- Working with the HDInsight management portal to perform basic administration tasks interactively.
- Interacting with the HDInsight service APIs to perform tasks such as creating and deleting clusters, working with the source and output files, and managing jobs. You can use a range of tools, scripts, and custom utilities to access the HDInsight service management API.
- Interacting with Windows Azure using scripts, the Windows Azure PowerShell Cmdlets, or custom utilities to perform tasks not specific to HDInsight, such as uploading and downloading data.
Using the HDInsight Cluster Management Portal

The primary point of interaction with HDInsight is the web-based cluster management portal. This provides a range of one-click options for accessing different features of HDInsight. The main Windows Azure Management Portal contains a section for HDInsight that lists all of your HDInsight clusters. Selecting a cluster displays a summary of the details of the cluster, and a rudimentary monitoring section that shows the activity of the cluster over the previous one or four hours.

To open the management portal for a cluster you can either use the Manage link at the bottom of the dashboard window, or navigate directly to the cluster using the URL https://your-cluster-name.azurehdinsight.net/. You must sign in using the credentials you provided when you created the cluster. In terms of administration tasks, at the time of writing the management portal provides links to:

- The Interactive Console where you can execute queries and commands. The interactive console contains the JavaScript console where you can execute administration tasks such as managing files in the cluster (listing, deleting, and viewing the files), and executing Hadoop management commands.
- A Remote Desktop connection to the name node of the cluster that allows you to view the built-in Hadoop management websites, interact with the file system, and manage the operating system.
- The Monitor Cluster and Job History pages. These are described in more detail in the section Monitoring HDInsight Clusters and Jobs later in this chapter.

Figure 1 shows the HDInsight cluster management portal main page.
Figure 1 The HDInsight cluster management portal main page
Using the Management Portal Interactive Console

The Interactive Console link on the main HDInsight management portal opens a window where you can select either Hive or JavaScript. The JavaScript console can be used to execute a range of commands related to administering HDInsight. For example, you can list current jobs, evaluate expressions, show the contents of objects, show global variables, and get the results of previous jobs. To see a list of all of the available JavaScript commands, type help() into the JavaScript console. To execute an HDFS or Hadoop command in the JavaScript console, precede it with the hash (#) character.
You can also type commands that are sent directly to Hadoop for execution within the core Hadoop environment. For example, you can list and copy files, interact with jobs, view information about queues, manage logging settings, and run other utilities.

Using a Remote Desktop Connection

The Remote Desktop link in the main HDInsight management portal downloads a Remote Desktop Protocol (RDP) link that you can open to connect to the name node of the cluster. You must provide the administrator user name and password to connect, and after successful authentication you can work with the server in the same way as you would with a local computer. On the desktop of the name node computer is a link named Hadoop Command Line that opens a command window on the main folder where Hadoop is installed. You can then run the tools and utilities included with HDInsight, send commands to Hadoop, and submit jobs to the cluster. Figure 2 shows the Hadoop command line in a Remote Desktop session.
Figure 2
The Hadoop command line in a Remote Desktop session

For a full list and description of the Hadoop management tools and utilities, see the Commands Manual on the Apache Hadoop website.

Using the HDInsight Service Management API

Windows Azure exposes a series of management APIs that you can use to interact with most parts of the remote infrastructure. The service management API for Windows Azure HDInsight allows you to remotely execute many of the tasks that you can carry out through the management portal web interface, and several other tasks that are not available in the web interface. These tasks include the following:

- Listing, deleting, and creating HDInsight clusters.
- Submitting jobs to HDInsight.
- Executing Hadoop management tools and utilities to create archives, copy files, set logging levels, rebalance the cluster, and more.
- Managing workflows that are executed by Oozie.
- Managing catalog data used by HCatalog.
- Accessing monitoring information from Ambari.
See the section Executing HDInsight Jobs Remotely in Chapter 6 for information on how to submit jobs remotely to an HDInsight cluster. See the section Monitoring HDInsight Clusters and Jobs later in this chapter for more details of how you can monitor an HDInsight cluster.

HDInsight includes WebHCat, which is sometimes referred to by the original project name Templeton. WebHCat provides access to the HDInsight service management API on port 563 of the cluster's name node, and exposes it as a REST interface. Therefore, you must submit appropriately formatted HTTPS requests to WebHCat to carry out management tasks. For example, to access Oozie on the HDInsight cluster you could send the following GET request:

https://your-cluster.azurehdinsight.net:563/oozie/v1/admin/

To view the Hadoop job queue you could send the following GET request:

https://your-cluster.azurehdinsight.net:563/templeton/v1/queue/

Creating REST URLs and Commands Directly

It's possible to execute both GET and POST commands by using a suitable utility or web browser plugin. For example, you might decide to use:

- The Simple Windows Azure REST API Sample Tool available from Microsoft.
- The Postman add-in for the Google Chrome browser.
- The RESTClient add-in for the Mozilla Firefox browser.

You can also use a utility such as cURL, which is available for a wide range of platforms including Windows, to help you construct REST calls directly.

Using Code to Access the REST APIs

You can make calls to WebHCat and the service management API from within your own programs and scripts. Using programs or scripts, perhaps with optional or mandatory parameters that are supplied at runtime, can simplify management tasks and make them more repeatable with less chance of error. Programs or scripts can be written using the .NET Framework with any .NET language, in PowerShell, or using any of the REST client libraries that are available.
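As a minimal sketch of how such a script might parameterize these endpoints, the following assembles the request URLs from the cluster and user names. The values are placeholders, and the script only builds and prints the URLs; it does not call the cluster.

```shell
# Placeholder values; replace with your own cluster and user name.
CLUSTER="your-cluster"
HADOOP_USER="your-user-name"
BASE="https://${CLUSTER}.azurehdinsight.net:563"

# WebHCat (Templeton) job queue endpoint; with the default security settings
# the user.name parameter must be included in the request.
QUEUE_URL="${BASE}/templeton/v1/queue/?user.name=${HADOOP_USER}"

# Oozie administration endpoint exposed through the same port.
OOZIE_URL="${BASE}/oozie/v1/admin/"

echo "${QUEUE_URL}"
echo "${OOZIE_URL}"
```

A real script would pass these URLs to a tool such as cURL with the -u option to supply the credentials.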
The .NET SDK for Hadoop provides several tools and utilities that make it much easier to work with HDInsight; not only for administration tasks but for developing applications as well.

As discussed in previous chapters, Microsoft provides the .NET SDK for Hadoop, which (amongst other things) is a library of tools that make it easy to interact with the service management API. Figure 3 shows an overview of the integration of the .NET SDK for Hadoop with HDInsight and the underlying Hadoop environment. At the time of writing the programmatic cluster management feature was not complete.
Figure 3
Overview of the .NET SDK for Hadoop and its integration with HDInsight

The SDK includes the WebClient library that is designed to make it easy to interact with HDInsight. This library contains client libraries for use with the HDInsight administration interfaces for Oozie, HDFS, Ambari, and WebHCat. The WebHDFS client library works with files in HDFS and in Windows Azure blob storage. The WebHCat client library manages the scheduling and execution of jobs in an HDInsight cluster, as well as providing access to the HCatalog metadata. It also includes tools to make it easier to manipulate the results of a query using the familiar LINQ programming style, and a set of helpers that simplify writing map/reduce components in .NET languages.
See the section Using the Hadoop Streaming Interface in Chapter 4 for more information about writing map/reduce components in .NET languages.

You can download and install the .NET SDK for Hadoop using the NuGet Package Manager in Visual Studio. You simply execute the following commands to add the libraries and references to the current Visual Studio project:

Visual Studio Command Line
install-package Microsoft.Hadoop.MapReduce
install-package Microsoft.Hadoop.Hive
install-package Microsoft.Hadoop.WebClient
install-package Microsoft.WindowsAzure.Management.HDInsight
After installation you can use the classes in the SDK in your code to administer an HDInsight cluster; for example, you can:

- Use the methods of the WebHDFSClient class to set properties of HDFS; create, delete, open, write to, and get the status of files; and create, delete, rename, and access directories.
- Use the methods of the WebHCatHttpClient class to create, delete, and get the status of Hive, Pig, and streaming map/reduce jobs.
- Use the methods of the OozieHttpClient class to submit, start, kill, suspend, resume, and monitor Oozie jobs.
- Use the methods of the AmbariClient class to retrieve basic information about the cluster, statistics for map/reduce jobs, and storage bandwidth usage.
- Use the methods of the ClusterProvisioningClient class to create, list, and delete clusters.
The SDK also includes a set of PowerShell cmdlets for provisioning, listing, deleting, and managing HDInsight clusters. You can use these in your scripts to perform tasks through the HDInsight service management API.
For more information, see Using the Hadoop .NET SDK with the HDInsight Service on the Windows Azure website and the Documentation page of the .NET SDK for Hadoop on Codeplex.
Interacting with Windows Azure Blob Storage

Many administrative tasks may require access to the Azure Storage Vault (ASV) in Windows Azure blob storage that HDInsight uses. For example, you may need to upload or download files, or manage the files stored there. You can use the JavaScript console, the Hadoop command line, the Windows Azure PowerShell Cmdlets, or any of the many standalone utilities available for accessing Windows Azure storage; or you can use a client library that makes it easier to access storage programmatically if you want to automate the operations.

You can also use Visual Studio's Server Explorer to access a Windows Azure storage account, though other third party tools such as Windows Azure Storage Explorer provide additional functionality and may be a better choice for interactive management of the HDInsight ASV.

If you need to include access to the ASV in a script or program you can use the features of the .NET SDK for Hadoop, which is described in the previous section of this chapter. You can also use the sample BlobManager library package provided with this guide as a starting point to build a custom upload utility. This package allows you to use up to 16 threads to upload the data in parallel, break down the files into 4 MB blocks for optimization in HDFS, and merge blocks of less than 4 MB into a single file. You can also integrate the package with tools and frameworks such as SQL Server Integration Services.

To obtain the BlobManager library, download the Hands-on Labs associated with this guide from http://wag.codeplex.com/releases/view/103405. The source code for the utility is in the Lab 4 - Enterprise BI Integration\Labfiles\BlobUploader subfolder of the examples.

For more information about working with Windows Azure blob storage and HDInsight ASV, see How to Upload Data to HDInsight on the Windows Azure website, and Chapter 3 of this guide.
Configuring HDInsight Clusters and Jobs

Setting up and using a Windows Azure HDInsight cluster is easy using the Windows Azure Management Portal. There are just a few options to set, such as the cluster name and the storage account, and the cluster can be used without requiring any other setup or configuration. However, the Hadoop configuration is extremely flexible and wide ranging, and can be used to ensure that your solution is reliable, secure, and efficient.

The configuration mechanism allows configuration settings (or configurable properties) to be specified at the cluster level, at the individual node (site) level, and for individual jobs that are submitted to the cluster for processing. There are also configuration properties for each of the tools (or daemons) such as HDFS, the map/reduce engine, Hive, Oozie, and more. You can adjust the configuration settings after you have created a cluster in a number of ways:

- You can use the SET statement to set the value of a property in a script, which changes the value just for that job at runtime.
- You can add parameters that set the values of properties at runtime when you execute a job at the command line, which changes the value just for that job.
- You can edit the service configuration files through a Remote Desktop session to the name node of the cluster to permanently change the values of properties for the cluster.
In most cases you should consider setting configuration property values at runtime using the SET statement or a command parameter. This allows you to specify the most appropriate settings for each job, such as compression options or other query optimizations. However, sometimes you may need to edit the configuration files to set the properties for all jobs.
The following sections of this chapter describe these techniques, and provide information on where the settings are located in the configuration files. The important points to note are:
- If you use the SET statement in a script or a command parameter to change settings at runtime, the setting will apply only for the duration of that job. This is typically appropriate for settings such as compression or query optimization because each job may require different settings.
- If you change the values in the configuration files, the settings are applied to all of the jobs you execute, and so you must ensure that these settings are appropriate for all subsequent jobs, or edit them as required for each job. Where jobs require different configuration settings you should avoid editing the configuration files if possible.
- Changes you make to configuration files are lost if you delete and recreate the cluster. The new cluster will always have the default HDInsight configuration settings. You cannot simply upload the old configuration files to the new cluster because details of the cluster, such as the IP addresses of the nodes and the database configuration, will differ each time.
- In some cases you can maintain sets of edited configuration files in the cluster and choose which is applied to each job at runtime. This may prove a useful way to set properties that cannot be set dynamically at runtime using the SET statement or a command parameter.
Setting Properties at Runtime

In general you should consider setting the values of configuration properties at runtime when you submit a job to the cluster. In a Hive query you can use the SET statement to set the value of a property. For example, the following statements at the start of a query set the compression options for that job.

HiveQL
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
...
Alternatively, when executing a Hadoop command, you can use the -D parameter to set property values, as shown here.

Command Line
hadoop [COMMAND] -D property=value
The space between the -D and the property name can be omitted.
You can also specify that the cluster should use an alternative set of configuration files for this job, which allows you to maintain different groups of property settings and avoids the need to add them to every command line. In addition, you can execute a range of user or administrative commands such as setting the logging level.
The general syntax of a command is as follows (all of the parameters are optional):

Command Line
hadoop [--config confdir] [COMMAND] [GENERIC_OPTS] [COMMAND_OPTS]

- confdir specifies the path of the folder containing alternative configuration settings.
- COMMAND is the job command to execute, such as a map/reduce query.
- GENERIC_OPTS are option settings specific to this job, prefixed with a single hyphen. In addition to the -D parameter to set a property you can specify the files, archives, JAR files, or other parameters required to execute the job.
- COMMAND_OPTS are user or administrative commands to execute.
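If you generate job commands from a script, the parts of this syntax can be assembled programmatically. The following Python sketch builds a command string in the order shown above; the function name and its defaults are illustrative, not part of any Hadoop or HDInsight API.

```python
def build_hadoop_command(command, config_dir=None, properties=None, extra_args=None):
    """Assemble: hadoop [--config confdir] [COMMAND] [GENERIC_OPTS] [COMMAND_OPTS]."""
    parts = ["hadoop"]
    # Optional alternative configuration folder.
    if config_dir:
        parts += ["--config", config_dir]
    parts.append(command)
    # Generic options: -D property=value pairs come before command options.
    for name, value in (properties or {}).items():
        parts += ["-D", "%s=%s" % (name, value)]
    # Any remaining user or administrative command options.
    parts += list(extra_args or [])
    return " ".join(parts)
```

For example, `build_hadoop_command("fs -ls", properties={"mapred.output.compress": "true"})` produces a command line that sets the property just for that invocation.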
For more details of the options you can use in a job command, see the Apache Hadoop Commands Manual. It's also possible to access and manipulate configuration information using code in conjunction with the Configuration class. For more information see the Apache Hadoop reference guide to the Configuration class.
Setting Properties by Editing the Configuration Files

The management portal provides a link that connects to the name node of your cluster over Remote Desktop Protocol (RDP). You can use this approach to access and edit the Hadoop configuration files within your HDInsight cluster. Figure 4 shows a schematic view of the configuration files for the cluster, for a site within the cluster, and for the individual daemons.
Figure 4 Schematic view of the configuration options for HDInsight

For a full list of all of the configuration settings, and descriptions of their usage, go to the Current Documentation page of the Apache Hadoop site and use the links in the Configuration section of the navigation bar to select the appropriate configuration file.

HDInsight Cluster and Security Configuration

HDInsight contains three main configuration files that determine the behavior of the cluster as a whole. These are shown in the following table, together with the cluster topology file. With the exception of the first item in the table, all of the configuration files are in the folder C:\apps\dist\hadoop-[version]-SNAPSHOT\conf.
File: C:\Config\[your-cluster-id].IsotopeHeadNode_IN_0.1.xml
Description: This file contains the network topology information and runtime details of the HDInsight cluster. You should not edit this file.

File: hadoop-policy.xml
Description: This file contains the authorization settings for the features of the cluster such as connecting to the nodes of the cluster, submitting jobs, and performing administrative tasks. By default each operation allows any user (*). You can replace this with a comma-delimited list of permitted user names or user group names.

File: mapred-queue-acls.xml
Description: This file contains the authorization settings for the map/reduce queues. If you add the mapred.acls.enabled property with the value true to the configuration file conf\mapred-site.xml you can use the queue settings to specify a comma-delimited list of user names or user group names that can access the job submission queue and the administration queue.

File: capacity-scheduler.xml
Description: This file contains the scheduling settings for tasks that Hadoop will execute. You can change settings that specify a wide range of factors that control how users' jobs are queued, prioritized, and managed.

The IsotopeHeadNode file defines the configuration of the cluster, and you should avoid changing any values in it. The next two files shown in the table define the authorization rules for the cluster. You can specify the list of user account names or user account group names for a wide range of operations by editing the hadoop-policy.xml file. For example, you can control which users or groups can submit jobs by setting the following property:

XML
<property>
  <name>security.client.protocol.acl</name>
  <value>username1,username2 groupname1,groupname2</value>
</property>

The ACL is a comma-separated list of user names and a comma-separated list of group names, with the two lists separated by a space. A special value of "*" means all users are allowed.
To control which users and groups can access the default and the administration queues, set the mapred.queue.default.acl-submit-job and mapred.queue.default.acl-administer-jobs properties in the mapred-queue-acls.xml configuration file. For example, you can control which users or groups can access the administration queue to view job details, and modify a job's priority, by setting the following property:

XML
<property>
  <name>mapred.queue.default.acl-administer-jobs</name>
  <value>username1,username2 groupname1,groupname2</value>
</property>

As with the hadoop-policy.xml file, the ACL is a comma-separated list of user names and a comma-separated list of group names, with the two lists separated by a space. A special value of "*" means all users are allowed, and a single space means no users are allowed except the job owner. Note that you must add the mapred.acls.enabled property to the conf\mapred-site.xml configuration file and set it to true in order to enable the settings in the mapred-queue-acls.xml configuration file. You can also override the setting for administrators by adding the mapreduce.cluster.administrators property to the conf\mapred-site.xml configuration file and specifying the ACLs. The last file listed in the previous table, capacity-scheduler.xml, can be used to manage the allocation of jobs submitted by users, and allow higher priority jobs to run as soon as possible. As a simple example, the following settings limit the maximum number of active tasks per queue to 100 and the maximum number of active tasks per user to 10.

XML
<property>
  <name>mapred.capacity-scheduler.default-maximum-active-tasks-per-queue</name>
  <value>100</value>
</property>
<property>
  <name>mapred.capacity-scheduler.default-maximum-active-tasks-per-user</name>
  <value>10</value>
</property>
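The ACL value format (comma-separated users, then a space, then comma-separated groups) is easy to get wrong when editing by hand. As an illustration only, the following Python sketch shows how that format can be interpreted; Hadoop does this parsing internally, and this function is not part of any Hadoop API.

```python
def parse_hadoop_acl(value):
    """Return (users, groups, allow_all) from a Hadoop ACL string.

    "*"            -> everyone allowed
    " " (a space)  -> nobody allowed (except the job owner)
    "u1,u2 g1,g2"  -> users [u1, u2], groups [g1, g2]
    """
    if value.strip() == "*":
        return [], [], True
    users_part, _, groups_part = value.partition(" ")
    users = [u for u in users_part.split(",") if u]
    groups = [g for g in groups_part.split(",") if g]
    return users, groups, False
```

A small check like this can be useful in a validation script before copying an edited mapred-queue-acls.xml file back to the cluster.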
HDInsight Site Configuration

Hadoop includes a set of default XML configuration files for the site (each node in the cluster) that are read-only, and act as the templates for the site-specific configuration files. The default files contain a description for each of the property settings. The main default files are shown in the following table.

File: C:\apps\dist\hadoop-[version]-SNAPSHOT\src\core\core-default.xml
Description: This file contains the global property settings for Hadoop features including logging, the file system I/O settings, inter-site and client connectivity, the web interfaces, the network topology, and authentication for the web consoles.

File: C:\apps\dist\hadoop-[version]-SNAPSHOT\src\hdfs\hdfs-default.xml
Description: This file contains the property settings for HDFS such as the IP addresses of the cluster members, the replication settings, authentication information, and directory folder locations.

File: C:\apps\dist\hadoop-[version]-SNAPSHOT\src\mapred\mapred-default.xml
Description: This file contains property settings that control the behavior of the map/reduce engine. This includes the job tracker and task tracker settings, the memory configuration, data compression, processing behavior and limits, authentication settings, and notification details.

File: C:\apps\dist\oozie-[version]\oozie-[version]\conf\oozie-default.xml
Description: This file contains property settings that allow you to specify the behavior of Oozie such as the purging of old workflow jobs, the queue size and thread count, concurrency, throttling, timeouts, the database connection details for storing metadata, and much more.
In most cases you should set the same configuration property values on every server in the cluster. This is particularly true in HDInsight where all of the cluster members are equivalent in virtual machine size, power, resources, and processing speed. Hadoop includes a rudimentary replication mechanism that can synchronize the configuration across all the servers in a cluster. However, where you set up an on-premises Hadoop cluster using different types of servers for the data nodes you may need to edit the configuration on individual servers.
You should not edit these default files. Instead, you edit the equivalent files located in the C:\apps\dist\hadoop-[version]-SNAPSHOT\conf\ folder of each site: core-site.xml, hdfs-site.xml, and mapred-site.xml. These site-specific files do not contain all of the property settings that are in the default files; they merely override the default settings for that site. If you want to override a property setting inherited from the default configuration that is not included in the site-specific configuration file you must copy the element into the site-specific file. For example, to change the setting for the amount of memory allocated to the buffer that is used for sorting items during a map/reduce operation, you would need to copy the following element from the mapred-default.xml file (in C:\apps\dist\hadoop-[version]-SNAPSHOT\src\mapred\) into the mapred-site.xml file (in C:\apps\dist\hadoop-[version]-SNAPSHOT\conf\):

XML
<property>
  <name>io.sort.mb</name>
  <value>150</value>
  <description>The total amount of buffer memory to use while sorting files, in megabytes.</description>
</property>

Some of the settings that it may be useful to edit are described in the section Tuning Query Performance in Chapter 4 of this guide.
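Copying property elements between files by hand is error prone. Purely as an illustration of the override mechanism (this is not a supported HDInsight tool), the following Python sketch uses the standard library to set or replace a property in a Hadoop-style configuration document, which could then be written back to a file such as mapred-site.xml.

```python
import xml.etree.ElementTree as ET


def set_property(config_xml, name, value):
    """Set (or override) a <property> in a Hadoop <configuration> document.

    Accepts and returns the XML as a string so it can be round-tripped
    to and from a site-specific configuration file.
    """
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # The property already exists in the site file; replace its value.
            prop.find("value").text = value
            break
    else:
        # Not present yet; add the element, as you would when copying it
        # from the corresponding default file.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")
```

For example, `set_property(site_xml, "io.sort.mb", "150")` adds or updates the sort buffer setting shown above.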
You can see the values of all the properties used by a job in the Job Tracker portal. Open a Remote Desktop connection to the cluster from the main Windows Azure HDInsight portal and choose the desktop link named Hadoop MapReduce Status. Scroll to the Retired Jobs section of the Map/Reduce Administration page and select the job ID. In the History Viewer page for the job select the JobConf link near the top of the page.
HDInsight Tools Configuration

Most of the tools and daemons that are included in HDInsight have configuration files that determine their behavior. The following table lists some of these files. All of these are in subfolders of the C:\apps\dist\ folder.

File: hive-[version]\conf\hive-site.xml
Description: This file contains the settings for the Hive query environment. You can use it to specify most features of Hive such as the location of temporary test and data files, concurrency support, the maximum number of reducer instances, task progress tracking, and more.

File: templeton-[version]\conf\templeton-site.xml
Description: This file contains the settings for WebHCat (also referred to as Templeton, hence the file name), which is a REST API for metadata management and remote job submission. Most of the settings are the location of the services and code libraries used by WebHCat; however, you can use this file to set the port number that WebHCat will listen on.

File: sqoop-[version]\conf\sqoop-site.xml
Description: This file contains the settings for the Sqoop data transfer utility. You can use it to specify connection manager factories and plugins that Sqoop will use to connect to data stores, manage features of these connections, and specify the metastore that Sqoop will use to store its metadata.

File: oozie-[version]\conf\oozie-site.xml
Description: This file contains settings for the Oozie workflow scheduler. You can use it to specify authorization details, configure the queues, and provide information about the metadata store Oozie will use.

File: oozie-[version]\oozie-server\conf\web.xml
Description: This file contains settings that specify the servlets Oozie will use to respond to web requests. It also contains the settings for other features such as session behavior, MIME types, default documents, and more.
For more information about these configuration files go to the projects list page of the Apache website and use the index to find the tool or project you are interested in, then follow the Project Website link in the main window. For example, the project website for Hive is http://hive.apache.org. Most of the project sites have a Documentation link in the navigation bar.

Retaining Data and Metadata for a Cluster

Windows Azure HDInsight is a pay-for-what-you-use service. When you just want to use a cluster for experimentation, or to investigate interesting questions against some data, you will usually tear down the cluster when you have the answer (or when you discover that there is no interesting question for that particular data) in order to minimize cost. However, there may be scenarios where you want to be able to release a cluster and restore it at a later time. For example, you may want to run a specific set of processes on a regular basis, such as once a month, and just upload new data for that month while reusing all the Hive tables, indexes, UDFs, and other objects you define in the cluster. Alternatively, you might want to scale out by adding more cluster nodes. The only way to do this is to delete the cluster and recreate it with the number of nodes you now require. The data you use in an HDInsight cluster is stored in Windows Azure blob storage, and so it can be persisted when you tear down a cluster. When you create a new cluster you can specify an existing storage account that holds your cluster data as the source for the new cluster. This means that you immediately have access to the original data, and the data remains available after the cluster has been deleted. However, the metadata for the cluster, which includes the Hive table and index definitions, is stored in a SQL Server database that is automatically provisioned when you create a new cluster, and destroyed when the cluster is removed from HDInsight.
If you want to be able to delete and then recreate a cluster you have two choices:
- You can allow HDInsight to destroy the metadata and remove the SQL Database when you delete the cluster, then run scripts that recreate the metadata in a new cluster. These scripts are the same as you used to create the Hive tables, indexes, and other metadata in the original cluster.
- You can configure the cluster to use a separate SQL Database instance that you can then retain when the cluster is deleted. After you recreate the cluster based on the same blob storage container in an existing storage account, you configure the new cluster to use the existing SQL Database containing the metadata.
The first of these options is simpler, and means that you do not need to preconfigure the cluster or change the configuration of the existing cluster, although you must create the original cluster over an existing blob container in your Windows Azure storage account by choosing the Advanced Create option. However, you must have access to all of the scripts (you can combine them into one script) to recreate all of the metadata. The only real issue may be the time it takes to execute the script if the metadata is very complex. The second of these options, described in more detail in the next section of this chapter, is more complicated; but it may be a good choice if you have very large volumes of metadata. Windows Azure storage automatically maintains replicas of the data in your blobs, and SQL Database automatically maintains copies of the data it contains, so your data will not be lost should a disaster occur at a datacenter. However, this does not protect against accidental or malicious damage to the data because the changes will be replicated automatically. If you need to keep a backup for disaster recovery you can copy the blob data to another datacenter storage account or to an on-premises location, and export a backup from the SQL Database that stores the metadata as a file you can store in another storage account or an on-premises location.

Configuring a Cluster to Support Redeployment

To maintain the metadata in a cluster, and to be able to access it in a new cluster, you can change the location of the metadata storage by following these steps:

1. Open a Remote Desktop connection to your HDInsight cluster and obtain its public IP address from the following section of the configuration file named your-cluster-id.IsotopeHeadNode_IN_0.1.xml (located in the c:\apps\dist folder):

XML
...
<Instances>
  <Instance id="IsotopeHeadNode_IN_0" address="cluster-ip-address">

2. Create a new Windows Azure SQL Database instance, and add the cluster's IP address to the list of IP addresses that are permitted to access the server.

3. Open the query window in SQL Database and execute the script HiveMetaStore.sql that is provided in the Hive distribution folder of HDInsight (C:\apps\dist\hive-version) to create the required tables in the database.

4. In the Remote Desktop connection to your HDInsight cluster locate the file named hive-site.xml in the C:\apps\dist\hive-version\conf folder, and edit the following properties to specify the connection details for the SQL Database instance.

XML
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:sqlserver://your-server-name.database.windows.net:1433;
    database=your-db-name;encrypt=true;trustServerCertificate=true;create=false</value>
</property>
...
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>your-username@your-server-name</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>your-server-password</value>
</property>

The value of each property must be all on one line with no spaces; the connection URL is broken across lines here only because of page width limitations.

5. In the Remote Desktop window open the Services application from the Administrative Tools menu and restart the Apache Hadoop metastore service.
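Because the connection URL must be a single line with no spaces, it can help to generate it from a script rather than typing it by hand. The following Python sketch builds the value for javax.jdo.option.ConnectionURL using the server, database, and options shown in the example above; the function itself is illustrative and not part of any SDK.

```python
def build_metastore_url(server, database):
    """Build a javax.jdo.option.ConnectionURL value as one space-free line."""
    options = {
        "database": database,
        "encrypt": "true",
        "trustServerCertificate": "true",
        "create": "false",  # prevents the metastore being recreated each time
    }
    opts = ";".join("%s=%s" % (k, v) for k, v in options.items())
    return "jdbc:sqlserver://%s.database.windows.net:1433;%s" % (server, opts)
```

The resulting string can be pasted directly into the value element of hive-site.xml.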
After you restart the service HDInsight will store all of the metadata in your database, and it will not be deleted or recreated each time (the connection string in the javax.jdo.option.ConnectionURL property includes the instruction create=false). This means that Hive table, partition, and index definitions are available in a new cluster that uses this database as the metastore. If an existing cluster fails, or if you close it down and want to recreate it later, you do not need to rebuild all of the metadata. If you want to preserve the results of Hive queries you must ensure that they use the EXTERNAL and LOCATION keywords when you create the tables so that the data is not removed when the tables are deleted. See the section Managing Hive Table Data Location and Lifetime in Chapter 4 for more information.

Automating Cluster Deletion and Provisioning

As you saw earlier in this chapter, you can use scripts, utilities, and custom .NET applications to interact with the HDInsight service management API. The API includes operations that list, delete, and create clusters, and these can be used to automate cluster deletion and provisioning operations. For example, you could create a script that removes all of the clusters by iterating over the list exposed by the API and deleting each one in turn. As long as metadata storage has been redirected to your own separate Windows Azure SQL Database you will be able to recreate any clusters when you need to use them, but reduce your costs in the meantime. Alternatively, to provide disaster recovery, you could write a program that runs on a schedule to copy all of the data from the Windows Azure blob storage used by the cluster to another location (including an on-premises location if that is appropriate), and then export the SQL Database containing the metadata and save this backup file in another location.
Just be aware of the data transfer costs that this will incur, especially if you have a great deal of data in your cluster's blob storage. Scripts can be copied to the cluster through a Remote Desktop connection. However, keep in mind that these will not be available after you recreate the cluster.

Monitoring HDInsight Clusters and Jobs

There are several ways to monitor an HDInsight cluster and its operation:
- Use the HDInsight cluster management portal.
- Access the monitoring system directly.
- Access the Hadoop monitoring pages.
- Access the Hadoop log files directly.
The following sections of this chapter provide more information about each of these options.

Using the HDInsight Cluster Management Portal

After you connect to an HDInsight cluster management portal you can open the Monitor Cluster page. This page displays a selection of charts that show the metrics for the cluster including the number of mappers and reducers, the queue size, and the number of jobs. It also shows whether the name node and the job tracker are working correctly, and the health state of each node in the cluster. You can drill down into each of the metrics and change the timescale for the chart from the default of the previous 30 minutes up to the previous two weeks, as shown in Figure 5.
Figure 5 Drilling down into the job history in the Monitor Cluster page
Viewing Job History

The main HDInsight management portal page also allows you to access the Job History page. This page displays a list of previous and current jobs. It shows the command line that started the job, the start and end times, and the status for each job. You can select a job to see more details, or download the entire job history for examination in an application such as Microsoft Excel. For more information about using the Windows Azure HDInsight management portal, see How to Monitor HDInsight Service.

Accessing the Monitoring System Directly

The Monitor Cluster page in the Windows Azure HDInsight management portal uses the Ambari framework to obtain the data from the cluster. You can access Ambari directly using a client tool that can construct and send HTTPS requests to the Ambari REST interface on port 563 of the name node to perform your own queries and obtain more details. For example, you can connect to the Ambari monitoring interface using the following URL: https://your-cluster.azurehdinsight.net:563/ambari/api/monitoring Some of the client tools suitable for making these kinds of requests are listed in the section Creating REST URLs and Commands Directly earlier in this chapter. There is also a web client you can use to access Ambari in the .NET SDK for Hadoop, which is described in the section Using Code to Access the REST APIs earlier in this chapter. For information about using Ambari see Ambari API Reference in GitHub.

Accessing the Hadoop Monitoring Pages

The section Accessing Performance Information in Chapter 4 described how to access and use the Hadoop built-in websites to determine the behavior of a cluster in order to maximize performance. Several sections of these websites can be used to obtain more general monitoring information such as physical operations carried out by the job scheduler and the individual nodes in the cluster, the configuration settings applied to each job, and the time taken for each stage of the jobs.
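Requests to the Ambari endpoint can also be scripted. The following Python sketch only assembles the request URL in the form shown above; the cluster name is a placeholder, the available resource paths are documented in the Ambari API reference, and sending the request (with the cluster's admin credentials over HTTPS) is left to whichever HTTP client you prefer.

```python
def ambari_monitoring_url(cluster_name, resource="monitoring"):
    """Build the HTTPS URL for the Ambari REST interface on port 563."""
    return "https://%s.azurehdinsight.net:563/ambari/api/%s" % (
        cluster_name, resource)

# The returned URL can be fetched with any HTTP client that supports
# basic authentication, supplying the cluster administrator credentials.
```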
The easiest way to access the Hadoop status and monitoring websites is to open a Remote Desktop connection to the cluster from the main Windows Azure HDInsight portal and use the links named Hadoop MapReduce Status and Hadoop Name Node Status located on the desktop of the remote virtual server. The Job Tracker portal is accessed on port 50030 of your cluster's head node. The Health and Status Monitor is accessed on port 50070 of your cluster's head node.

Accessing Hadoop Log Files Directly

With a Remote Desktop connection open to the name node of your cluster, you can browse the file system and examine the log files that Hadoop generates. By default the job history files (including the stdout and stderr files) are written to the central _logs\history folder of the Hadoop installation, and to the user's job output folder. These log files can be accessed from the Hadoop command line using the commands shown here.

Command Line
$ bin/hadoop job -history <output-dir>
$ bin/hadoop job -history all <output-dir>

where <output-dir> is the folder containing the log files. If required, you can change the folder where log files are created by setting the hadoop.job.history.location and hadoop.job.history.user.location properties in the mapred-site.xml configuration file, or from the command line by using the -D parameter.
You can prevent any log files from being written to the user folder by setting the hadoop.job.history.user.location property value to none.
Changing the Logging Parameters

Hadoop uses the Apache log4j mechanism in conjunction with the Apache Commons Logging framework to generate log files. The configuration for this mechanism can be found in the file log4j.properties in the conf\ folder. You can edit this file to specify non-default log formats and other logging settings if required.

Using Hadoop Metrics

Various parts of the Hadoop code expose metrics information that you can collect and use to analyze the performance and health of a cluster. There are four contexts for the information: the Java virtual machine, the distributed file system, the map/reduce core system, and the remote procedure call (RPC) context. Each can be configured separately to save metrics into files that you can examine. For more information, see Hadoop Metrics in the Cloudera blog and Package org.apache.hadoop.metrics2 on the Apache website.

Summary

This chapter described the common tasks and operations you are likely to carry out when administering a Windows Azure HDInsight cluster. It covered a range of management and monitoring tasks including interacting with the cluster using a range of tools and techniques, setting configuration properties, recreating a cluster and its data, and monitoring the cluster and the jobs it executes. HDInsight supports a range of techniques for remote administration and monitoring. These include using the HDInsight management portal pages, accessing the servers directly using a Remote Desktop connection, and interacting with the service management APIs that HDInsight exposes. This chapter does not attempt to cover every management option that is available, every configuration setting, or every feature of the APIs and the tools. However, it does provide an overview of the possibilities and techniques that you can use, and will help you to familiarize yourself with the tasks that you might need to carry out when using HDInsight.
More Information For more information about HDInsight, see the Windows Azure HDInsight web page. For more information about the Microsoft .NET SDK for Hadoop, see the Codeplex project site.
Scenario 1: Iterative Data Exploration

This chapter marks a change in style for the guide. To supplement the general information and guidance in the previous chapters, this and the following three chapters describe a series of practical scenarios for using a Big Data solution such as HDInsight. Each scenario focuses on a fictitious organization with different requirements, which can be fulfilled by using HDInsight. The scenarios map to the four architectural approaches described in Chapter 2 of this guide:
- As a standalone data analysis experimentation and visualization tool.
- As a data transfer, data cleansing, or ETL mechanism.
- As a basic data warehouse or commodity storage mechanism.
- Integration with an enterprise data warehouse and BI system.
In this first scenario, which maps to the standalone data analysis experimentation and visualization approach, HDInsight is used to perform some basic analysis of data from Twitter. Social media analysis is a common Big Data use case, and this scenario demonstrates how to extract information from semi-structured data in the form of tweets. However, the goal of this scenario is not to provide a comprehensive guide to analyzing data from Twitter, but rather to show an example of iterative data exploration with HDInsight and demonstrate some of the techniques discussed in this guide. Specifically, this scenario demonstrates how to:
- Identify a source of data and define the analytical goals.
- Process data iteratively by using custom map/reduce code, Pig scripts, and Hive.
- Review data processing results to see if they are both valid and useful.
- Refine the solution to make it repeatable, and integrate it with the company's existing BI systems.
This chapter describes the scenario in general, and the solutions applied to meet the goals of the scenario. If you would like to work through the solution to this scenario yourself you can download the files and a step-by-step guide. These are included in the associated Hands-on Labs available from http://wag.codeplex.com/releases/view/103405.

Finding Insights in Data

In Chapter 1 of this guide you saw that one of the typical uses of a Big Data solution such as HDInsight is to explore data that you already have, or data you collect speculatively, to see if it can provide insights into information that you can use within your organization. Chapter 1 used the question flow shown in Figure 1 as an example of how you might start with a guess based on intuition, and progress towards a repeatable solution that you can incorporate into your existing BI systems. Or, perhaps, to discover that there is no interesting information in the data, but the cost of discovering this has been minimized by using a pay-for-what-you-use mechanism that you can set up and then tear down again very quickly and easily.
Figure 1 The iterative exploration cycle for finding insights in data

Introduction to Blue Yonder Airlines

This scenario is based on a fictitious company named Blue Yonder Airlines. The company is an airline serving passengers in the USA, and operates flights from its home hub at JFK airport in New York to Sea-Tac airport in Seattle and LAX in Los Angeles. The company has a customer loyalty program named Blue Yonder Points. Some months ago, the CEO of Blue Yonder Airlines was talking to a colleague who mentioned that he had seen tweets that were critical of the company's seating policy. The CEO decided that the company should investigate this possible source of valuable customer feedback, and instructed her customer service department to start using Twitter as a means of communicating with its customers. Customers send tweets to @BlueYonderAirlines and use the standard Twitter convention of including hashtags to denote key terms in their messages. In order to provide a basis for analyzing these tweets, the CEO also asked the BI team to start collecting any that mention @BlueYonderAirlines.

Analytical Goals and Data Sources

Initially the plan is simply to collect enough data to begin exploring the information it contains. To determine if the results are both useful and valid, the team must collect enough data to generate a statistically valid result. However, they do not want to invest significant resources and time at this point, and so use a manual interactive process for collecting the data from Twitter using the public API. Even though we don't know what we are looking for at this stage, we still have an analytical goal: to explore the data and discover if, as we are led to believe, it really does contain useful information that could provide a benefit for our business.
Although there is no specific business decision under consideration, the customer services managers believe that some analysis of the tweets sent by customers may reveal important information about how they perceive the airline and the issues that matter to them. The kinds of questions the team expects to answer are:

- Are people talking about Blue Yonder Airlines on Twitter?
- If so, are there any specific topics that regularly arise?
- Of these topics, if any, is it possible to get a realistic view of which are the most important?
- Does the process provide valid and useful information? If not, can it be refined to produce more accurate and useful results?
- If the results are valid and useful, can the process be incorporated into the company's BI systems to make it repeatable?
Collecting and Uploading the Source Data

The BI team at Blue Yonder Airlines has been collecting data from Twitter and uploading it to Windows Azure blob storage ready to begin the investigation. They have gathered sufficient data over a period of weeks to ensure a suitable sample for analysis. The source data and results will be retained in Windows Azure blob storage, for visualization and further exploration in Excel, after the investigation is complete.

A full description of the process the team uses to collect the data from Twitter is not included here. However, Appendix A of this guide describes the ways that the team at Blue Yonder Airlines developed and refined the data collection process over time, as they became aware that the analysis would be able to provide useful and valid information for the business.

If you are just experimenting with data to see if it is useful, you probably won't want to spend inordinate amounts of time and resources building a complex or automated data ingestion mechanism. Sometimes it's just easier and quicker to use the interactive approach!

The HDInsight Infrastructure

The IT department at Blue Yonder Airlines does not have the capital expenditure budget to purchase new servers for the project, especially as the hardware may only be required for the duration of the project. Therefore, the most appropriate infrastructure approach is to provision an HDInsight cluster on Windows Azure that can be released when no longer required, minimizing the overall cost of the project. To further minimize cost the team began with a single-node HDInsight cluster, rather than accepting the default of four nodes when initially provisioning the cluster. However, the team can provision a larger cluster should the processing load indicate that more nodes are required.
At the end of the investigation, if the results indicate that valuable information can be extracted from the data, the team can fine-tune the solution to use a cluster with the appropriate number of nodes by using the techniques described in the section Tuning the Solution in Chapter 4 of this guide.
For details of the cost of running an HDInsight cluster, see the pricing details on the Windows Azure HDInsight website. In addition to the cost of the cluster, you must pay for a storage account to hold the data for the cluster.

Iterative Data Processing

After capturing a suitable volume of data and uploading it to Windows Azure blob storage, the analysts in the BI team can configure an HDInsight cluster associated with the blob container holding the data, and then turn their attention to processing it. In the absence of a specific analytical goal, the data processing follows an iterative pattern in which the analysts explore the data to see if they find anything that indicates specific topics of interest for customers, and then build on what they find to refine the analysis. At a high level, the iterative process breaks down into three phases:

- Explore: the analysts explore the data to determine what potentially useful information it contains.
- Refine: when some potentially useful data is found, the data processing steps used to query the data are refined to maximize the analytical value of the results.
- Stabilize: when a data processing solution that produces useful analytical results has been identified, it is stabilized to make it robust and repeatable.
The following sections of this chapter describe these stages in more detail.
Although many Big Data solutions will be developed using the stages described here, it's not mandatory. You may know exactly what information you want from the data, and how to extract it. Alternatively, if you don't intend to repeat the process there's no point in refining or stabilizing it.
Phase 1: Initial Data Exploration

In the first phase of the analysis, the team at Blue Yonder Airlines decided to explore the data speculatively by coming up with a hypothesis about what information the data might be able to reveal, and using HDInsight to process the data and generate results that test the hypothesis. The goal for this phase is open-ended; the exploration might result in a specific avenue of investigation that merits further refinement of a particular data processing solution, or it might simply prove (or disprove) an assumption about the data.

Using Hive to Explore the Volume of Tweets

In most cases, the simplest way to start exploring data with HDInsight is to create a Hive table, and then query it with HiveQL statements. The analysts at Blue Yonder Airlines created a Hive table based on the source data obtained from Twitter. The following HiveQL code was used to define the table and load the source data into it.

HiveQL
CREATE EXTERNAL TABLE Tweets
  (PubDate STRING, TweetID STRING, Author STRING, Tweet STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/tweets';
LOAD DATA INPATH '/data/tweets.txt' INTO TABLE Tweets;

The analysts hypothesized that the use of Twitter to communicate with the company is significant, and that the volume of tweets that mention the company is growing. They therefore used the following query to determine the daily volume and trend of tweets.

HiveQL
SELECT PubDate, COUNT(*) TweetCount FROM Tweets
GROUP BY PubDate SORT BY PubDate;

The results of this query are shown in the following table.

PubDate      TweetCount
4/16/2013    1964
4/17/2013    2009
4/18/2013    2058
4/19/2013    2107
4/20/2013    2160
4/21/2013    2215
4/22/2013    2274

These results seem to validate the hypothesis that the volume of tweets is growing, and it may be worth refining this query to include a larger set of source data that spans a longer time period, and potentially including other aggregations in the results such as the number of distinct authors that tweeted each day. However, while this analytical approach might reveal some information about the importance of Twitter as a channel for customer communication, it doesn't provide any information about the specific topics that concern customers. To determine what's important to the airline's customers, the analysts must look more closely at the actual contents of the tweets.

Using Map/Reduce Code to Identify Common Hashtags

Twitter users often use hashtags to associate topics with their tweets. Hashtags are words prefixed with a # character, and are typically used to filter searches and to find tweets about a specific subject. The analysts at Blue Yonder Airlines hypothesized that analysis of the hashtags used in tweets addressed to the airline's account might reveal topics that are important to customers. The team decided to find the hashtags in the data, count the number of occurrences of each hashtag, and from this determine the main topics of interest. Parsing the unstructured tweet text and identifying hashtags can be accomplished by implementing a custom map/reduce solution.
HDInsight includes a sample map/reduce solution called WordCount that counts the number of words in a text source, and this can easily be adapted to count hashtags. The sample is provided in various forms including Java, JavaScript, and C#. The developers at Blue Yonder Airlines are familiar with JavaScript, and so they decided to implement the map/reduce code by modifying the JavaScript version of the sample.

There are many ways to create and execute map/reduce functions, though often you can use Hive or Pig directly instead of resorting to writing map/reduce code. However, in this case there is a map/reduce code example available that we can use as a starting point, which saves us time and effort. For simplicity we execute the code within Pig, instead of executing it as a map/reduce job directly.

The script consists of a map function that runs on each cluster node in parallel and parses the text input to create a key/value pair with the value 1 for every word found. It was relatively simple to modify the code to filter the words so that only those prefixed by # are included. These key/value pairs are passed to the reduce function, which counts the total number of instances of each key. Therefore, the result is the number of times each hashtag was mentioned in the source data. The modified JavaScript code is shown below.

JavaScript
// Map function - runs on all nodes
var map = function (key, value, context) {
  // Split the data into an array of words
  var hashtags = value.split(/[^0-9a-zA-Z#_]/);
  // Loop through the array creating a name/value pair with
  // the value 1 for each word that starts with '#'
  for (var i = 0; i < hashtags.length; i++) {
    if (hashtags[i].substring(0, 1) == "#") {
      context.write(hashtags[i].toLowerCase(), 1);
    }
  }
};
// Reduce function - runs on reducer node(s)
var reduce = function (key, values, context) {
  var sum = 0;
  // Sum the counts of each tag found in the map function
  while (values.hasNext()) {
    sum += parseInt(values.next());
  }
  context.write(key, sum);
};

The script was uploaded to the cluster as tagcount.js using the fs.put command, and executed by running the following Pig command from the interactive console in an HDInsight server.

Interactive JavaScript Console
pig.from("/twitterdata/tweets")
   .mapReduce("/twitterdata/tagcount.js", "tag, count:long")
   .orderBy("count DESC")
   .take(5)
   .to("/twitterdata/tagcounts");
Note that the statement in this listing and statements in listings elsewhere in this chapter have been broken down over several lines due to limitations of the width of the printed page.
This command executes the map/reduce job on the files in the /twitterdata/tweets folder (used by the Tweets table that was created previously), sorts the results by the number of instances of each hashtag found in the data, filters the results to include only the top five hashtags, and stores the results in the /twitterdata/tagcounts folder. The job generates a file named part-r-00000 containing the top five most common hashtags and the total number of instances of each one. To visualize the results, the analysts used the following commands in the interactive JavaScript console.

Interactive JavaScript
file = fs.read("/twitterdata/tagcounts");
data = parse(file.data, "tag, count:long");
graph.pie(data);
The pie chart generated by these commands is shown in Figure 2.
Figure 2
Results of a map/reduce job to find the top five hashtags
The pie chart reveals that the top five hashtags are:

- #seatac (2004 occurrences)
- #jfk (2003 occurrences)
- #blueyonderairlines (1604 occurrences)
- #blueyonder (1604 occurrences)
- #blueyonderpoints (802 occurrences)
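Although the job runs across the cluster, the map and reduce logic itself is easy to reason about. The following Python sketch (with invented sample tweets, not the real Blue Yonder data) simulates what the JavaScript map and reduce functions do:

```python
import re
from collections import defaultdict

def map_tweet(tweet_text, emit):
    """Map step: emit a (hashtag, 1) pair for every #-prefixed token."""
    for token in re.split(r"[^0-9a-zA-Z#_]", tweet_text):
        if token.startswith("#"):
            emit(token.lower(), 1)

def reduce_counts(pairs):
    """Reduce step: sum the 1s emitted for each hashtag."""
    totals = defaultdict(int)
    for tag, one in pairs:
        totals[tag] += one
    return dict(totals)

# Invented sample tweets for illustration only
tweets = [
    "Stuck at #JFK again @BlueYonderAirlines",
    "Loving the lounge at #seatac #BlueYonder",
    "#jfk delays again",
]
pairs = []
for t in tweets:
    map_tweet(t, lambda tag, one: pairs.append((tag, one)))
counts = reduce_counts(pairs)
```

In the real job the emitted pairs are shuffled to the reducers by Hadoop rather than collected in a list, but the transformation of tweets into lowercase hashtag counts is the same.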
However, these results are not particularly useful in trying to identify the most common topics discussed in the tweets, and so the analysts decided to widen the analysis beyond hashtags and try to identify the most common words used in the text content of the tweets.

Using Pig to Group and Summarize Word Counts

It would be possible to use the original WordCount sample to count the number of occurrences of each word in the tweets, but the analysts decided to use Pig in order to simplify any customization of the code that may be required in the future. Pig provides a workflow-based approach to data processing that is ideal for restructuring and summarizing data, and a Pig Latin script that performs the aggregation is syntactically much easier to create than the equivalent custom map/reduce code.

Pig makes it easy to write procedural, workflow-style code that builds a result by repeated operations on a dataset. It also makes it easier to debug queries and transformations because you can dump the intermediate results of each operation in the script to a file. However, before you commit a Pig Latin script to final production, where it will be executed repeatedly, make sure you apply performance optimizations to it as described in Chapter 4 of this guide.

The Pig Latin code created by the analysts is shown below. It uses the source data files in the /twitterdata/tweets folder to count the number of occurrences of each word, then sorts the results in descending order of occurrences and stores the first 100 results in the /twitterdata/wordcounts folder.

Pig Latin
-- load tweets
Tweets = LOAD '/twitterdata/tweets' AS (date, id, author, tweet);
-- split tweet into words
TweetWords = FOREACH Tweets GENERATE FLATTEN(TOKENIZE(tweet)) AS word;

-- clean words by removing punctuation
CleanWords = FOREACH TweetWords GENERATE LOWER(REGEX_EXTRACT(word, '[a-zA-Z]*', 0)) AS word;

-- filter text to eliminate empty strings
FilteredWords = FILTER CleanWords BY word != '';

-- group by word
GroupedWords = GROUP FilteredWords BY (word);

-- count mentions per group
CountedWords = FOREACH GroupedWords GENERATE group, COUNT(FilteredWords) AS count;

-- sort by count
SortedWords = ORDER CountedWords BY count DESC;

-- limit results to the top 100
Top100Words = LIMIT SortedWords 100;
-- store the results as a file
STORE Top100Words INTO '/twitterdata/wordcounts';

This script is saved as WordCount.pig and executed from the Pig installation folder on the cluster by using the following command in the Hadoop command console:

Command Line
pig C:\Scripts\WordCount.pig

When the script has completed successfully, the results are stored in a file named part-r-00000 in the /twitterdata/wordcounts folder. These can be viewed using the #cat command. Partial results are shown below.

Extract from /twitterdata/wordcounts/part-r-00000
my             3437
delayed        2749
flight         2749
to             2407
entertainment  2064
the            2063
a              2061
delay          1720
of             1719
bags           1718

These results show that the word count approach has the potential to reveal some insights. For example, the high number of occurrences of "delayed" and "delay" is likely to be relevant in determining common customer concerns. However, the solution needs to be modified to restrict the output to include only significant words, which will improve its usefulness. To accomplish this the analysts decided to refine it to produce accurate and meaningful insights into the most common words used by customers when communicating with the airline by Twitter.

Phase 2: Refining the Data Processing Solution

After you find a question that does provide some useful insight into the data, it's usual to attempt to refine it to maximize the relevance of the results, minimize the processing overhead, and therefore improve its value. This is typically the second phase of an analysis process.
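Before refining the cluster-side solution, the word-count pipeline above can be prototyped locally in ordinary code to sanity-check the logic. The following Python sketch (a rough, hypothetical equivalent, not part of the Blue Yonder solution) mimics the tokenize, clean, filter, count, and limit steps of the Pig script:

```python
import re
from collections import Counter

def top_words(tweets, limit=100):
    """Mimic the Pig steps: TOKENIZE, REGEX_EXTRACT of leading letters,
    lowercase, drop empty strings, then GROUP/COUNT/ORDER/LIMIT."""
    words = []
    for tweet in tweets:
        for token in tweet.split():
            clean = re.match(r"[a-zA-Z]*", token).group(0).lower()
            if clean:
                words.append(clean)
    return Counter(words).most_common(limit)

# Invented sample tweets for illustration only
sample = ["My flight was delayed!", "Delayed again, my flight..."]
top3 = top_words(sample, 3)
```

Running the same transformations over a small local sample is a cheap way to verify that the cleaning regex and filters behave as intended before the full script is executed on the cluster.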
Typically you will only refine solutions that you intend to repeat on a reasonably regular basis. Be sure that they are still correct and produce valid results afterwards; for example, compare the outputs of the original and the refined solutions.

Having determined that counting the number of occurrences of each word has the potential to reveal common topics of customer tweets, the analysts at Blue Yonder Airlines decided to refine the data processing solution to improve its accuracy and usefulness. The starting point for this refinement is the WordCount.pig Pig Latin script described previously.

Enhancing the Pig Latin Code to Exclude Noise Words

The results generated by the WordCount.pig script contain a large volume of everyday words such as "a", "the", and "of". These occur frequently in every tweet but provide little or no semantic value in trying to identify the most common topics discussed in the tweets. Additionally, there are some domain-specific words such as "flight" and "airport" that, within the context of discussions about airlines, are so common that their usefulness in identifying topics of conversation is minimal. These noise words (sometimes called stop words) can be disregarded when analyzing the tweet content.

The analysts decided to create a file containing the stop list of words to be excluded from the analysis, and then use it to filter the results generated by the Pig Latin script. This is a common way to filter words out of searches and queries. An extract from the noise word file created by the analysts at Blue Yonder Airlines is shown below. This file was saved as noisewords.txt in the /twitterdata folder in the HDInsight cluster.

Extract from /twitterdata/noisewords.txt
a
i
as
do
go
in
is
it
my
the
that
what
flight
airline
terrible
terrific
You can obtain lists of noise words from various sources such as TextFixer and Armand Brahaj's blog. If you have installed SQL Server you can start with the list of noise words included in the Resource database. For more information, see Configure and Manage Stopwords and Stoplists for Full-Text Search.

With the noise words file in place, the analysts modified the WordCount.pig script to use a LEFT OUTER JOIN to match the words in the tweets with the words in the noise list. Only words with no matching entry in the noise words file are then included in the aggregated results. The modified script is shown below.

Pig Latin
-- load tweets
Tweets = LOAD '/twitterdata/tweets' AS (date, id, author, tweet);
-- split tweet into words
TweetWords = FOREACH Tweets GENERATE FLATTEN(TOKENIZE(tweet)) AS word;

-- clean words by removing punctuation
CleanWords = FOREACH TweetWords GENERATE LOWER(REGEX_EXTRACT(word, '[a-zA-Z]*', 0)) AS word;

-- filter text to eliminate empty strings
FilteredWords = FILTER CleanWords BY word != '';

-- load the noise words file
NoiseWords = LOAD '/twitterdata/noisewords.txt' AS noiseword:chararray;

-- join the noise words file using a left outer join
JoinedWords = JOIN FilteredWords BY word LEFT OUTER, NoiseWords BY noiseword USING 'replicated';

-- filter the joined words so that only words with
-- no matching noise word remain
UsefulWords = FILTER JoinedWords BY noiseword IS NULL;

-- group by word
GroupedWords = GROUP UsefulWords BY (word);

-- count mentions per group
CountedWords = FOREACH GroupedWords GENERATE group, COUNT(UsefulWords) AS count;

-- sort by count
SortedWords = ORDER CountedWords BY count DESC;

-- limit results to the top 100
Top100Words = LIMIT SortedWords 100;
-- store the results as a file
STORE Top100Words INTO '/twitterdata/noisewordcounts';

Partial results from this version of the script are shown below.

Extract from /twitterdata/noisewordcounts/part-r-00000
delayed        2749
entertainment  2064
delay          1720
bags           1718
service        1718
time           1375
vacation       1031
food           1030
wifi           1030
connection      688
seattle         688
bag             687

These results are more useful than the previous output, which included the noise words. However, the analysts have noticed that semantically equivalent words are counted separately. For example, in the results shown above, "delayed" and "delay" both indicate that customers are concerned about delays, while "bags" and "bag" both indicate concerns about baggage.

Using a User-Defined Function to Find Similar Words

After some consideration, the analysts realized that multiple words with the same stem (for example "delay" and "delayed"), different words with the same meaning (for example "baggage" and "luggage"), and mistyped or misspelled words (for example "bagage") will skew the results, potentially causing topics that are important to customers to be obscured because many differing terms are used as synonyms. To address this problem, the analysts initially decided to try to find similar words by calculating the Jaro distance between every word combination in the source data. The Jaro distance is a value between 0 and 1 used in statistical analysis to indicate the level of similarity between two words. For more information about the algorithm used to calculate Jaro distance, see Jaro-Winkler distance on Wikipedia.

There is no built-in Pig function to calculate Jaro distance, so the analysts recruited the help of a Java developer to create a user-defined function. The code for the user-defined function is shown below:

Java
package WordDistanceUDF;
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class WordDistance extends EvalFunc<Double> {
  public Double exec(final Tuple input) throws IOException {
    if (input == null || input.size() == 0) return null;
    try {
      final String firstWord = (String)input.get(0);
      final String secondWord = (String)input.get(1);
      return getJaroSimilarity(firstWord, secondWord);
    }
    catch (Exception e) {
      throw new IOException("Caught exception processing input row ", e);
    }
  }
  private Double getJaroSimilarity(final String firstWord, final String secondWord) {
    double defMismatchScore = 0.0;
    if (firstWord != null && secondWord != null) {
      // get half the length of the shorter string, rounded up
      // (this is the window used for acceptable transpositions)
      final int halflen = Math.min(firstWord.length(), secondWord.length()) / 2 + 1;

      // get common characters
      final StringBuilder comChars = getCommonCharacters(firstWord, secondWord, halflen);
      final int commonMatches = comChars.length();

      // check for zero in common
      if (commonMatches == 0) {
        return defMismatchScore;
      }

      final StringBuilder comCharsRev = getCommonCharacters(secondWord, firstWord, halflen);

      // check for same length common strings, returning 0.0 if not the same
      if (commonMatches != comCharsRev.length()) {
        return defMismatchScore;
      }

      // get the number of transpositions
      int transpositions = 0;
      for (int i = 0; i < commonMatches; i++) {
        if (comChars.charAt(i) != comCharsRev.charAt(i)) {
          transpositions++;
        }
      }

      // calculate the Jaro metric
      transpositions = transpositions / 2;
      defMismatchScore = commonMatches / (3.0 * firstWord.length())
                       + commonMatches / (3.0 * secondWord.length())
                       + (commonMatches - transpositions) / (3.0 * commonMatches);
    }
    return defMismatchScore;
  }
  private static StringBuilder getCommonCharacters(final String firstWord,
      final String secondWord, final int distanceSep) {
    if (firstWord != null && secondWord != null) {
      final StringBuilder returnCommons = new StringBuilder();
      final StringBuilder copy = new StringBuilder(secondWord);
      for (int i = 0; i < firstWord.length(); i++) {
        final char character = firstWord.charAt(i);
        boolean foundIt = false;
        for (int j = Math.max(0, i - distanceSep);
             !foundIt && j < Math.min(i + distanceSep, secondWord.length());
             j++) {
          if (copy.charAt(j) == character) {
            foundIt = true;
            returnCommons.append(character);
            copy.setCharAt(j, '#');
          }
        }
      }
      return returnCommons;
    }
    return null;
  }
}

This function was compiled, packaged as WordDistanceUDF.jar, and saved in the Pig installation folder on the HDInsight cluster. Next, the analysts created a Pig Latin script that generates a list of all non-noise word combinations in the tweet source data and uses the function to calculate the Jaro distance between each combination. This script is shown below:

Pig Latin
-- register the custom jar
REGISTER WordDistanceUDF.jar;

-- load tweets
Tweets = LOAD '/twitterdata/tweets' AS (date, id, author, tweet);
-- split tweet into words
TweetWords = FOREACH Tweets GENERATE FLATTEN(TOKENIZE(tweet)) AS word;

-- clean words by removing punctuation
CleanWords = FOREACH TweetWords GENERATE LOWER(REGEX_EXTRACT(word, '[a-zA-Z]*', 0)) AS word;

-- filter text to eliminate empty strings
FilteredWords = FILTER CleanWords BY word != '';

-- load the noise words file
NoiseWords = LOAD '/twitterdata/noisewords.txt' AS noiseword:chararray;

-- join the noise words file using a left outer join
JoinedWords = JOIN FilteredWords BY word LEFT OUTER, NoiseWords BY noiseword USING 'replicated';

-- filter the joined words so that only words
-- with no matching noise word remain
UsefulWords = FILTER JoinedWords BY noiseword IS NULL;

-- group by word
GroupedWords = GROUP UsefulWords BY (word);
-- generate the list of distinct words
WordList = FOREACH GroupedWords GENERATE group AS word:chararray;

-- sort alphabetically by word
SortedWords = ORDER WordList BY word;
-- create a duplicate set
SortedWords2 = FOREACH SortedWords GENERATE word AS word:chararray;

-- cross join to create every combination of word pairs
CrossWords = CROSS SortedWords, SortedWords2;

-- find the Jaro distance for each pair
WordDistances = FOREACH CrossWords GENERATE
    SortedWords::word AS word1:chararray,
    SortedWords2::word AS word2:chararray,
    WordDistanceUDF.WordDistance(SortedWords::word, SortedWords2::word) AS jarodistance:double;

-- filter out word pairs with a Jaro distance less than 0.9
MatchedWords = FILTER WordDistances BY jarodistance >= 0.9;

-- store the results as a file
STORE MatchedWords INTO '/twitterdata/matchedwords';
Notice that the script filters the results to include only word combinations with a Jaro distance value of 0.9 or higher.
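To make the UDF's behavior concrete, the same Jaro calculation can be sketched in Python (a port of the Java logic above, handy for checking individual scores locally):

```python
def jaro_similarity(first, second):
    """Jaro similarity between two words, mirroring the Java UDF:
    find common characters within a window, then combine the match
    and transposition counts into a score between 0 and 1."""
    if not first or not second:
        return 0.0
    # half the length of the shorter word, rounded up (the match window)
    window = min(len(first), len(second)) // 2 + 1

    def common_chars(a, b):
        found, remaining = [], list(b)
        for i, ch in enumerate(a):
            lo, hi = max(0, i - window), min(i + window, len(b))
            for j in range(lo, hi):
                if remaining[j] == ch:
                    found.append(ch)
                    remaining[j] = "#"   # consume the matched character
                    break
        return found

    c1 = common_chars(first, second)
    m = len(c1)
    if m == 0:
        return 0.0
    c2 = common_chars(second, first)
    if m != len(c2):
        return 0.0
    transpositions = sum(1 for a, b in zip(c1, c2) if a != b) // 2
    return (m / (3.0 * len(first))
            + m / (3.0 * len(second))
            + (m - transpositions) / (3.0 * m))
```

For example, this reproduces scores such as bag/bags at roughly 0.917 and delay/delayed at roughly 0.905, matching the values in the output file shown below.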
To view the results of running the script, the analysts used the Data Explorer add-in in Excel to connect to the HDInsight cluster and create a table from the part-r-00000 file that was generated in the /twitterdata/matchedwords folder, as shown in Figure 3.
Figure 3
Using the Data Explorer add-in to explore Pig output

Partial results in the output file are shown below.

Extract from /twitterdata/matchedwords/part-r-00000
bag      bag      1.0
bag      bags     0.9166666666666665
baggage  baggage  1.0
bagage   bagage   1.0
bags     bag      0.9166666666666665
bags     bags     1.0
delay    delay    1.0
delay    delayed  0.9047619047619047
delay    delays   0.9444444444444444
delayed  delay    0.9047619047619047
delayed  delayed  1.0
delays   delay    0.9444444444444444
delays   delays   1.0
seat     seat     1.0
seat     seats    0.9333333333333333
seated   seated   1.0
seats    seat     0.9333333333333333
seats    seats    1.0

The results include a row for every word matched to itself with a Jaro score of 1.0, and two rows for each combination of words with a score of 0.9 or above (one row for each possible word order). Close examination of these results reveals that, while the code has successfully matched some words appropriately (for example, bag/bags, delay/delays, delay/delayed, and seat/seats), it has failed to match some others (for example, delays/delayed, bag/baggage, bagage/baggage, and seat/seated).

The analysts experimented with the Jaro value used to filter the results, lowering it to achieve more matches. However, in doing so they found that the number of false positives increased. For example, lowering the filter score to 0.85 matched delays to delayed and seats to seated, but also matched seated to seattle.

The results we obtained, and the attempts to improve them by adjusting the matching algorithm, reveal just how difficult it is to infer semantics and sentiment from free-form text. In the end we realized that it would require some type of human intervention, in the form of a manually maintained synonyms list.
Using a Lookup File to Combine Matched Words

The results generated by the user-defined function were not sufficiently accurate to rely on in a fully automated process for matching words, but they did provide a useful starting point for creating a custom lookup file of words that should be considered synonyms when analyzing the tweet contents. The analysts reviewed and extended the results produced by the function in Excel, and created the following tab-delimited text file, which was saved as synonyms.txt in the /twitterdata folder on the HDInsight cluster.

Extract from /twitterdata/synonyms.txt
bag    bag
bag    bags
bag    bagage
bag    baggage
bag    luggage
bag    lugage
delay  delay
delay  delayed
delay  delays
drink  drink
drink  drinks
drink  drinking
drink  beverage
drink  beverages
food   food
food   feed
food   eat
food   eating
movie  movie
movie  movies
movie  film
movie  films
seat   seat
seat   seats
seat   seated
seat   seating
seat   sit
seat   sitting
seat   sat

The first column in this file contains the leading values that should be used to aggregate the results. The second column contains the synonyms that should be converted to the leading values for aggregation. With this synonyms file in place, the Pig Latin script used to count the words in the tweet contents was modified to use a LEFT OUTER JOIN between the words in the source tweets (after filtering out the noise words) and the words in the synonyms file to find the leading value for each matched word. A UNION clause is then used to combine the matched words with words that are not present in the synonyms file. The revised Pig Latin script is shown here.

Pig Latin
-- load tweets
Tweets = LOAD '/twitterdata/tweets' AS (date, id, author, tweet);
-- split tweet into words
TweetWords = FOREACH Tweets GENERATE FLATTEN(TOKENIZE(tweet)) AS word;

-- clean words by removing punctuation
CleanWords = FOREACH TweetWords GENERATE LOWER(REGEX_EXTRACT(word, '[a-zA-Z]*', 0)) AS word;

-- filter text to eliminate empty strings
FilteredWords = FILTER CleanWords BY word != '';

-- load the noise words file
NoiseWords = LOAD '/twitterdata/noisewords.txt' AS noiseword:chararray;

-- join the noise words file using a left outer join
JoinedWords = JOIN FilteredWords BY word LEFT OUTER, NoiseWords BY noiseword USING 'replicated';

-- filter the joined words so that only words with
-- no matching noise word remain
UsefulWords = FILTER JoinedWords BY noiseword IS NULL;
-- match synonyms
Synonyms = LOAD '/twitterdata/synonyms.txt' AS (leadingvalue:chararray, synonym:chararray);
WordsAndSynonyms = JOIN UsefulWords BY word LEFT OUTER, Synonyms BY synonym USING 'replicated';
UnmatchedWords = FILTER WordsAndSynonyms BY synonym IS NULL;
UnmatchedWordList = FOREACH UnmatchedWords GENERATE word;
MatchedWords = FILTER WordsAndSynonyms BY synonym IS NOT NULL;
MatchedWordList = FOREACH MatchedWords GENERATE leadingvalue AS word;
AllWords = UNION MatchedWordList, UnmatchedWordList;
-- group by word
GroupedWords = GROUP AllWords BY (word);

-- count mentions per group
CountedWords = FOREACH GroupedWords GENERATE group AS word, COUNT(AllWords) AS count;

-- sort by count
SortedWords = ORDER CountedWords BY count DESC;

-- limit results to the top 100
Top100Words = LIMIT SortedWords 100;

-- store the results as a file
STORE Top100Words INTO '/twitterdata/synonymcounts';
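The join-and-union logic in this script amounts to a simple substitution: drop the noise words, then replace each remaining word with its leading value before counting. A Python sketch (with tiny, hypothetical stand-ins for the noisewords.txt and synonyms.txt files) illustrates the idea:

```python
from collections import Counter

# Tiny, hypothetical stand-ins for noisewords.txt and synonyms.txt
NOISE = {"a", "the", "my", "to", "of", "flight"}
SYNONYMS = {"bags": "bag", "bagage": "bag", "baggage": "bag",
            "delayed": "delay", "delays": "delay"}

def count_topics(words):
    """Drop noise words (the LEFT OUTER JOIN plus IS NULL filter),
    fold synonyms into their leading value (the synonym JOIN plus
    UNION), then count occurrences per leading value."""
    useful = (w for w in words if w not in NOISE)
    folded = (SYNONYMS.get(w, w) for w in useful)
    return Counter(folded)

counts = count_topics(["my", "flight", "delayed", "bags", "delay", "bagage"])
```

A dictionary lookup expresses in one line what the Pig script achieves with a replicated join followed by a union, because Pig has no direct equivalent of an in-memory lookup table.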
Partial results from this script are shown below.

Extract from /twitterdata/synonymcounts/part-r-00000
delay          4812
bag            3777
seat           2404
service        1718
movie          1376
vacation       1375
time           1375
entertainment  1032
food           1030
wifi           1030
connection      688
seattle         688
drink           687
In these results, semantically equivalent words are combined into a single leading value for aggregation. For example, the counts for seat, seats, and seated are combined as a single count for seat. This makes the results more useful in terms of identifying topics that are important to customers. For example, it is apparent from these results that the top three subjects that customers have tweeted about are delays, bags, and seats.

Moving from Topic Lists to Sentiment Analysis

At this point the team has a valuable and seemingly valid list of the most important topics that customers are discussing. However, the results do not include any indication of sentiment. While it's reasonably safe to assume that the majority of people discussing delays (the top term) will not be happy customers, it's more difficult to detect the sentiment for a top item such as bag or seat. It's not possible to tell from this data whether most customers are happy or are complaining.

If you are considering implementing a sentiment analysis mechanism, stop and think about how you might categorize text extracts such as "Still in Chicago airport due to flight delays - what a great trip this is turning out to be!" It's not as easy as it looks, and the results are only ever likely to be approximations of reality.

Sentiment analysis is difficult. As an extreme example, a tweet containing text such as "no wifi, thanks for that Blue Yonder" could express either a positive or a negative sentiment. There are solutions available that are specially designed to accomplish sentiment analysis, and the team might also investigate using the rudimentary sentiment analysis capabilities of the Twitter search API (see Appendix A for more details). However, they decided that the list of topics provided by the analysis they've done so far would probably be better employed in driving more focused customer research, such as targeted questionnaires and online surveys, perhaps through external agencies that specialize in this field.
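To see why the team was cautious, consider a deliberately naive lexicon-based scorer (a hypothetical sketch with invented word lists, not a technique used in the Blue Yonder solution). It simply counts positive and negative words, and both of the sarcastic tweets quoted above come out as "positive":

```python
# Invented word lists; real sentiment lexicons are far larger and subtler
POSITIVE = {"great", "thanks", "love"}
NEGATIVE = {"delayed", "lost", "terrible"}

def naive_sentiment(tweet):
    """Score a tweet as positive-word count minus negative-word count;
    a result greater than 0 reads as 'positive' sentiment."""
    words = tweet.lower().replace(",", " ").replace("!", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Both sarcastic complaints score as 'positive' under the word counts
score1 = naive_sentiment("Still in Chicago airport due to flight delays - what a great trip this is turning out to be!")
score2 = naive_sentiment("no wifi, thanks for that Blue Yonder")
# A literal complaint is scored correctly
score3 = naive_sentiment("my bag was lost and the service was terrible")
```

The word counts are blind to sarcasm, negation, and context, which is exactly why lexicon counting alone is a poor basis for sentiment decisions.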
Phase 3 - Stabilizing the Solution

With the data processing solution refined to produce useful insights, the analysts at Blue Yonder Airlines decided to stabilize the solution to make it more robust and repeatable. Key concerns about the solution include the dependencies on hard-coded file paths, the data schemas in the Pig Latin scripts, and the data ingestion process. With the current solution, any changes to the location or format of the tweets.txt, noisewords.txt, or synonyms.txt files would break the current scripts. Such changes are particularly likely to occur if the solution gains acceptance among users and Hive tables are created on top of the files to provide a more convenient query interface.

Using HCatalog to Create Hive Tables

HCatalog provides an abstraction layer between data processing tools such as Hive, Pig, and custom map/reduce code, and the file system storage of the data. By using HCatalog to manage data storage, the analysts hoped to remove location and format dependencies in Pig Latin scripts, while making it easy for users to access the analytical data directly through Hive. The analysts had already created a table named Tweets for the source data, and now decided to create tables for the noise words file and the synonyms file. Additionally, they decided to create a table for the results generated by the WordCount.pig script to make it easier to perform further queries against the aggregated word counts. To accomplish this the analysts created the following script file containing HiveQL statements, which they saved as CreateTables.hcatalog on the HDInsight server.
HiveQL

CREATE EXTERNAL TABLE NoiseWords (noiseword STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/noisewords';

CREATE EXTERNAL TABLE Synonyms (leadingvalue STRING, synonym STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/synonyms';

CREATE EXTERNAL TABLE TopWords (word STRING, count BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/twitterdata/topwords';

To execute this script, the following command was executed in the HCatalog installation directory.

Command Line

hcat.py -f C:\Scripts\CreateTables.hcatalog

After the script had successfully created the new tables, the noisewords.txt and synonyms.txt files were moved into the folders used by the corresponding tables by executing the following commands.

Command Line

hadoop dfs -mv /twitterdata/noisewords.txt /twitterdata/noisewords/noisewords.txt
hadoop dfs -mv /twitterdata/synonyms.txt /twitterdata/synonyms/synonyms.txt

Using HCatalog in a Pig Latin Script

Moving the noisewords.txt and synonyms.txt files made the data available by querying the corresponding Hive tables, but broke the WordCount.pig script, which includes a hard-coded path to these files. The script also uses hard-coded paths to load the tweets.txt file from the folder used by the Tweets table, and to save the results. This dependency on hard-coded paths makes the data processing solution vulnerable to changes in the way that data is stored. For example, if at a later date an administrator modifies the Tweets table to partition the data, the code to load the source data would no longer work. Additionally, if the source data was modified to use a different format or schema, the script would need to be modified accordingly.

One of the major advantages of using HCatalog is that I can relocate data as required without breaking all the scripts and code that access the files.
To eliminate the dependency, the analysts modified the WordCount.pig script to use HCatalog classes to load and save data in Hive tables instead of accessing the source files directly. The modified script is shown below.

Pig Latin

-- load tweets using HCatalog
Tweets = LOAD 'Tweets' USING org.apache.hcatalog.pig.HCatLoader();

-- split tweet into words
TweetWords = FOREACH Tweets GENERATE FLATTEN(TOKENIZE(tweet)) AS word;

-- clean words by removing trailing periods
CleanWords = FOREACH TweetWords GENERATE LOWER(REGEX_EXTRACT(word, '[a-zA-Z]*', 0)) AS word;

-- filter text to eliminate empty strings
FilteredWords = FILTER CleanWords BY word != '';

-- load the noise words file using HCatalog
NoiseWords = LOAD 'NoiseWords' USING org.apache.hcatalog.pig.HCatLoader();

-- join the noise words file using a left outer join
JoinedWords = JOIN FilteredWords BY word LEFT OUTER, NoiseWords BY noiseword USING 'replicated';

-- filter the joined words so that only words with
-- no matching noise word remain
UsefulWords = FILTER JoinedWords BY noiseword IS NULL;

-- match synonyms using data loaded through HCatalog
Synonyms = LOAD 'Synonyms' USING org.apache.hcatalog.pig.HCatLoader();
WordsAndSynonyms = JOIN UsefulWords BY word LEFT OUTER, Synonyms BY synonym USING 'replicated';
UnmatchedWords = FILTER WordsAndSynonyms BY synonym IS NULL;
UnmatchedWordList = FOREACH UnmatchedWords GENERATE word;
MatchedWords = FILTER WordsAndSynonyms BY synonym IS NOT NULL;
MatchedWordList = FOREACH MatchedWords GENERATE leadingvalue AS word;
AllWords = UNION MatchedWordList, UnmatchedWordList;

-- group by word
GroupedWords = GROUP AllWords BY (word);

-- count mentions per group
CountedWords = FOREACH GroupedWords GENERATE group AS word, COUNT(AllWords) AS count;

-- sort by count
SortedWords = ORDER CountedWords BY count DESC;

-- limit results to the top 100
Top100Words = LIMIT SortedWords 100;
-- store the results using HCatalog
STORE Top100Words INTO 'TopWords' USING org.apache.hcatalog.pig.HCatStorer();

The script no longer includes any hard-coded paths to data files, or schemas for the data as it is loaded or stored; instead it uses HCatalog to reference the Hive tables created previously. To execute this script, the following commands are used to set the HCAT_HOME environment variable and run the script in Pig with HCatalog support.

Command Line

SET HCAT_HOME=C:\apps\dist\hcatalog-0.4.1
pig -useHCatalog C:\Scripts\WordCount.pig

The results of the script are stored in the TopWords table, and can be viewed by executing a HiveQL query such as the following example.

HiveQL

SELECT * FROM TopWords;

Finalizing the Data Ingestion Process

Having stabilized the solution, the team finally examined ways to improve and automate the way that data is extracted from Twitter and loaded into Windows Azure blob storage. The solution they adopted uses SQL Server Integration Services with a package that connects to the Twitter streaming API every day to download new tweets, and uploads these as a file to blob storage. For more information about the three approaches that the Blue Yonder Airlines team experimented with for ingesting the source data, see Appendix A of this guide.

Fine Tuning the Solution

The Blue Yonder Airlines team also plans to further refine the process over time by experimenting with some common optimization techniques. Areas that the team will investigate include:

- Using the SEQUENCEFILE format for the data instead of text. For more information about using Sequence Files, see the section Resolving Performance Limitations by Optimizing the Query in Chapter 4 of this guide.
- Applying compression to the data in the intermediate stages of the query, and compressing the source data. For more information about using compression, see the section Resolving Performance Limitations by Compressing the Data in Chapter 4 of this guide.
- Loading the data directly into Windows Azure blob storage instead of using the LOAD DATA command. For more information about loading Hive tables, see the section Loading Data into Hive Tables in Chapter 4 of this guide.
- Reviewing the columns used in the query. For example, using only the Tweet column rather than storing other columns that are not used in the current process. This might be possible by adding a Hive staging table and preprocessing the data. However, the team needs to consider that the data in other columns may be useful in the future.
Consuming the Results

Now that the solution has generated some useful information, the team can decide how to consume the results for analysis or reporting. Storing the results of the Pig data processing jobs generated by the WordCount.pig script in a Hive table makes it easy for analysts to consume the data from Excel.

In time we might create our own custom tools to analyze the results and generate business information by combining it with other data we have available in our BI systems. We are also considering using a custom application to generate charts and graphics that can be embedded in other applications. Because we can access the results in Hive using the ODBC driver, there is almost no limit to what we can achieve!

To enable access to Hive from Excel, the ODBC driver for Hive has been installed on all of the analysts' workstations and a data source name (DSN) has been created for Hive on the HDInsight cluster. The analysts can use the Data Connection Wizard in Excel to connect to HDInsight using the DSN, and import data from the TopWords table, as shown in Figure 4.
Figure 4
Using the Data Connection Wizard to access a Hive table from Excel

After the data has been imported into a worksheet, the analysts can use the full capabilities of Excel to explore and visualize it, as shown in Figure 5.
Figure 5
Visualizing results in Excel

Future integration of the data with the company's existing BI system is planned. To see examples of how you can integrate data from HDInsight with an existing BI system, see the scenario Enterprise BI Integration in this guide and in the related Hands-on Labs available for this guide.

Summary

In this scenario you have seen an example of a Big Data analysis process that includes obtaining source data and uploading it to Windows Azure blob storage; processing the data by using custom map/reduce code, Pig, and Hive; stabilizing the solution using HCatalog; and analyzing the results of the data processing in the HDInsight interactive console and in Microsoft Excel.

The scenario demonstrates how Big Data analysis is often an iterative process. It starts with open-ended exploration of the data, which is then refined and stabilized to create an analytical solution. In the scenario you've seen in this chapter the exploration led to information about the topics that are important to customers. From here, further iterations could aggregate the word counts by date to identify trends over periods of time, or more complex semantic analysis could be performed on the tweet content to determine customer sentiment.

A key benefit of using HDInsight for this kind of data exploration is that the Windows Azure-based cluster can be created, expanded, and deleted as required, enabling you to manage costs by paying for only the processing and storage resources that you use, and eliminating the need to configure and manage your own hardware and software resources or modify existing internal BI infrastructure.

If you would like to work through the solution to this scenario yourself, you can download the files and a step-by-step guide. These are included in the associated Hands-on Labs available from http://wag.codeplex.com/releases/view/103405.
Scenario 2: Extract, Transform, Load

In this scenario HDInsight is used to perform an Extract, Transform, and Load (ETL) process on data to validate it for correctness, detect possible fraudulent transactions, and then populate a database table. Specifically, this scenario demonstrates how to:

- Use custom map/reduce components to extract and transform XML formatted data.
- Use Hive queries to validate data and detect anomalies.
- Use Sqoop to load the final data into a Windows Azure SQL Database table.
- Encapsulate ETL tasks in an Oozie workflow.
- Use the Microsoft .NET SDK for Hadoop to automate an ETL workflow in HDInsight.
This chapter describes the scenario in general, and the solutions applied to meet the goals of the scenario. If you would like to work through the solution to this scenario yourself, you can download the files and a step-by-step guide. These are included in the related Hands-on Labs available from http://wag.codeplex.com/releases/view/103405.

Introduction to Northwind Traders

This scenario is based on a fictitious company named Northwind Traders that has stores located in several cities in North and South America. These stores sell a wide range of goods to trade customers that have accounts with the company, and the stores also accept payment from retail customers in cash and as credit/debit card payments. Northwind is a franchise operation, and so does not directly own or control the stores. Each store has its own internal accounting mechanisms, but must report transactions daily to Northwind so that it can monitor the operations of the business centrally, understand trading patterns, and plan for future growth.

The administrative team at the Northwind Traders head office is responsible for processing the lists of transactions and storing them in the company's central database. As part of this processing, the team will ensure that the data is valid and also attempt to identify any fraudulent activity.

ETL Process Goals and Data Sources

Northwind is primarily a management and marketing organization, and it has a considerable presence on the web. To minimize costs and to accommodate future business expansion, the management team at Northwind decided to take advantage of the elastic scalability of Windows Azure, and now uses a range of Windows Azure services to host all of its IT operations. This includes using Windows Azure SQL Database to store the corporate data. Each night, all of the Northwind stores submit a summary of the day's transactions electronically.
This is an XML document that is captured by the head office and stored in Windows Azure blob storage in an HDInsight cluster managed by the administrators at the head office. The head office maintains a list of stores, together with the opening hours of each one, and uses this information to generate a unique store ID for each transaction before loading the data into the head office SQL Database. As part of the processing of each transaction file, the head office also wants to verify that the sales occurred during store opening hours in order to detect any fraudulent activity. In addition, any sale over $10,000 in total value is to be flagged in the database so that audit reports can identify these transactions for possible investigation. After the data has been transformed, validated, and loaded into the database the head office team will be able to use Windows Azure SQL Reporting and other reporting tools to examine the data, generate reports, create visualizations, power business dashboards, and provide other management information. These tasks are not explored in this scenario, which concentrates only on the ETL process.
For more information about how you can generate reports and visualizations from data after loading it into a database, see Chapter 5 of this guide.

The ETL Workflow

To understand the ETL workflow requirements, the administrative team at Northwind Traders started by examining the source format of the store transaction data, and the schema of the target Windows Azure SQL Database table, to identify the tasks that must be performed to transfer the data from source to target.

The Source Schema

The source files contain XML representations of transactions for a specific store on a specific day, as shown in the following example.

XML

<?xml version="1.0" encoding="utf-8"?>
<transactions>
  <transaction>
    <date>1/1/2013 10:25:56 AM</date>
    <location>1AREPE</location>
    <cashsale>
      <saleAmount>8849.38</saleAmount>
    </cashsale>
  </transaction>
  <transaction>
    <date>1/1/2013 10:26:08 AM</date>
    <location>1AREPE</location>
    <storecardfirstissue>
      <cardNumber>xxxx-xxxx-xxxx-5809</cardNumber>
    </storecardfirstissue>
  </transaction>
  <transaction>
    <date>1/1/2013 10:26:41 AM</date>
    <location>1AREPE</location>
    <creditaccountsale>
      <account>891116</account>
      <accountLoc>1LIMPE</accountLoc>
      <saleAmount>7295.49</saleAmount>
    </creditaccountsale>
  </transaction>
</transactions>

The XML source includes a <transaction> element for each transaction, which includes the date and time in a <date> element and the store location in a coded form that specifies the region-specific store number, the city, and the country or region. There are multiple possible transaction types, including cash sales, credit sales, and store card issues, each with its own set of properties.
Notice that the different types of transactions have different schemas. For example, store card sales include the card number, while credit account sales include the account number. Cash sales include neither of these values. The processing mechanism we implement must be able to cope with this variability and generate output that conforms to our target schema.
The Target Schema

The target table for the ETL process is a Windows Azure SQL Database table created with the following Transact-SQL code.

Transact-SQL

CREATE TABLE [dbo].[StoreTransaction] (
  [TransactionLocationId] [int] NULL,
  [TransactionDate] [date] NULL,
  [TransactionTime] [time](7) NULL,
  [TransactionType] [nvarchar](256) NULL,
  [Account] [nvarchar](256) NULL,
  [AccountLocationId] [int] NULL,
  [CardNumber] [nvarchar](256) NULL,
  [Amount] [float] NULL,
  [Description] [nvarchar](256) NULL,
  [Audit] [tinyint] NULL
);
GO
CREATE CLUSTERED INDEX [IX_LocationId_Date_Time]
ON [dbo].[StoreTransaction] (
  [TransactionLocationId] ASC,
  [TransactionDate] ASC,
  [TransactionTime] ASC
);
GO

Notice that the target table includes a numeric TransactionLocationId column (of type integer), so the alphanumeric store code in the source data must be converted to a unique numeric identifier during the ETL process. The process can look up the store identifiers based on the country or region, city, and local store number in local reference tables. Additionally, the target table includes separate TransactionDate and TransactionTime columns for the date and time of the transaction, which must be derived from the <date> value in the source XML. Transaction records that have missing values for the store location or the date cannot be loaded into the target table, and must be redirected to an alternative location where they can be reviewed by a data steward.

Finally, notice that the target table includes an Audit column that indicates rows that should be audited by a manager, based on business rules about transactions that occur outside standard store opening hours or that involve amounts above a threshold of $10,000. The ETL process must identify these rows and set the Audit flag value accordingly.
A clustered index has been created on the table. In addition to improving query performance, a clustered index is required for tables that will be loaded using Sqoop.
The ETL Tasks

The transformations that must be applied to the data as it flows through the ETL process can be performed by loading the data into a sequence of Hive tables, and using HiveQL queries to filter and transform the data from one table as it is loaded into the next. To use this approach the XML data must be parsed and loaded into the first Hive table in the sequence, and then a Sqoop command can be used at the end of the process to transfer the data in the final table into the Windows Azure SQL Database table. The sequence of tasks for the ETL workflow is shown in Figure 1.
Figure 1
An overview of the Northwind ETL process

Parsing the XML Source Data

The first task is to parse the XML source data so that it can be exposed for processing as a Hive table. The developers at Northwind Traders considered how this could be accomplished. The source data files contain a series of <transaction> elements within a <transactions> root element. However, the files are typically too large for a standard XML parser to read into memory. This problem could be overcome by using a streaming XML parser, or by reading the document in chunks as partial XML segments and attempting to parse each chunk. However, a single error in the file, or a chunk boundary that does not fall between <transaction> elements, may cause loss of data and unpredictable results. It also adds complexity to the solution.

The developers decided that a better approach would be to split the incoming data into separate valid XML documents, one for each transaction, where each document contains a single <transaction> element as the root element. Each of these documents can then be parsed using a standard XML parser, and the code can handle any that are invalid without affecting the processing of subsequent transactions. To split the source files into individual transaction documents the developers created a custom input formatter, written in Java. This input formatter can then be used in a range of ways, depending on future requirements. For example, it could be called from a user-defined function that is used in a Hive or a Pig query, or it could be used in custom map components. It could also be used directly in a Hive query that uses XPath to parse the contents of the XML documents. After considering the options the developers decided that, in order to provide greater flexibility and extensibility, they would create custom map and reduce components written in Java.
The mapper component uses the custom input formatter to split the file into individual transaction records (each a valid XML document), and then uses an XML parser to extract the values of the child elements in each separate transaction document. The extracted values are written out to a file in tab-delimited format. Figure 2 shows an overview of the stages of parsing the XML document. The subsequent sections of this chapter describe each stage of this process.
Figure 2
Parsing the XML into a Hive table using a custom input formatter and mapper
When you are designing custom components for map/reduce jobs it's worth considering whether they might be reusable in similar scenarios, and as you refine your data processes. Of course, you should balance the cost and effort required to implement reusability with the possible gains it might provide.

The final stage for the current scenario, as shown in Figure 2, is to define a Hive table over the collection of tab-delimited files, and execute queries against this table to process the data as required. This approach means that the developers can use the tab-delimited file for other purposes if required, including defining different Hive schemas over it or processing it using Pig or another query mechanism.

The Custom Input Formatter

The custom input formatter splits the transaction data into records based on the <transaction> elements in the XML source files. It outputs a series of records, one for each <transaction> element. With the exception of the initial record, and records where there is an error in the source data, each of these records is a valid XML document with the <transaction> root element, and contains the XML child elements and values in the source.
Input formatters do not create new documents or files, but simply instruct the mapper on how to read the input file. The custom input formatter used in this example tells a map/reduce job how to read an XML document, in the same way that the default text input formatter tells any map/reduce job how to read a text file.

The custom input formatter is implemented as a class named XMLInputFormat that extends FileInputFormat<LongWritable, Text> and overrides the createRecordReader method to return an instance of the custom XmlRecordReader class.

Java

package northwindETL;
...
public class XMLInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit inputSplit, TaskAttemptContext context) {
    return new XmlRecordReader();
  }

The custom XmlRecordReader class extends RecordReader<LongWritable, Text>, and overrides its members to generate the required output from the source data. The XmlRecordReader class reads the source data and splits it into sections, then outputs each section as a separate record.

Java

public static class XmlRecordReader extends RecordReader<LongWritable, Text> {
  FileSplit fSplit;
  byte[] delimiter;
  long start;
  long length;
  long counter = 0;
  FSDataInputStream inputStream;
  DataOutputBuffer buffer = new DataOutputBuffer();
  LongWritable key = new LongWritable();
  Text value = new Text();
  ...

The override of the initialize method obtains the FileSplit instance, reads the configuration, opens the file, and sets the file pointer at the start of the input file.

Java

@Override
public void initialize(InputSplit split, TaskAttemptContext context)
    throws IOException, InterruptedException {
  String endTag = "</transaction>";
  delimiter = endTag.getBytes("UTF8");
  fSplit = (FileSplit) split;

  // Get position of the first byte in the file to process
  start = fSplit.getStart();

  // Get length
  length = fSplit.getLength();

  // Get the file containing the split's data
  Path file = fSplit.getPath();

  FileSystem fs = file.getFileSystem(context.getConfiguration());
  inputStream = fs.open(file);
  inputStream.seek(start);
}

The override of the nextKeyValue method, which is executed when the RecordReader must generate the next record, reads the source data and outputs the section of the source data up to and including the next </transaction> tag.

Java

@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
  int delimiterLength = delimiter.length;
  ...
Notice that, for simplicity in this example, the input formatter actually just divides the input into records that end with the closing </transaction> tag. This means that the first record will start with the initial <transactions> root element tag. You will see later how this is handled in the mapper component.

The code also overrides several other methods, including getCurrentKey and getCurrentValue to get a key and value from the data, getProgress to indicate how much of the data has been processed so far, and close to close the input file.
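The core of the nextKeyValue scan — reading bytes until the closing </transaction> tag has been consumed, then emitting everything read so far as one record — can be sketched without the Hadoop dependencies. The class below is a simplified, hypothetical version for illustration, not the actual RecordReader code:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified version of the scan performed by XmlRecordReader.nextKeyValue:
// consume bytes until the delimiter "</transaction>" has been matched, then
// emit everything read so far (delimiter included) as one record.
public class TransactionSplitter {
    private static final byte[] DELIMITER = "</transaction>".getBytes();

    public static List<String> split(byte[] data) {
        List<String> records = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        int matched = 0; // how many delimiter bytes have matched so far
        for (byte b : data) {
            current.append((char) b);
            if (b == DELIMITER[matched]) {
                matched++;
                if (matched == DELIMITER.length) { // full delimiter seen
                    records.add(current.toString());
                    current.setLength(0);
                    matched = 0;
                }
            } else {
                matched = (b == DELIMITER[0]) ? 1 : 0;
            }
        }
        // Trailing bytes after the last delimiter (for example the closing
        // </transactions> root tag) are discarded in this simplified sketch.
        return records;
    }

    public static void main(String[] args) {
        String xml = "<transactions><transaction>a</transaction>"
                   + "<transaction>b</transaction></transactions>";
        // The first record includes the <transactions> root tag, exactly as
        // described in the text above.
        System.out.println(split(xml.getBytes()).size()); // prints 2
    }
}
```

As the note above explains, the first record still carries the leading <transactions> tag; the mapper strips that extraneous text before parsing.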
The complete source code for the custom input formatter, and for the other Java components described in this chapter, is included in the Hands-on Labs associated with this guide. To obtain the labs go to http://wag.codeplex.com/releases/view/103405. The files are in the folder named Lab 2 - Extract Transform Load\Labfiles\CustomCode\src\northwindETL.
The Custom Mapper Component

The custom input formatter converts the source file into a series of records, each of which should be a valid XML document with a <transaction> root element. This allows the mapper component to use an XML parser to read each document in turn and generate the output required. In this scenario, the output is tab-delimited text records, each of which contains the values from one <transaction> document.

The custom ParseTransactionMapper class extends the Hadoop Mapper class and implements the map method that is called by Hadoop to generate the output from the map stage. The map method instantiates a DocumentBuilder (which has a StringReader pointing to the input XML as the InputSource), parses the document into a new Document instance, and instantiates a StringBuilder to hold the output that it will generate. Notice how the component strips any extraneous leading text from the record before parsing it. This includes the initial <transactions> root element tag, and any text that may remain from the previous record if there was an error in the data.

Java

package northwindETL;
...
public class ParseTransactionMapper extends Mapper<Object, Text, Text, Text> {
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Remove any extraneous leading text
    String data = value.toString();
    int i = data.indexOf("<transaction>");
    String xml = data.substring(i, data.length());
    try {
      DocumentBuilder db = dbf.newDocumentBuilder();
      InputSource is = new InputSource();
      is.setCharacterStream(new StringReader(xml));
      try {
        Document doc = db.parse(is);
        StringBuilder sbk = new StringBuilder();
        ...

The map method then walks through the document extracting the values based on the known schema, and adds each value to the StringBuilder separated by a tab character. The input records may contain only some of the values; for example, a cash sale does not include an account number or a credit card number. Where these optional values are not present the code simply outputs a tab character. This means that all the output rows have the same number of columns, with empty values for the missing values in the input records.
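The fixed-column-count invariant described above can be sketched with a small standalone helper. This class is hypothetical (the real mapper builds its output with the StringBuilder code shown in this section), but it demonstrates the same rule: missing optional values become empty columns, never dropped columns.

```java
// Builds a tab-delimited record from a fixed set of columns, writing an
// empty string for any optional value that is missing (null). Every record
// therefore has the same number of columns, as the Hive table expects.
public class RecordBuilder {
    public static String toRecord(String... columns) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns.length; i++) {
            if (i > 0) sb.append('\t');       // one separator between columns
            if (columns[i] != null) sb.append(columns[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A cash sale has no account number, account location, or card number,
        // so those columns are emitted empty.
        System.out.println(toRecord("cashsale", null, null, null, "8849.38", null));
    }
}
```

Splitting the resulting line on tabs always yields the same number of fields, so Hive maps the empty fields to null values in the corresponding columns.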
Our custom input formatter and mapper components are specific to the format of the source data, because the differing schemas of each transaction type make it much more difficult and costly to create configurable, general purpose components.

The code also performs some translations on the data. For example, after extracting the location code for the transaction, the code splits the date and time into separate columns, as shown here.

Java

...
sbk.append(extractLocation(doc));
sbk.append('\t');
Date date;
try {
  date = new SimpleDateFormat("M/d/yyyy h:mm:ss a")
      .parse(extractDate(doc));
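The date-splitting step can be sketched in isolation as follows. The helper class and the output formats ("yyyy-MM-dd" and "HH:mm:ss") are illustrative assumptions — the actual output formats used by the Northwind code are not shown in this extract — but the input pattern is the one shown above, with Locale.US added so the AM/PM marker parses regardless of the default locale.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Splits a source <date> value such as "1/1/2013 10:25:56 AM" into separate
// date and time strings, matching the separate TransactionDate and
// TransactionTime columns of the target table.
public class DateSplitter {
    public static String[] split(String source) {
        try {
            // Same input pattern as the mapper code shown above
            Date date = new SimpleDateFormat("M/d/yyyy h:mm:ss a", Locale.US).parse(source);
            String datePart = new SimpleDateFormat("yyyy-MM-dd").format(date);
            String timePart = new SimpleDateFormat("HH:mm:ss").format(date);
            return new String[] { datePart, timePart };
        } catch (ParseException e) {
            // Records with unparseable dates would be redirected for review
            throw new IllegalArgumentException("Invalid date: " + source, e);
        }
    }

    public static void main(String[] args) {
        String[] parts = split("1/1/2013 10:25:56 AM");
        System.out.println(parts[0] + "\t" + parts[1]); // prints 2013-01-01	10:25:56
    }
}
```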
The code uses methods named extractLocation and extractDate to extract the specific items from the XML document. These methods, together with similar methods to extract other values, are located elsewhere in the source code file.

To build the remaining part of each output record the code uses another StringBuilder, to which it adds the remaining columns for the record. Finally, it outputs the contents of the two StringBuilder instances to the context.

Java

...
StringBuilder sbv = new StringBuilder();
sbv.append(extractTransactionType(doc));
sbv.append('\t');
sbv.append(extractAccount(doc));
sbv.append('\t');
sbv.append(extractAccountLocation(doc));
sbv.append('\t');
sbv.append(extractCardNumber(doc));
sbv.append('\t');
sbv.append(extractAmount(doc));
sbv.append('\t');
sbv.append(extractDescription(doc));
context.write(new Text(sbk.toString()), new Text(sbv.toString()));

The remainder of the code is concerned with error handling, and the definition of the methods that extract specific values from the source document. You can examine the entire source file in the Hands-on Labs associated with this guide.

Executing the Map/Reduce Job

The following code is used to execute the map/reduce job and generate the delimited text version of the transaction data. Notice that it specifies the custom ParseTransactionMapper as the mapper class for the job, and the custom XMLInputFormat class as the input formatter for the job.

Java

package northwindETL;
...
import northwindETL.XMLInputFormat;
public class XMLProcessing {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = new Job(conf, "XML Processing");
    job.setJarByClass(XMLProcessing.class);
    job.setMapperClass(ParseTransactionMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(XMLInputFormat.class);
    XMLInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map/reduce code is packaged as a .jar file and is executed on the cluster.

The Hive Statement

After the tab-delimited text file has been created by the map/reduce job, the following HiveQL statement can be used to define a Hive table over the results. Where there are empty values in the records in the text file, the associated columns in the row will contain a null value. (In Hive, a numeric delimiter string such as '9' is interpreted as the character with that ASCII code — here, the tab character.)

HiveQL

CREATE EXTERNAL TABLE TransactionRaw
 (TransactionLocation STRING, TransactionDate STRING, TransactionTime STRING,
  TransactionType STRING, Account STRING, AccountLocation STRING,
  CardNumber STRING, Amount DOUBLE, Description STRING)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '9'
 STORED AS TEXTFILE LOCATION '/etl/parseddata';

Normalizing the Store Codes

The location IDs in the source data are specific to the geographical region where the transaction occurred. The location ID consists of an encoded value that indicates the city-specific store number, the city, and the country or region where the store is located. Globally unique IDs for each store are maintained in a Hive table, with the city and the country or region details stored in two more tables. These lookup tables are based on the following HiveQL schema definitions.

HiveQL

CREATE EXTERNAL TABLE Store
 (Id INT, CityId INT, LocalId INT, LocalCode STRING, Description STRING,
  OpeningTime STRING, ClosingTime STRING)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '9'
 STORED AS TEXTFILE LOCATION '/etl/lookup/store/';
CREATE EXTERNAL TABLE City
  (Id INT, CountryRegionId INT, LocalId INT, LocalCode STRING, Description STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '9'
STORED AS TEXTFILE LOCATION '/etl/lookup/city/';
CREATE EXTERNAL TABLE CountryRegion
  (Id INT, Code STRING, Description STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '9'
STORED AS TEXTFILE LOCATION '/etl/lookup/countryregion/';
A requirement to perform some transformation on values in the source data before storing it in your BI system is typical of most ETL processes, especially where the source data comes from a legacy system or a system that was not designed to support the data interchange process you require. In our case, the store identifiers in the source data have a completely different format from those in our BI system, and must be transformed using lookup tables. The ETL workflow must look up the correct unique store ID for each transaction and load the resulting data into a staging table.
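The staging table's definition can be inferred from the columns referenced by the INSERT and validation queries in this section; a plausible reconstruction is shown below. Treat it as a sketch: the column names, order, and location are taken from the queries and properties shown elsewhere in this chapter, but the exact original definition may differ.

```sql
CREATE TABLE TransactionStaging
  (TransactionLocation STRING, TransactionLocationId INT,
   TransactionDate STRING, TransactionTime STRING, TransactionType STRING,
   Account STRING, AccountLocation STRING, AccountLocationId INT,
   CardNumber STRING, Amount DOUBLE, Description STRING)
LOCATION '/etl/hive/transactionstaging/';
```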
To accomplish this, the following HiveQL code is used to query the TransactionRaw table and insert the results into the TransactionStaging table.
HiveQL
SET hive.auto.convert.join = true;

INSERT INTO TABLE TransactionStaging
SELECT TR.TransactionLocation, ST.Id, TR.TransactionDate, TR.TransactionTime,
       TR.TransactionType, TR.Account, TR.AccountLocation, ST2.Id,
       TR.CardNumber, TR.Amount, TR.Description
FROM TransactionRaw TR
LEFT OUTER JOIN CountryRegion CR
  ON substr(TR.TransactionLocation, length(TR.TransactionLocation)-1, 2) = CR.Code
LEFT OUTER JOIN City CT
  ON substr(TR.TransactionLocation, length(TR.TransactionLocation)-4, 3) = CT.LocalCode
  AND CR.Id = CT.CountryRegionId
LEFT OUTER JOIN Store ST
  ON substr(TR.TransactionLocation, 1, length(TR.TransactionLocation)-5) = ST.LocalCode
  AND CT.Id = ST.CityId
LEFT OUTER JOIN CountryRegion CR2
  ON substr(TR.AccountLocation, length(TR.AccountLocation)-1, 2) = CR2.Code
LEFT OUTER JOIN City CT2
  ON substr(TR.AccountLocation, length(TR.AccountLocation)-4, 3) = CT2.LocalCode
  AND CR2.Id = CT2.CountryRegionId
LEFT OUTER JOIN Store ST2
  ON substr(TR.AccountLocation, 1, length(TR.AccountLocation)-5) = ST2.LocalCode
  AND CT2.Id = ST2.CityId;
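The nested substr calls decode the fixed-width location ID format: the last two characters hold the country/region code, the three characters before them the city code, and the remainder the store's local code (Hive's substr is 1-based). The following standalone sketch mirrors that arithmetic; the sample ID 12NYCUS is invented for illustration.

```java
public class LocationCode {
    // Mirrors substr(loc, length(loc)-1, 2): the last two characters.
    static String countryRegion(String loc) {
        return loc.substring(loc.length() - 2);
    }

    // Mirrors substr(loc, length(loc)-4, 3): the three characters
    // immediately before the country/region code.
    static String cityCode(String loc) {
        return loc.substring(loc.length() - 5, loc.length() - 2);
    }

    // Mirrors substr(loc, 1, length(loc)-5): everything before the city code.
    static String localStoreCode(String loc) {
        return loc.substring(0, loc.length() - 5);
    }

    public static void main(String[] args) {
        String loc = "12NYCUS"; // hypothetical location ID
        System.out.println(localStoreCode(loc)); // store's local code
        System.out.println(cityCode(loc));       // city code
        System.out.println(countryRegion(loc));  // country/region code
    }
}
```

Because the joins compare these substrings against the LocalCode columns of the lookup tables using LEFT OUTER JOIN, a transaction whose location ID cannot be resolved simply receives NULL IDs, which is exactly what the validation step described next relies on.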
Redirecting Invalid Rows
A transaction is considered invalid if any of the following is true:
- The transaction location ID is null.
- The transaction date is empty.
- The transaction time is empty.
- The account location is not empty, but the account location ID is null.
Rows in which any of these validation rules fails must be redirected to a separate table for investigation later. The table for invalid rows is based on the following HiveQL schema definition.

HiveQL
CREATE TABLE TransactionInvalid
  (TransactionLocation STRING, TransactionDate STRING, TransactionTime STRING,
   TransactionType STRING, Account STRING, AccountLocation STRING,
   CardNumber STRING, Amount DOUBLE, Description STRING)
LOCATION '/etl/hive/transactioninvalid/';

To identify invalid rows and load them into the TransactionInvalid table, the ETL process uses the following HiveQL query.

HiveQL
INSERT INTO TABLE TransactionInvalid
SELECT TS.TransactionLocation, TS.TransactionDate, TS.TransactionTime,
       TS.TransactionType, TS.Account, TS.AccountLocation, TS.CardNumber,
       TS.Amount, TS.Description
FROM TransactionStaging TS
WHERE TS.TransactionLocationId IS NULL
   OR TS.TransactionDate = ''
   OR TS.TransactionTime = ''
   OR (TS.AccountLocation <> '' AND TS.AccountLocationId IS NULL);

Setting the Audit Flag
The final transformation that the ETL process must perform before loading the data into Windows Azure SQL Database is to set the Audit flag. To accomplish this, a Hive table containing an Audit column was created using the following HiveQL code.

HiveQL
CREATE TABLE TransactionAudit
  (TransactionLocationId INT, TransactionDate STRING, TransactionTime STRING,
   TransactionType STRING, Account STRING, AccountLocationId INT,
   CardNumber STRING, Amount DOUBLE, Description STRING, Audit TINYINT)
LOCATION '/etl/hive/transactionaudit/';

Valid rows are then loaded into this table from the TransactionStaging table by executing the following HiveQL query.

HiveQL
SET hive.auto.convert.join = true;
INSERT INTO TABLE TransactionAudit
SELECT TS.TransactionLocationId, TS.TransactionDate, TS.TransactionTime,
       TS.TransactionType, TS.Account, TS.AccountLocationId, TS.CardNumber,
       TS.Amount, TS.Description,
       IF(Amount > 10000
          OR TS.TransactionTime < ST.OpeningTime
          OR TS.TransactionTime > ST.ClosingTime, 1, 0)
FROM TransactionStaging TS
INNER JOIN Store ST ON TS.TransactionLocationId = ST.Id
WHERE TS.TransactionLocationId IS NOT NULL
  AND TS.TransactionDate <> ''
  AND TS.TransactionTime <> ''
  AND (TS.AccountLocation = '' OR TS.AccountLocationId IS NOT NULL);

We need to identify several types of transactions for further attention. Transactions that take place outside store hours are obviously suspicious, and we also want to monitor those with a total value over a specified amount. These transactions can be marked with an audit flag in the database, allowing managers to easily select them for investigation. However, transactions that are invalid because they have an incorrect store code, or are missing data such as the time and date, cannot be stored in the database because this would violate the schema and table constraints. Therefore we redirect these to a separate Hive table instead.
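The IF expression in this query implements the audit rule directly. Note that it compares times as strings, which gives the correct ordering as long as the values use a fixed-width 24-hour format. A minimal standalone sketch of the same rule follows; the threshold of 10000 comes from the query, but the sample times and store hours are invented.

```java
public class AuditFlag {
    // Mirrors IF(Amount > 10000 OR TransactionTime < OpeningTime
    //            OR TransactionTime > ClosingTime, 1, 0) from the Hive query.
    // Lexical comparison is correct for fixed-width 24-hour time strings.
    static int audit(double amount, String txTime, String opening, String closing) {
        boolean flagged = amount > 10000
                || txTime.compareTo(opening) < 0
                || txTime.compareTo(closing) > 0;
        return flagged ? 1 : 0;
    }

    public static void main(String[] args) {
        // Hypothetical store hours of 09:00 to 17:30.
        System.out.println(audit(250.0, "03:15", "09:00", "17:30"));   // outside hours
        System.out.println(audit(12000.0, "10:00", "09:00", "17:30")); // over threshold
        System.out.println(audit(99.0, "10:00", "09:00", "17:30"));    // neither
    }
}
```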
Loading the Data into Windows Azure SQL Database
After the data has been validated and transformed, the ETL process loads it from the folder on which the TransactionAudit table is based into the StoreTransaction table in Windows Azure SQL Database by using the following Sqoop command.

Command Line
sqoop export --connect "jdbc:sqlserver://abcd1234.database.windows.net:1433;
  database=NWStoreTransactions;username=NWLogin@abcd1234;
  password=Pa$$w0rd;logintimeout=30;"
  --table StoreTransaction --export-dir /etl/hive/transactionaudit
  --input-fields-terminated-by \0001 --input-null-non-string \\N

Encapsulating the ETL Tasks in an Oozie Workflow
Having defined the individual tasks that make up the ETL process, the administrators at Northwind Traders decided to encapsulate the ETL process in an Oozie workflow. By using Oozie, the process can be parameterized and automated, making it easier to reuse on a regular basis. Figure 3 (below) shows an overview of the Oozie workflow that Northwind will implement. Notice that a core requirement, in addition to each of the processing and query tasks, is to be able to define a workflow that will execute these tasks in the correct order and, where appropriate, wait until each one completes before starting the next one.

You can see that the workflow contains a series of tasks that are processed in turn. The majority of the tasks use data generated by the previous task, and so must complete before the next one starts. However, one of the tasks, HiveCreateTablesFork, contains four individual tasks to create the Hive tables required by the ETL process. These four tasks can execute in parallel because they have no interdependencies. The only requirement is that all of them must complete before the next task, HiveNormalizeStoreCodes, can begin. You'll see how the workflow for these tasks is defined later in this chapter. If any of the tasks should fail, the workflow executes the kill task.
This generates a message containing details of the error, abandons any subsequent tasks, and halts the workflow. As long as there are no errors, the workflow ends when the final task that loads the data into SQL Database has completed.
Figure 3
The Oozie workflow for the Northwind ETL process

Defining the Workflow
The Oozie workflow, shown below, is defined in a Hadoop Process Definition Language (hPDL) file named workflow.xml. This file contains a <start> and an <end> element that define where the workflow process starts and ends, and a <kill> element that defines the action to take when prematurely halting the workflow if an error occurs. The remainder of the workflow definition consists of a series of actions that Oozie will execute when processing the workflow.

hPDL
<workflow-app xmlns="uri:oozie:workflow:0.2" name="ETLWorkflow">
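  <!-- The details of each element are omitted from this listing. The skeleton
       below is a sketch reconstructed from the action names and the <ok>/<error>
       transitions described later in this chapter; the original workflow.xml
       file may differ in detail. -->
  <start to="ParseAction"/>
  <action name="ParseAction"> ... </action>
  <fork name="HiveCreateTablesFork">
    <path start="HiveCreateRawTable"/>
    <path start="HiveCreateStagingTable"/>
    <path start="HiveCreateInvalidTable"/>
    <path start="HiveCreateAuditTable"/>
  </fork>
  <action name="HiveCreateRawTable"> ... </action>
  <action name="HiveCreateStagingTable"> ... </action>
  <action name="HiveCreateInvalidTable"> ... </action>
  <action name="HiveCreateAuditTable"> ... </action>
  <join name="HiveCreateTablesJoin" to="HiveNormalizeStoreCodes"/>
  <action name="HiveNormalizeStoreCodes"> ... </action>
  <action name="HiveFilterInvalid"> ... </action>
  <action name="HiveAuditStoreHoursAndAmount"> ... </action>
  <action name="SqoopTransferData"> ... </action>
  <kill name="fail">
    <message>Workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>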
</workflow-app>

Notice the <fork> element, which specifies four actions that can be performed in parallel instead of waiting for each one to finish before starting the next. The <join> element a little further down the file indicates the point where all previous tasks must be finished before continuing with the next one. The <join> element specifies that the next task after the <fork> has completed execution is the HiveNormalizeStoreCodes action. It, and the subsequent three tasks, will execute in turn, with each one starting only when the previous one has finished.

The previous listing omitted the details of each <action> to make it easier to see the structure of the workflow definition file. We'll look in more detail at some of the individual actions next. The complete workflow.xml file and the other files described in this chapter, such as the Hive scripts, are included in the Hands-on Labs associated with this guide. To obtain the labs go to http://wag.codeplex.com/releases/view/103405. The files are in the folder named Lab 2 - Extract Transform Load\Labfiles\OozieWorkflow.

Executing the Map/Reduce Code to Parse the Source Data
The <start> element in the workflow.xml file specifies that the first action to execute is the one named ParseAction. This action defines the map/reduce job described earlier in this chapter. The definition of the action is shown below. Notice that it specifies all of the parameters required to run the job, including details of the job tracker and the name node, the Java classes to use as the input formatter and the mapper, the name of the root node in the XML source data (required by the input formatter), and the type and location of the input and output files.
hPDL
<action name='ParseAction'>
  <map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <prepare>
      <delete path="${outputAsvFullPath}"/>
    </prepare>
    <configuration>
      <property>
        <name>mapred.reducer.new-api</name>
        <value>true</value>
      </property>
      <property>
        <name>mapred.mapper.new-api</name>
        <value>true</value>
      </property>
      <property>
        <name>mapreduce.map.class</name>
        <value>northwindETL.ParseTransactionMapper</value>
      </property>
      <property>
        <name>mapreduce.inputformat.class</name>
        <value>northwindETL.MultiXmlElementInputFormat</value>
      </property>
      <property>
        <name>xmlinput.element</name>
        <value>transaction</value>
      </property>
      <property>
        <name>mapred.output.key.class</name>
        <value>org.apache.hadoop.io.Text</value>
      </property>
      <property>
        <name>mapred.output.value.class</name>
        <value>org.apache.hadoop.io.Text</value>
      </property>
      <property>
        <name>mapred.input.dir</name>
        <value>${inputDir}</value>
      </property>
      <property>
        <name>mapred.output.dir</name>
        <value>${outputDir}</value>
      </property>
    </configuration>
  </map-reduce>
  <ok to='HiveCreateTablesFork'/>
  <error to='fail'/>
</action>
Several of the values used by this workflow action are parameters that are set in the job configuration, and are populated when the workflow executes. The syntax ${...} denotes a parameter that is populated at runtime. You'll see how the values are obtained from the job configuration later in this chapter.

At the end of every action in the workflow are an <ok> and an <error> element. These define the name of the action to execute next, and the name of the <kill> action to execute if there is an error. The actions in a workflow definition file do not need to be placed in the order they will be executed, because each action defines the one to execute next. However, it makes the workflow easier to understand and debug if the actions are defined in the same order as the execution plan.
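Conceptually, this ${...} resolution is a simple token substitution from the job's property set. Oozie's Expression Language is considerably richer (it also supports functions such as wf:id()), but the following sketch shows the basic idea; all names and values in it are invented for illustration.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParamSubstitution {
    private static final Pattern PARAM = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${name} token with its value from the supplied properties;
    // unknown tokens are left in place.
    static String resolve(String template, Map<String, String> props) {
        Matcher m = PARAM.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = props.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props =
            Map.of("nameNode", "asv://nwstore@nwstore.blob.core.windows.net");
        System.out.println(resolve("<name-node>${nameNode}</name-node>", props));
    }
}
```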
Creating the Hive Tables
The four actions in the workflow that create the Hive tables are fundamentally similar; the only differences are the name of the Hive script to execute, the name and location of the table to be created, and the name of the next action to execute. The following extract from the workflow.xml file shows the definition of the HiveCreateRawTable action.

hPDL
<action name="HiveCreateRawTable">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>default</value>
      </property>
      <property>
        <name>oozie.hive.defaults</name>
        <value>hive-default.xml</value>
      </property>
    </configuration>
    <script>HiveCreateRawTable.q</script>
    <param>TABLE_NAME=${outputTable}</param>
    <param>LOCATION=${outputDir}</param>
  </hive>
  <ok to="HiveCreateTablesJoin"/>
  <error to="fail"/>
</action>

You can see the name of the Hive script in the <script> element near the end of the definition. Below it are two parameters that specify the table name and the location. As in the ParseAction definition shown earlier, the values for these parameters are obtained from the job configuration using the ${...} syntax. The difference here is that the values from configuration are themselves parameters, which are passed to the Hive scripts at runtime. The following table shows the values of the script name, table name, and table location used in the three other actions that create Hive tables.
Action                   Script name                 Table name         Location
HiveCreateStagingTable   HiveCreateStagingTable.q    ${stagingTable}    ${stagingLocation}
HiveCreateInvalidTable   HiveCreateInvalidTable.q    ${invalidTable}    ${invalidLocation}
HiveCreateAuditTable     HiveCreateAuditTable.q      ${auditTable}      ${auditLocation}

Executing the Hive Queries
There are three actions that execute Hive queries to perform the transformations on the data and populate the Hive tables. The first action, HiveNormalizeStoreCodes, is shown here. Again, the <script> element specifies the name of the script to execute, and the <param> elements specify values to be passed to the Hive script at runtime.

hPDL
<action name="HiveNormalizeStoreCodes">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>default</value>
      </property>
      <property>
        <name>oozie.hive.defaults</name>
        <value>hive-default.xml</value>
      </property>
    </configuration>
    <script>HiveNormalizeStoreCodes.q</script>
    <param>RAW_TABLE_NAME=${outputTable}</param>
    <param>COUNTRY_REGION_TABLE_NAME=${countryRegionTable}</param>
    <param>CITY_TABLE_NAME=${cityTable}</param>
    <param>STORE_TABLE_NAME=${storeTable}</param>
    <param>STAGING_TABLE_NAME=${stagingTable}</param>
  </hive>
  <ok to="HiveFilterInvalid"/>
  <error to="fail"/>
</action>

The two subsequent actions that execute Hive scripts to populate the tables are HiveFilterInvalid and HiveAuditStoreHoursAndAmount. The HiveFilterInvalid action contains the following definition of the script name and parameters.

hPDL
<script>HiveFilterInvalid.q</script>
<param>STORE_TABLE_NAME=${storeTable}</param>
<param>STAGING_TABLE_NAME=${stagingTable}</param>
<param>INVALID_TABLE_NAME=${invalidTable}</param>

The HiveAuditStoreHoursAndAmount action contains the following definition of the script name and parameters.
hPDL
<script>HiveAuditStoreHoursAndAmount.q</script>
<param>STORE_TABLE_NAME=${storeTable}</param>
<param>STAGING_TABLE_NAME=${stagingTable}</param>
<param>INVALID_TABLE_NAME=${invalidTable}</param>
<param>AUDIT_TABLE_NAME=${auditTable}</param>

Loading the SQL Database Table
After the HiveAuditStoreHoursAndAmount action has completed, the next to execute is the SqoopTransferData action. This action uses Sqoop to load the data into the Windows Azure SQL Database table. The complete definition of this action is shown here.

hPDL
<action name="SqoopTransferData">
  <sqoop xmlns="uri:oozie:sqoop-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <arg>export</arg>
    <arg>--connect</arg>
    <arg>${connectionString}</arg>
    <arg>--table</arg>
    <arg>${targetSqlTable}</arg>
    <arg>--export-dir</arg>
    <arg>${auditLocation}</arg>
    <arg>--input-fields-terminated-by</arg>
    <arg>\0001</arg>
    <arg>--input-null-non-string</arg>
    <arg>\\N</arg>
  </sqoop>
  <ok to="end"/>
  <error to="fail"/>
</action>

You can see that it also uses several parameters that are set from configuration to specify the connection string of the database, the target table name, and the location of the file generated by the final Hive query in the workflow.
Parameterizing the Workflow Tasks
Each of the tasks that require the execution of a HiveQL query references a file with the extension .q; for example, the HiveCreateRawTable task references HiveCreateRawTable.q. These files contain parameterized versions of the HiveQL statements described previously. For example, HiveCreateRawTable.q contains the following code.

HiveQL
DROP TABLE IF EXISTS ${TABLE_NAME};
The parameter values for each task are specified as variables in workflow.xml, and they are set in a job.properties file as shown in the following example.

job.properties file
nameNode=asv://nwstore@nwstore.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/etl/oozieworkflow/
outputAsvFullPath=asv://nwstore@nwstore.blob.core.windows.net/etl/parseddata/
inputDir=/etl/source/
outputDir=/etl/parseddata/
outputTable=TransactionRaw
stagingTable=TransactionStaging
stagingLocation=/etl/hive/transactionstaging/
invalidTable=TransactionInvalid
invalidLocation=/etl/hive/transactioninvalid/
auditTable=TransactionAudit
auditLocation=/etl/hive/transactionaudit/
countryRegionTable=CountryRegion
cityTable=City
storeTable=Store
connectionString=jdbc:sqlserver://abcd1234.database.windows.net:1433;database=NWStoreTransactions;user=NWLogin@abcd1234;password=Pa$$w0rd;encrypt=true;trustServerCertificate=true;loginTimeout=30;
targetSqlTable=StoreTransaction

The ability to abstract settings in a separate file makes the ETL workflow more flexible. It can be easily adapted to handle future changes in the environment, such as a requirement to use alternative HDFS folder locations or a different Windows Azure SQL Database instance.

Executing the Oozie Workflow
To execute the Oozie workflow the administrators upload the workflow files to HDFS and the job.properties file to the local file system on the HDInsight cluster, and then run the following command on the cluster.

Command Line
oozie job -oozie http://localhost:11000/oozie/ -config c:\ETL\job.properties -run

When the Oozie job starts, the command line interface displays the unique ID assigned to the job. The administrators can then view the progress of the job by using the browser on the HDInsight cluster to display the job status at http://localhost:11000/oozie/v0/job/the_unique_job_id?show=log.
Automating the ETL Workflow with the Microsoft .NET SDK for Hadoop
Using an Oozie workflow simplifies ETL execution and makes it easier to repeat the process. However, the administrators must interactively upload the workflow files and use a Remote Desktop connection to the HDInsight cluster to run the workflow each time. To further automate the ETL process, a developer was tasked with creating an ETL process management utility that can be run from an on-premises client computer.

Importing the .NET SDK Libraries
To create the ETL management application, the developer created a new Microsoft Visual C# console application and used the package manager to import the Microsoft.Hadoop.WebClient library from NuGet.

PowerShell
Install-Package Microsoft.Hadoop.WebClient

This library includes the Microsoft.Hadoop.WebHDFS and Microsoft.Hadoop.WebClient.OozieClient namespaces, which contain the classes that will be used to upload the workflow files and execute the Oozie job remotely.

Defining the Application Settings
To provide flexibility, the Oozie workflow was designed to use variables and parameters. To accommodate this flexibility in the management application, the developer added the following entries to the appSettings section of the application's configuration file.

XML
<appSettings>
  <add key="ClusterURI" value="https://nwstore.azurehdinsight.net:563" />
  <add key="HadoopUser" value="NWuser" />
  <add key="HadoopPassword" value="!Pa$$w0rd!" />
  <add key="StorageKey" value="abcdefghijklmn123456789=" />
  <add key="ContainerName" value="nwstore" />
  <add key="StorageName" value="nwstore" />
  <add key="NameNodeHost" value="asv://nwstore@nwstore.blob.core.windows.net" />
  <add key="WorkflowDir" value="/etl/workflow/" />
  <add key="InputDir" value="/etl/source/" />
  <add key="OutputDir" value="/etl/parseddata/" />
  <add key="RawTableName" value="transactionraw" />
  <add key="StagingTableName" value="transactionstaging" />
  <add key="StagingLocation" value="/etl/hive/transactionstaging/" />
  <add key="InvalidTableName" value="transactioninvalid" />
  <add key="InvalidLocation" value="/etl/hive/transactioninvalid/" />
  <add key="AuditTableName" value="transactionaudit" />
  <add key="AuditLocation" value="/etl/hive/transactionaudit/" />
  <add key="CountryRegionTable" value="countryregion" />
  <add key="CityTable" value="city" />
  <add key="StoreTable" value="store" />
  <add key="SqlConnectionString" value="jdbc:sqlserver://abcd1234.database.windows.net:1433;database=NWStoreTransactions;user=NWLogin@abcd1234;password=Pa$$w0rd;encrypt=true;trustServerCertificate=true;loginTimeout=30;" />
  <add key="TargetSqlTable" value="StoreTransaction" />
</appSettings>

These settings include the properties that were previously set in the job.properties file, as well as the settings required to connect to the HDInsight cluster and the Windows Azure blob storage it uses.

Creating the Application Code
The code for the application starts by implementing the Main method, in which it asynchronously executes a private method named UploadWorkflowFilesAndExecuteOozieJob. This method executes two other methods named UploadWorkflowFiles and CreateAndExecuteOozieJob, as shown in the following sample.

C#
namespace ETLWorkflow
{
  ...
  public class Program
  {
    internal static void Main(string[] args)
    {
      UploadWorkflowFilesAndExecuteOozieJob().Wait();
    }
    private static async Task UploadWorkflowFilesAndExecuteOozieJob()
    {
      await UploadWorkflowFiles();
      await CreateAndExecuteOozieJob();
      Console.WriteLine("Press any key to continue...");
      Console.ReadKey();
    }

Uploading the Workflow Files
As the name suggests, the UploadWorkflowFiles method uploads the files required to execute the workflow to HDInsight. When compiled, the OozieJob folder (which contains the workflow files and the lib subfolder containing the NorthwindETL.jar map/reduce component) is stored in the same folder as the executable. The method creates an instance of the WebHDFSClient class from the .NET SDK for Hadoop using values loaded from the configuration file, and uses the new instance to delete any existing workflow folder and contents before uploading all of the files located in the local workflow folder.

C#
private static async Task UploadWorkflowFiles()
{
  try
  {
    var workflowLocalDir = new DirectoryInfo(@".\OozieJob");
    var hadoopUser = ConfigurationManager.AppSettings["HadoopUser"];
    var storageKey = ConfigurationManager.AppSettings["StorageKey"];
    var storageName = ConfigurationManager.AppSettings["StorageName"];
    var containerName = ConfigurationManager.AppSettings["ContainerName"];
    var workflowDir = ConfigurationManager.AppSettings["WorkflowDir"];
var hdfsClient = new WebHDFSClient(hadoopUser, new BlobStorageAdapter(storageName, storageKey, containerName, false));
    foreach (var dir in workflowLocalDir.GetDirectories())
    {
      await hdfsClient.CreateDirectory(workflowDir + dir.Name);
      foreach (var file in dir.GetFiles())
      {
        await hdfsClient.CreateFile(file.FullName,
          workflowDir + dir.Name + "/" + file.Name);
      }
    }
  }
  catch (Exception)
  {
    // Rethrow without resetting the stack trace ("throw ex" would reset it).
    throw;
  }
}

Creating and Executing the Oozie Job
Next, the CreateAndExecuteOozieJob method executes. This method reads all of the parameter values required by the workflow from configuration, and uses some of them to create instances of the OozieHttpClient class and OozieJobProperties class from the .NET SDK for Hadoop.

C#
private static async Task CreateAndExecuteOozieJob()
{
  var hadoopUser = ConfigurationManager.AppSettings["HadoopUser"];
  var hadoopPassword = ConfigurationManager.AppSettings["HadoopPassword"];
  var nameNodeHost = ConfigurationManager.AppSettings["NameNodeHost"];
  var workflowDir = ConfigurationManager.AppSettings["WorkflowDir"];
  var inputDir = ConfigurationManager.AppSettings["InputDir"];
  var outputDir = ConfigurationManager.AppSettings["OutputDir"];
  var rawTableName = ConfigurationManager.AppSettings["RawTableName"];
  var stagingTableName = ConfigurationManager.AppSettings["StagingTableName"];
  var stagingLocation = ConfigurationManager.AppSettings["StagingLocation"];
  var auditTableName = ConfigurationManager.AppSettings["AuditTableName"];
  var auditLocation = ConfigurationManager.AppSettings["AuditLocation"];
  var invalidTableName = ConfigurationManager.AppSettings["InvalidTableName"];
  var invalidLocation = ConfigurationManager.AppSettings["InvalidLocation"];
  var countryRegionTable = ConfigurationManager.AppSettings["CountryRegionTable"];
  var cityTable = ConfigurationManager.AppSettings["CityTable"];
  var storeTable = ConfigurationManager.AppSettings["StoreTable"];
  var sqlConnectionString = ConfigurationManager.AppSettings["SqlConnectionString"];
  var targetSqlTable = ConfigurationManager.AppSettings["TargetSqlTable"];
  var clusterUri = new Uri(ConfigurationManager.AppSettings["ClusterURI"]);
var client = new OozieHttpClient(clusterUri, hadoopUser, hadoopPassword);
var prop = new OozieJobProperties(hadoopUser, nameNodeHost, "jobtrackerhost:9010", workflowDir, inputDir, outputDir); ...
Next, the code populates the Oozie job properties, submits the job, and waits for Hadoop to acknowledge that the job has been submitted. Then it uses a JsonSerializer instance to deserialize the result and display the job ID, before executing the job by calling the StartJob method of the OozieHttpClient instance. This allows the user to see the ID of the job that is executing, and to monitor the job if required by searching for this ID in the Hadoop administration website pages.

C#
...
var parameters = prop.ToDictionary();
parameters.Add("oozie.use.system.libpath", "true");
parameters.Add("outputTable", rawTableName);
parameters.Add("stagingTable", stagingTableName);
parameters.Add("stagingLocation", stagingLocation);
parameters.Add("invalidTable", invalidTableName);
parameters.Add("invalidLocation", invalidLocation);
parameters.Add("auditTable", auditTableName);
parameters.Add("auditLocation", auditLocation);
parameters.Add("cityTable", cityTable);
parameters.Add("countryRegionTable", countryRegionTable);
parameters.Add("storeTable", storeTable);
parameters.Add("targetSqlTable", targetSqlTable);
parameters.Add("outputAsvFullPath", nameNodeHost + outputDir);
parameters.Add("connectionString", sqlConnectionString);
var newJob = await client.SubmitJob(parameters);
var content = await newJob.Content.ReadAsStringAsync();
var serializer = new JsonSerializer();
dynamic json = serializer.Deserialize(
  new JsonTextReader(new StringReader(content)));
string id = json.id;
Console.WriteLine(id);
await client.StartJob(id);
Console.WriteLine("Job started!");
}
By using this simple client application the administrators at Northwind Traders can initiate the ETL process from an on-premises client computer with minimal effort. The program could even be scheduled for unattended execution at a regular interval to create a fully automated ETL process for the daily transaction files.
The complete source code for the application is included in the Hands-on Labs associated with this guide. To obtain the labs go to http://wag.codeplex.com/releases/view/103405. The files are in the folder named Lab 2 - Extract Transform Load\Labfiles\ETLWorkflow.
Summary
In this scenario you have seen how Northwind Traders implemented an ETL process that handles the transaction data submitted by all of its stores each night. The data arrives as XML documents, and must pass through a workflow that consists of several tasks. These tasks transform the data, validate it, generate the correct store identifiers, detect fraudulent activity or transactions above a specific value, and then store the results in the head office database.

You saw how Northwind Traders uses Oozie to implement the workflow. This workflow defines each task, and executes the tasks in turn. While Oozie is not the only solution for implementing a workflow with HDInsight, it is integrated with Hadoop in such a way that it is easy to execute scripts and queries in a controllable and easily configurable way. For an organization that does not use an on-premises database such as SQL Server and associated technologies such as SQL Server Integration Services (SSIS), Oozie is typically a good choice for building workflows for use with HDInsight.

As part of the scenario, you saw how the workflow can execute a custom map/reduce operation, create Hive tables, join tables and select data, validate values in the data, and apply criteria to detect specific conditions such as sales over a certain value. You also saw how you can use Sqoop to push the results into a Windows Azure SQL Database instance. Finally, you saw how the classes in the Microsoft .NET SDK for Hadoop make it possible to automate an Oozie workflow.

If you would like to work through the solution to this scenario yourself, you can download the files and a step-by-step guide. These are included in the related Hands-on Labs available from http://wag.codeplex.com/releases/view/103405.
Scenario 3: HDInsight as a Data Warehouse
In this scenario HDInsight is used as a data warehouse for big data analysis. Specifically, this scenario demonstrates how to:
- Create a data warehouse in an HDInsight cluster.
- Load data into Hive tables in a data warehouse.
- Optimize HiveQL queries.
- Use an HDInsight-based data warehouse as a source for data analysis.
The scenario also discusses how you can minimize running costs by shutting down the cluster when it is not in use. This chapter describes the scenario in general, and the solutions applied to meet the goals of the scenario. If you would like to work through the solution to this scenario yourself, you can download the files and a step-by-step guide. These are included in the related Hands-on Labs available from http://wag.codeplex.com/releases/view/103405.

Introduction to A. Datum
The scenario is based on a fictional company named A. Datum, which conducts research into tornadoes and other weather-related phenomena in the United States. In this scenario, data analysts at A. Datum want to use HDInsight as a central repository for historical tornado data in order to analyze and visualize previous tornadoes, and to try to identify trends in terms of geographical locations and times.

Analytical Goals and Data Sources
The data analysts at A. Datum have gathered historical data about tornadoes in the United States since 1934. The data includes the date and time, geographical start and end point, category, size, number of fatalities and casualties, and damage costs of each tornado. The goal of the data warehouse is to enable analysts to slice and dice this data and visualize aggregated values on a map. This will act as a resource for further research into severe weather patterns, and support further investigation in the future.

The data analysts also implemented a mechanism to continuously collect new data as it is published, and this is added to the existing data by being uploaded to Windows Azure blob storage. However, as a research organization, A. Datum does not monitor the data on a daily basis; this is the job of the storm warning agencies in the United States. Instead, A. Datum's primary aim is to analyze the data only after significant tornado events have occurred, or on a quarterly basis to provide updated results for use in its scientific publications.
Therefore, to minimize running costs, the analysts want to be able to shut down the cluster when it is not being used for analysis, and recreate it when required. Creating the Data Warehouse A data warehouse is fundamentally a database that is optimized for queries that support data analysis and reporting. Data warehouses are often implemented as relational databases with a schema that optimizes query performance, and aggregation of important numerical measures, at the expense of some data duplication in denormalized tables. When creating a data warehouse for Big Data you can use a relational database engine that is designed to handle huge volumes of data (such as SQL Server Parallel Data Warehouse), or you can load the data into a Hadoop cluster and use Hive tables to project a schema onto the data. In this scenario the data analysts at A. Datum have decided to use a Hadoop cluster provided by the Windows Azure HDInsight service. This enables them to build a Hive-based data warehouse schema that can be used as a source for analysis and reporting in traditional tools such as Microsoft Excel, but which also will enable them to apply Big Data analysis tools and techniques. For example, they can carry out iterative data exploration with Pig and custom map/reduce code, and predictive analysis with Hadoop technologies such as Mahout. Figure 1 shows a schematic view of the architecture when using HDInsight to implement a data warehouse.
Figure 1 Using HDInsight as a Data Warehouse for analysis, reporting, and as a business data source

Unlike a traditional relational database, HDInsight allows you to manage the lifetime and storage of tables and indexes (metadata) separately from the data that populates the tables. As you saw in Chapter 3, a Hive table is simply a definition that is applied over a folder containing data. This separation is what enables one of the primary differences between Big Data solutions and relational databases. In a Big Data solution you apply a schema when the data is read, rather than when it is written. In this scenario you'll see how the capability to apply a schema on read provides an advantage for organizations that need a data warehousing capability where data can be continuously collected, but analysis and reporting is carried out only occasionally.

Creating a Database
When planning the HDInsight data warehouse, the data analysts at A. Datum needed to consider ways to ensure that the data and Hive tables can be easily recreated in the event of the HDInsight cluster being released and re-provisioned. This might happen for a number of reasons, including temporarily decommissioning the cluster to save costs during periods of non-use, and releasing the cluster in order to create a new one with more nodes in order to scale out the data warehouse. The analysts identified two possible approaches:
• Save a HiveQL script that can be used to recreate EXTERNAL tables based on HDFS folders where the data is persisted in Windows Azure blob storage.
• Use a custom Windows Azure SQL Database to host the Hive metadata store.
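The first of these approaches amounts to retaining a short HiveQL script that can simply be re-run against a newly provisioned cluster. A minimal sketch follows; the table, columns, and folder shown here are illustrative, and the actual tables for this scenario are created later in the chapter:

```sql
-- Recreating an EXTERNAL table reattaches the schema to data files that
-- remain in Windows Azure blob storage after the cluster is released.
CREATE EXTERNAL TABLE States
 (StateCode STRING, StateName STRING)
STORED AS SEQUENCEFILE
LOCATION '/datawarehouse/database/States';
```

Because the table is EXTERNAL, dropping it (or releasing the cluster) does not delete the underlying data, so running the script restores the schema without reloading anything.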
To test the approaches for recreating a cluster, the analysts provisioned an HDInsight cluster and an instance of Windows Azure SQL Database, and used the steps described in the section Retaining Data and Metadata for a Cluster in Chapter 7 of this guide to use the custom SQL database instance to store Hive metadata instead of the default internal store.
Using a HiveQL script to recreate tables after releasing and re-provisioning a cluster requires no configuration effort, and is effective when the data warehouse will contain only a few tables and other objects. Hive tables can be created as EXTERNAL in custom HDFS folder locations so that the data is not deleted in the event of the cluster being released, and the saved script can be run to recreate the tables over the existing data when the cluster is re-provisioned.

The benefit of using a custom SQL Database instance for the Hive metadata store is that, in the event of the cluster being released, the Hive metastore is not deleted. When the cluster is re-provisioned the analysts can update the settings in the hive-site.xml file to restore the connection to the custom metastore. Assuming the tables were created as EXTERNAL tables, the data will remain in place in Windows Azure storage and the data warehouse schema will be restored.

To ensure a logical separation from Hive tables used for other analytical processes, the data analysts decided to create a dedicated database for the data warehouse by using the following HiveQL statement:

HiveQL
CREATE DATABASE DataWarehouse LOCATION '/datawarehouse/database';

This statement creates an HDFS folder named /datawarehouse/database as the default folder for all objects created in the DataWarehouse database. Creating a logical database in HDInsight is a useful way to provide separation between the contents of the database and other items located in the same cluster.

Creating Tables
The tornado data includes the code for the state where the tornado occurred, as well as the date and time of the tornado. The data analysts want to be able to display the full state name in reports, and so created a table for state names using the following HiveQL statement.

HiveQL
USE DataWarehouse;
CREATE EXTERNAL TABLE States
 (StateCode STRING, StateName STRING)
STORED AS SEQUENCEFILE;

Notice that the table is stored in the default location.
For the DataWarehouse database this is the /datawarehouse/database folder, and so this is where a new folder named States is created. The table is formatted as a Sequence File. Tables in this format typically provide faster performance than tables in which data is stored as text.
You must use EXTERNAL tables if you want the data to be persisted when you delete a table definition or when you recreate a cluster. Storing the data in SEQUENCEFILE format is also a good idea as it can improve performance.

The data analysts also want to be able to aggregate data by temporal hierarchies (year, month, and day) and create reports that show month and day names. Hive does not support a datetime data type, and provides no functions to convert dates to constituent parts. While many client applications used to analyze and report data support this kind of functionality, the data analysts want to be able to generate reports without relying on specific client application capabilities. To support date-based hierarchies and reporting, the data analysts created a table containing various date attributes that can be used as a lookup table for date codes in the tornado data. The creation of a date table like this is a common pattern in relational data warehouses.

HiveQL
USE DataWarehouse;
CREATE EXTERNAL TABLE Dates
 (DateCode STRING, CalendarDate STRING, DayOfMonth INT, MonthOfYear INT,
  Year INT, DayOfWeek INT, WeekDay STRING, Month STRING)
STORED AS SEQUENCEFILE;

Finally, the data analysts created a table for the tornado data itself. Since this table is likely to be large, and many queries will filter by year, they decided to partition the table on a Year column, as shown in the following HiveQL statement.

HiveQL
USE DataWarehouse;
CREATE EXTERNAL TABLE Tornadoes
 (DateCode STRING, StateCode STRING, EventTime STRING, Category INT,
  Injuries INT, Fatalities INT, PropertyLoss DOUBLE, CropLoss DOUBLE,
  StartLatitude DOUBLE, StartLongitude DOUBLE, EndLatitude DOUBLE,
  EndLongitude DOUBLE, LengthMiles DOUBLE, WidthYards DOUBLE)
PARTITIONED BY (Year INT)
STORED AS SEQUENCEFILE;

Loading Data into the Data Warehouse
The tornado source data is in tab-delimited text format, and the data analysts have used Excel to create lookup tables for dates and states in the same format.
To simplify loading this data into the Sequence File formatted tables created previously, they decided to create staging tables in Text File format to which the source data will be uploaded, and then use a HiveQL INSERT statement to load data from the staging tables into the data warehouse tables. This approach also has the advantage of enabling the data analysts to query the staging tables and validate the data before loading it into the data warehouse.
We followed standard data warehousing practice by loading the data into staging tables where it can be validated before being added to the data warehouse.
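For example, a quick sanity check against a staging table might look like the following. The specific check is a hypothetical illustration rather than part of the scenario's scripts, and it assumes an eight-character date code, consistent with the SUBSTR(DateCode, 1, 4) expression used later to extract the year:

```sql
-- Count rows whose date code is missing or not the expected length
-- before loading them into the data warehouse tables.
SELECT COUNT(*)
FROM StagedTornadoes
WHERE DateCode IS NULL OR LENGTH(DateCode) <> 8;
```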
Creating the Staging Tables
The following HiveQL statements were used to create the staging tables in the default database.

HiveQL
CREATE EXTERNAL TABLE StagedStates
 (StateCode STRING, StateName STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/staging/states';
CREATE EXTERNAL TABLE StagedDates
 (DateCode STRING, CalendarDate STRING, DayOfMonth INT, MonthOfYear INT,
  Year INT, DayOfWeek INT, WeekDay STRING, Month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/staging/dates';
CREATE EXTERNAL TABLE StagedTornadoes
 (DateCode STRING, StateCode STRING, EventTime STRING, Category INT,
  Injuries INT, Fatalities INT, PropertyLoss DOUBLE, CropLoss DOUBLE,
  StartLatitude DOUBLE, StartLongitude DOUBLE, EndLatitude DOUBLE,
  EndLongitude DOUBLE, LengthMiles DOUBLE, WidthYards DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/staging/tornadoes';

Loading Static Data
The state data consists of a state code and state name for each US state (plus the District of Columbia and Puerto Rico), and is unlikely to change. Similarly, the date data consists of a row for each day between January 1st 1934 and December 31st 2013. It will be extended to include another year's worth of data at the end of each year, but will otherwise remain unchanged. These tables can be loaded using a one-time manual process; the source data file for each of the staging tables is uploaded to the appropriate folder on which the table is based, and the following HiveQL statements are used to load data from the staging tables into the data warehouse tables.

HiveQL
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
FROM StagedStates s
INSERT INTO TABLE DataWarehouse.States
SELECT s.StateCode, s.StateName;
FROM StagedDates s
INSERT INTO TABLE DataWarehouse.Dates
SELECT s.DateCode, s.CalendarDate, s.DayOfMonth, s.MonthOfYear, s.Year,
       s.DayOfWeek, s.WeekDay, s.Month;

Notice that these statements apply compression to the output of the queries, which improves performance and reduces the time taken to load the tables.

Implementing a Repeatable Load Process
Unlike states and dates, the tornado data will be loaded as frequently as new tornado data becomes available. This requirement made it desirable to implement a repeatable, automated solution for loading tornado data. To accomplish this, the data analysts enlisted the help of a software developer to create a utility for loading new tornado data.

The .NET SDK for Hadoop contains several classes that are very useful for automating a range of operations related to working with HDInsight. The developer used Microsoft Visual C# to implement a Windows console application that automates the load process, leveraging the WebHDFSClient and WebHCatHttpClient classes from the Microsoft .NET SDK for Hadoop. The application consists of a single Program class containing the following code (the connection setting constants shown here are placeholders; substitute the values for your own cluster and storage account):

C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Hadoop.WebHDFS;
using Microsoft.Hadoop.WebHDFS.Adapters;
using Microsoft.Hadoop.WebHCat.Protocol;

namespace TornadoDataLoader
{
  class Program
  {
    // Placeholder connection settings for the storage account and cluster.
    const string asv_account = "<storage-account-name>";
    const string asv_key = "<storage-account-key>";
    const string asv_container = "<container-name>";
    const string hadoop_user = "<user-name>";
    const string hadoop_password = "<password>";
    const string cluster_uri = "<cluster-uri>";
    static void Main(string[] args)
    {
      UploadFile(@"C:\Data\Tornadoes.txt", @"/staging/tornadoes");
      LoadTable().Wait();
      Console.WriteLine("\nPress any key to continue.");
      Console.ReadKey();
    }
    static void UploadFile(string source, string destination)
    {
      var storageAdapter = new BlobStorageAdapter(asv_account, asv_key, asv_container, true);
      var HDFSClient = new WebHDFSClient(hadoop_user, storageAdapter);
      Console.WriteLine("Deleting existing files...");
      HDFSClient.DeleteDirectory(destination, true).Wait();
      HDFSClient.CreateDirectory(destination).Wait();
      Console.WriteLine("Uploading new file...");
      HDFSClient.CreateFile(source, destination + "/data.txt").Wait();
    }
    private static async Task LoadTable()
    {
      var webHCat = new WebHCatHttpClient(new System.Uri(cluster_uri), hadoop_user, hadoop_password);
      webHCat.Timeout = TimeSpan.FromMinutes(5);
      Console.WriteLine("Starting data load into data warehouse...");
      // IMPORTANT - THE FOLLOWING STATEMENT SHOULD BE ON A SINGLE LINE.
      string command = @"SET mapred.output.compression.type=BLOCK; SET hive.exec.compress.output=true; SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; SET hive.exec.dynamic.partition.mode=nonstrict; FROM StagedTornadoes s INSERT INTO TABLE DataWarehouse.Tornadoes PARTITION (Year) SELECT s.DateCode, s.StateCode, s.EventTime, s.Category, s.Injuries, s.Fatalities, s.PropertyLoss, s.CropLoss, s.StartLatitude, s.StartLongitude, s.EndLatitude, s.EndLongitude, s.LengthMiles, s.WidthYards, SUBSTR(s.DateCode, 1, 4) Year;";
      var job = await webHCat.CreateHiveJob(command, null, null, "LoadDWData", null);
      var result = await job.Content.ReadAsStringAsync();
      Console.WriteLine("Job ID: " + result);
    }
  }
}
The line defining the command string in the code above must be all on the same line, and not broken over several lines as we've had to do here due to page width limitations.

This code uses the WebHDFSClient class to delete the /staging/tornadoes directory on which the StagedTornadoes table is based, and then recreates the directory and uploads the new data file. The code then uses the WebHCatHttpClient class to start an HCatalog job that uses a HiveQL statement to insert the data from the staging table into the Tornadoes table in the data warehouse.
You can download the Microsoft .NET SDK for Hadoop from CodePlex or install it with NuGet.

Optimizing Query Performance
Having loaded the data warehouse with data, the data analysts set about exploring how to optimize query performance. The data warehouse will be used as a source for multiple reporting and analytical activities, so understanding how to reduce the time it takes to retrieve results from a query is important.

Using Hive Session Settings to Improve Performance
Hive includes a number of session settings that can affect the way a query is processed, so the data analysts decided to start by investigating the effect of these settings. The exploration started with the baseline query in the following code sample.
HiveQL
SELECT s.StateName, d.MonthOfYear,
  SUM(t.PropertyLoss) PropertyCost, SUM(t.CropLoss) CropCost
FROM DataWarehouse.States s
  JOIN DataWarehouse.Tornadoes t ON t.StateCode = s.StateCode
  JOIN DataWarehouse.Dates d ON t.DateCode = d.DateCode
WHERE t.Year = 2010
GROUP BY s.StateName, d.MonthOfYear;

The data analysts executed the baseline query and observed how long it took to complete. They then explored the effect of compressing the map output and intermediate data generated by the map and reduce jobs (which are created by Hive to process the query) by using the following code.
HiveQL
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET hive.exec.compress.intermediate=true;

SELECT s.StateName, d.MonthOfYear,
  SUM(t.PropertyLoss) PropertyCost, SUM(t.CropLoss) CropCost
FROM DataWarehouse.States s
  JOIN DataWarehouse.Tornadoes t ON t.StateCode = s.StateCode
  JOIN DataWarehouse.Dates d ON t.DateCode = d.DateCode
WHERE t.Year = 2010
GROUP BY s.StateName, d.MonthOfYear;
This revised query resulted in an improved execution time, which improved further as the volume of data in the tables increased. The analysts concluded that applying compression settings is likely to be beneficial for typical data warehouse queries involving large volumes of data.
Many optimization techniques in HDInsight have a major impact only when you have large volumes of data to process; typically tens of gigabytes. With small datasets you may even see some degradation of performance from typical optimization techniques.

When examining the query logs for the baseline query, the analysts noted that the job used a single reducer. They decided to investigate the effect of using multiple reducers by explicitly setting the number, as shown in the following code.
HiveQL
SET mapred.reduce.tasks=2;

SELECT s.StateName, d.MonthOfYear,
  SUM(t.PropertyLoss) PropertyCost, SUM(t.CropLoss) CropCost
FROM DataWarehouse.States s
  JOIN DataWarehouse.Tornadoes t ON t.StateCode = s.StateCode
  JOIN DataWarehouse.Dates d ON t.DateCode = d.DateCode
WHERE t.Year = 2010
GROUP BY s.StateName, d.MonthOfYear;
An alternative approach to controlling the number of reducers would be to enforce a limit on the number of bytes allocated to a single reducer by setting the hive.exec.reducers.bytes.per.reducer property.
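For example, to cap each reducer at roughly 256 MB of input (the value shown is purely illustrative):

```sql
-- Hive derives the number of reducers from the total input size
-- divided by this per-reducer byte limit.
SET hive.exec.reducers.bytes.per.reducer=268435456;
```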
With small volumes of data, and clusters with only a few nodes, this approach offered no reduction in execution times; and in some cases was detrimental to query performance. However, with larger volumes of data and multi-node clusters the analysts found that forcing the use of multiple reducers could sometimes improve performance. The baseline query includes two joins in which a large table (Tornadoes) is joined to two smaller tables (States and Dates). Hive includes a built-in optimization named map-joins, which is designed to improve query performance in exactly this scenario. To force a map-join the analysts modified the query as shown in the following code sample.
HiveQL
SET hive.auto.convert.join=true;
SET hive.smalltable.filesize=25000000;

SELECT s.StateName, d.MonthOfYear,
  SUM(t.PropertyLoss) PropertyCost, SUM(t.CropLoss) CropCost
FROM DataWarehouse.Tornadoes t
  JOIN DataWarehouse.Dates d ON t.DateCode = d.DateCode
  JOIN DataWarehouse.States s ON t.StateCode = s.StateCode
WHERE t.Year = 2010
GROUP BY s.StateName, d.MonthOfYear;
Map-join optimization works by loading the smaller table into memory, reducing the time taken to perform lookups when joining keys from the larger table. In practice, the analysts found that the technique improved performance significantly when working with extremely large volumes of data and queries that typically take tens of minutes to execute, but the effect was less noticeable with small to medium tables where the execution time is already within minutes or seconds.
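As an alternative to the automatic conversion settings, Hive also accepts an explicit hint that names the small tables to be held in memory. A sketch based on the baseline query:

```sql
-- The MAPJOIN hint asks Hive to load the States and Dates tables into
-- memory while scanning the much larger Tornadoes table.
SELECT /*+ MAPJOIN(s, d) */ s.StateName, d.MonthOfYear,
  SUM(t.PropertyLoss) PropertyCost, SUM(t.CropLoss) CropCost
FROM DataWarehouse.Tornadoes t
  JOIN DataWarehouse.Dates d ON t.DateCode = d.DateCode
  JOIN DataWarehouse.States s ON t.StateCode = s.StateCode
WHERE t.Year = 2010
GROUP BY s.StateName, d.MonthOfYear;
```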
For more information about using Hive settings to optimize query performance, see Denny Lee's blog article "Optimizing Joins running on HDInsight Hive on Azure at GFS".
Using Indexes
Indexes are a common way to improve query performance in relational databases, and Hive provides extensible support for indexing Hive tables in an HDInsight cluster. The data analysts therefore decided to investigate the effect of indexes on query performance. The baseline query for this investigation is shown in the following code sample.

HiveQL
SELECT s.StateName,
  SUM(t.PropertyLoss) PropertyCost, SUM(t.CropLoss) CropCost
FROM DataWarehouse.States s
  JOIN DataWarehouse.Tornadoes t ON t.StateCode = s.StateCode
  JOIN DataWarehouse.Dates d ON t.DateCode = d.DateCode
WHERE d.MonthOfYear = 6 AND t.Year = 2010
GROUP BY s.StateName;
The Tornadoes table is partitioned on the Year column, which should reduce the time taken for queries that filter on year. However, the baseline query also filters on month, which is represented by the MonthOfYear column in the Dates table. The data analysts used the following HiveQL statements to create and build an index on this column.

HiveQL
USE DataWarehouse;
CREATE INDEX idx_MonthOfYear ON TABLE Dates (MonthOfYear) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
ALTER INDEX idx_MonthOfYear ON Dates REBUILD;
After the index had been built, the baseline query was re-executed to determine the effect of the index. As with other optimization techniques discussed previously the effect was most noticeable when querying large volumes of data. When the table contained a small amount of data the index had little or no effect, and in some cases proved detrimental to performance.
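In cases where an index proves unhelpful, it can be removed so that it no longer incurs rebuild costs; for example:

```sql
-- Dropping the index leaves the Dates table and its data unaffected.
DROP INDEX idx_MonthOfYear ON Dates;
```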
For more information about optimizing query and transformation performance see the section Tuning the Solution in Chapter 4 of this guide.
Analyzing Data from the Data Warehouse
Having built and populated the data warehouse and investigated techniques for improving query performance, the data analysts were ready to start analyzing the data in the data warehouse tables.

Analyzing Data in Excel
The data analysts typically use Excel for analysis, so the first task they undertook was to create a PowerPivot data model in an Excel workbook. The data model uses the Hive ODBC driver to import data from the data warehouse tables in the HDInsight cluster and defines relationships and hierarchies that can be used to aggregate the data, as shown in Figure 2.
Figure 2 A PowerPivot data model based on a Hive data warehouse
The business analysts could then use the data model to explore the data by creating a PivotTable showing the total cost of property and crop damage by state and time period, as shown in Figure 3.
Figure 3 Analyzing the data with a PivotTable

Analyzing Data with GeoFlow
The data includes geographic locations as well as the date and time of each tornado, making it ideal for visualization using the GeoFlow add-in for Excel. GeoFlow enables analysts to create interactive tours that show summarized data on a map, and animate it based on a timeline. The analysts created a GeoFlow tour that consists of a single scene with the following two layers:
• Layer 1: Accumulating property and crop damage costs shown as a stacked column chart by state.
• Layer 2: Tornado category shown as a heat map by latitude and longitude.
The GeoFlow designer for the tour is shown in Figure 4.
Figure 4 Designing a GeoFlow tour
GeoFlow is a great tool for generating interactive and highly immersive presentations of data, especially where the data has a date or time component that allows you to generate a "changes over time" animation.
The scene in the GeoFlow tour was then configured to animate movement across the map as the timeline progresses. While the tour runs, the property and crop losses are displayed as columns that accumulate each month, and the tornadoes are displayed as heat maps based on the tornado category (defined in the Enhanced Fujita scale), as shown in Figure 5.
Figure 5 Viewing a GeoFlow tour

After the tour has finished the analysts can explore the final costs incurred in each state by zooming and moving around the map viewer.

Summary
In this scenario you have seen how A. Datum created a data warehouse in an HDInsight cluster by using a Hive database and tables. You have also seen how the analysts loaded data into the data warehouse and used various techniques to optimize query performance before consuming the data in the data warehouse from an analytical tool such as Excel.

If you would like to work through the solution to this scenario yourself, you can download the files and a step-by-step guide. These are included in the related Hands-on Labs available from http://wag.codeplex.com/releases/view/103405.
Scenario 4: Enterprise BI Integration
This scenario explores ways in which Big Data processing with HDInsight can be integrated into an enterprise business intelligence (BI) solution. The emphasis in this scenario is on the challenges and techniques associated with integrating data from HDInsight into a BI ecosystem based on Microsoft SQL Server and Office technologies, including integration at the report, corporate data model, and data warehouse layers of the BI solution.

The data ingestion and processing elements of the example used in this scenario have been deliberately kept simple in order to focus on the integration techniques. In a real-world solution the challenges of obtaining the source data, loading it into the HDInsight cluster, and using Map/Reduce code, Pig, or Hive to process it before consuming it in a BI infrastructure are likely to be more complex than described in this scenario. For detailed guidance about ingesting and processing data with HDInsight, see the main chapters and other scenarios in this guide.
This chapter describes the scenario in general and the solutions applied to meet the goals of the scenario. If you would like to work through the solution to this scenario yourself you can download the files and a step-by-step guide. These are included in the related Hands-on Labs available from http://wag.codeplex.com/releases/view/103405.

Introduction to Adventure Works
Adventure Works is a fictional company that manufactures and sells bicycles and cycling equipment. Adventure Works sells its products through an international network of resellers and also through its own e-commerce site, which is hosted in an Internet Information Services (IIS)-based datacenter. As a large-scale multinational company, Adventure Works has already made a considerable investment in data and BI technologies to support formal corporate reporting, as well as for business analysis.

The Existing Enterprise BI Solution
The BI solution at Adventure Works is built on an enterprise data warehouse hosted in SQL Server Enterprise Edition. SQL Server Integration Services (SSIS) is used to refresh the data warehouse with new and updated data from line of business systems. This includes sales transactions, financial accounts, and customer profile data.
The high-level architecture of the Adventure Works BI solution is shown in Figure 1.
Figure 1 The Adventure Works enterprise BI solution
The enterprise data warehouse is based on a dimensional design in which multiple dimensions of the business are conformed across aggregated measures. The dimensions are implemented as dimension tables, and the measures are stored in fact tables at the lowest common level of grain (or granularity).
A partial schema of the data warehouse is shown in Figure 2.
Figure 2 Partial data warehouse schema
Figure 2 shows only a subset of the data warehouse schema. Some tables and columns have been omitted for clarity.
The data warehouse supports a corporate data model, which is implemented as a SQL Server Analysis Services (SSAS) tabular database. This data model provides a cube for analysis in Microsoft Excel and reporting in SQL Server Reporting Services (SSRS), and is shown in Figure 3.
Figure 3 An SSAS tabular data model
The SSAS data model supports corporate reporting, including a sales performance dashboard that shows a scorecard of key performance indicators (KPIs), as shown in Figure 4. Additionally, the data model is used by Adventure Works business users to create PivotTables and PivotCharts in Microsoft Excel.
Figure 4 An SSRS report based on the SSAS data model
We want to empower our users by giving them access to corporate information that can help them to make better business decisions. For this reason we implement a self-service approach by exposing interesting data at the corporate data model level.

In addition to the managed reporting and analysis supported by the corporate data model, Adventure Works has empowered business analysts to engage in self-service BI activities, including the creation of SSRS reports with Report Builder and the use of PowerPivot in Excel to create personal data models directly from tables in the data warehouse, as shown in Figure 5.
Figure 5 Creating a personal data model with PowerPivot
If you are not familiar with the terminology and design of enterprise BI systems, you will find the overview in Appendix B of this guide provides useful background information.
The Analytical Goals
The existing BI solution provides comprehensive reporting and analysis of important business data. However, the IIS web server farm on which the e-commerce site is hosted generates log files, which are retained and used for troubleshooting purposes but until now have never been considered a viable source of business information. The web servers generate a new log file each day, and each log contains details of every request received and processed by the web site. This provides a huge volume of data that could potentially provide useful insights into customers' activity on the e-commerce site. A small subset of the log file data is shown in the following example.

Log file contents
#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2008-01-01 09:15:23
#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referrer)
2008-01-01 00:01:00 198.51.100.2 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100
2008-01-01 00:04:00 198.51.100.5 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100 Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com
2008-01-01 00:05:00 198.51.100.6 - 192.0.0.1 80 GET /product.aspx productid=BL-2036 200 1000 1000 100 Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com
2008-01-01 00:06:00 198.51.100.7 - 192.0.0.1 80 GET /default.aspx - 200 1000 1000 100 Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com
/product.aspx productid=CN-6137 200 1000 1000 100 Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0) www.bing.com
The ability to analyze the log data and summarize website activity over time would help the business measure the amount of data transferred during web requests, and potentially correlate web activity with sales transactions to better understand trends and patterns in e-commerce sales. However, the large volume of log data that would need to be processed in order to extract these insights has prevented the company from attempting to include the log data in the enterprise data warehouse. The company has recently decided to use HDInsight to process and summarize the log data so that it can be reduced to a more manageable volume, and integrated into the enterprise BI ecosystem.
Of course, you also need to consider how you will access the log files and load them into HDInsight, especially if they are distributed across a server farm in another domain or an off-premises location. The section Loading Web Server Log Files in Chapter 3 of this guide explores some of the approaches you might follow.
The HDInsight Solution The HDInsight solution is based on a Windows Azure HDInsight cluster, enabling Adventure Works to dynamically increase or reduce cluster resources as required. The IIS log files are uploaded to Windows Azure blob storage each day using a custom file upload utility. The files are then retained in blob storage to support data processing with HDInsight.
For information and guidance on uploading data to HDInsight, and some of the ways you can maximize performance by compressing the data and combining or splitting files, see Chapter 3 of this guide.
Creating the Hive Tables To support the goal for summarization of the data, the BI developer intends to create a Hive table that can be queried from client applications such as Excel. However, the raw source data includes header rows that must be excluded from the analysis, and the tab-delimited text format of the source data will not provide optimal query performance as the volume of data grows. The developer therefore decided to create a temporary staging table based on the uploaded data, and use the staging table as a source for a query that loads the required data into a permanent table that is optimized for the typical analytical queries that will be used. The following HiveQL statement has been used to define the staging table over the folder containing the log files:
HiveQL
CREATE TABLE log_staging
 (logdate STRING, logtime STRING, c_ip STRING, cs_username STRING, s_ip STRING,
  s_port STRING, cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING,
  sc_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT,
  cs_User_Agent STRING, cs_Referrer STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '32'
STORED AS TEXTFILE LOCATION '/data';

This Hive table defines a schema for the log file, making it possible to use a query that filters the rows in order to load the required data into a permanent table for analysis. Notice that the staging table is based on the /data folder but it is not defined as EXTERNAL, so dropping the staging table after the required rows have been loaded into the permanent table will delete the source files that are no longer required.

When designing the permanent table for analytical queries, the BI developer has decided to partition the data by year and month to improve query performance when extracting data. To achieve this, a second Hive statement is used to define a partitioned table. Notice the PARTITIONED BY clause near the end of the following script. This instructs Hive to add two columns named year and month to the table, and to partition the data loaded into the table based on the values inserted into these columns.
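A definition along the following lines would satisfy that description; the column list mirrors log_staging, while the INT partition column types and the SEQUENCEFILE storage format are assumptions:

```sql
CREATE TABLE iis_log
 (logdate STRING, logtime STRING, c_ip STRING, cs_username STRING, s_ip STRING,
  s_port STRING, cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING,
  sc_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT,
  cs_User_Agent STRING, cs_Referrer STRING)
PARTITIONED BY (year INT, month INT)
STORED AS SEQUENCEFILE;
```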
Next, a Hive script is used to load the data into the new iis_log Hive table. This script takes the values from the columns in the log_staging Hive table, calculates the values for the year and month of each row, and inserts the rows into the partitioned iis_log Hive table. Recall also that the source log data includes a number of header rows that are prefixed with a # character, which could cause errors or add unnecessary complexity to summarizing the data. To resolve this, the HiveQL statement includes a WHERE clause that ignores rows starting with # so that they are not loaded into the permanent table.
HiveQL
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET hive.exec.dynamic.partition.mode=nonstrict;

FROM log_staging s
INSERT INTO TABLE iis_log PARTITION (year, month)
SELECT s.logdate, s.logtime, s.c_ip, s.cs_username, s.s_ip, s.s_port,
       s.cs_method, s.cs_uri_stem, s.cs_uri_query, s.sc_status, s.sc_bytes,
       s.cs_bytes, s.time_taken, s.cs_User_Agent, s.cs_Referrer,
       SUBSTR(s.logdate, 1, 4) year, SUBSTR(s.logdate, 6, 2) month
WHERE SUBSTR(s.logdate, 1, 1) <> '#';
To maximize performance, the script includes statements that specify that the output from the query should be compressed. For details of how you can use Hadoop configuration settings in this way, see the section Tuning the Solution in Chapter 4 of this guide. The code also sets the dynamic partition mode to nonstrict, enabling rows to be inserted dynamically into the appropriate partitions based on the values of the partition columns.
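The benefit of partitioning appears at query time: when a query filters on the partition columns, Hive reads only the matching partitions instead of scanning the whole table (partition pruning). The following hypothetical HiveQL query illustrates the idea; the column choices are for illustration only:

```sql
-- Only the year=2012/month=03 partition is scanned, not the whole table.
SELECT cs_uri_stem, COUNT(*) hits
FROM iis_log
WHERE year = '2012' AND month = '03'
GROUP BY cs_uri_stem;
```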
Indexing the Hive Table

To further improve performance, the BI developer at Adventure Works decided to index the iis_log table on the date, because most queries will need to select and sort the results by date. The following HiveQL statement creates an index on the logdate column of the table.

HiveQL
CREATE INDEX idx_logdate ON TABLE iis_log (logdate)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

When data is added to the iis_log table the index can be updated using the following HiveQL statement.

HiveQL
ALTER INDEX idx_logdate ON iis_log REBUILD;
Our tests revealed that indexing the tables provided only a small improvement in query performance, and that building the index took longer than the time saved by running the query. However, the results depend on factors such as the volume of source data, so you should experiment to see whether indexing is a useful optimization technique in each scenario.
Returning the Required Data

The data in HDInsight is now ready for use in the corporate BI system, and can be consumed by querying the iis_log table. For example, the following script returns all of the data for all years and months.

HiveQL
SELECT * FROM iis_log
Where a more restricted dataset is required, the BI developer or business user can use a script that filters on the year and month columns, and transforms the data as required. For example, the following script extracts just the data for the first quarter of 2012, aggregates the number of hits for each day (the logdate column contains the date in the form yyyy-mm-dd), and returns a dataset with two columns: the date and the total number of page hits.

HiveQL
SELECT logdate, COUNT(*) pagehits
FROM iis_log
WHERE year = 2012 AND month < 4
GROUP BY logdate

Report-Level Integration

Now that the log data is encapsulated in a Hive table, HiveQL queries can be used to summarize the log entries and extract aggregated values for reporting and analysis. Initially, business analysts at Adventure Works want to try to find a relationship between the number of website hits (obtained from the web server log data in HDInsight) and the number of items sold (available from the enterprise data warehouse). Since only the business analysts require this combined data, there is no need at this stage to integrate the web log data from HDInsight into the entire enterprise BI solution. Instead, a business analyst can use PowerPivot to create a personal data model in Excel specifically for this mash-up analysis.

Adding Data from a Hive Table to a PowerPivot Model

The business analyst starts with the PowerPivot model shown previously in Figure 5, which includes a Date table and an Internet Sales table based on the DimDate and FactInternetSales tables in the data warehouse. To add IIS log data from HDInsight, the business analyst uses an ODBC connection based on the Hive ODBC driver, as shown in Figure 6.
The ODBC connection to the HDInsight cluster is typically defined in a data source name (DSN) on the local computer, which makes it easy to define a connection for programs that will access data in the cluster. The DSN encapsulates a connection string such as this: Driver={Hive};DATABASE=default;AUTHENTICATION=3;UID=Username;PWD=Password; HOST=awhdinsight.cloudapp.net;PORT=563;FRAMED=0;BINARY_FETCH=0;BATCH_SIZE=65536
Figure 6 Creating an ODBC connection to Hive on HDInsight
After the connection has been defined and tested, the business analyst uses the following HiveQL query to create a new table named Page Hits that contains aggregated log data from HDInsight.

HiveQL
SELECT logdate, COUNT(*) hits
FROM iis_log
GROUP BY logdate
This query returns a single row for each distinct date for which log entries exist, along with a count of the number of page hits that were recorded on that date. The logdate values in the underlying Hive table are defined as text, but their yyyy-mm-dd format means that the business analyst can simply change the data type for the column in the PowerPivot table to Date, making it possible to create a relationship that joins the logdate column in the Page Hits table to the Date column in the Date table, as shown in Figure 7.
Figure 7 Creating a relationship in PowerPivot
When you are designing Hive queries that return tables you want to integrate with your BI system, you should plan ahead by considering the appropriate format for columns that contain key values you will use in the relationships with existing tables.
With the log data imported into the PowerPivot model, and the relationship between the Page Hits and Date tables in place, the business analyst can now use Excel to analyze the data and try to correlate website activity, in the form of page hits, with sales transactions in terms of the number of units sold. Figure 8 shows how the business analyst can use Power View in Excel to visually compare page hits and sales over a six-month period and by quarter, determine weekly patterns of website activity, and filter the visualizations to show comparisons for a specific weekday.
Figure 8 Using Power View in Excel to analyze data from HDInsight and the data warehouse
By integrating IIS log data from HDInsight with enterprise BI data at the report level, business analysts can create mash-up reports and analyses without impacting the BI infrastructure used for corporate reporting. However, after using this report-level integration to explore the possibilities of using IIS log data to increase understanding of the business, it has become apparent that the log data could be useful to a wider audience than just business analysts, and for a wider range of business processes.

Corporate Data Model Integration

Senior executives of the e-commerce division want to expose the newly discovered information from the web server log files to a wider audience, and use it in BI-focused business processes. Specifically, they want to include the ratio of web page hits to sales as a core metric that can be included as a key performance indicator (KPI) for the business in a scorecard. The goal is to achieve a hits to sales ratio of around 3% (in other words, three items are sold for every hundred web page hits).

Scorecards and dashboards for Adventure Works are currently based on the SSAS corporate data model, which is also used to support formal reports and analytical business processes. The corporate data model is implemented as an SSAS database in tabular mode, so the process to add a table for the IIS log data is similar to the one used to import the results of a HiveQL query into a PowerPivot model. A BI developer uses SQL Server Data Tools to add an ODBC connection to the HDInsight cluster and create a new table named Page Hits based on the following query. Notice that this query includes more columns than the one used previously in the personal data model, making it useful for more kinds of analysis by a wider audience.
HiveQL
SELECT logdate, SUM(sc_bytes) sc_bytes, SUM(cs_bytes) cs_bytes, COUNT(*) pagehits
FROM iis_log
GROUP BY logdate
The fact that we are using SSAS in tabular mode makes it possible to connect to an ODBC source such as Hive. If we had been using SSAS in multidimensional mode we would have had to either extract the data from HDInsight into an OLE DB compliant data source, or base the data model on a SQL Server database that has a remote server connection over ODBC to the Hive tables.
After the Page Hits table has been created and the data imported into the model, the data type of the logdate column is changed to Date and a relationship is created with the Date table, in the same way as in the PowerPivot data model discussed earlier. However, one significant difference between PowerPivot models and SSAS tabular models is that SSAS does not create implicit aggregated measures from numeric columns in the way that PowerPivot does. The Page Hits table contains a row for each date, with the total bytes sent and received and the total number of page hits for that date, so the BI developer created explicit measures to aggregate the sc_bytes, cs_bytes, and pagehits values across multiple dates, based on the Data Analysis Expressions (DAX) formulas shown in Figure 9.
Figure 9 Creating measures in an SSAS tabular data model
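Figure 9 is shown as a screenshot of the measure grid; based on the descriptions in the surrounding text, the formulas would look similar to the following DAX sketch (the measure and column names are assumed from the text, not taken from the figure):

```dax
Bytes Sent:=SUM([sc_bytes])
Bytes Received:=SUM([cs_bytes])
Hits:=SUM([pagehits])
```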
These measures include DAX expressions that calculate the sum of sc_bytes as a measure named Bytes Sent, the sum of cs_bytes as a measure named Bytes Received, and the sum of page hits as a measure named Hits. Additionally, to support the requirement to track page hits to sales performance against a target of 3%, the following measures have been added to the Internet Sales table.

DAX
Actual Units:=SUM([Order Quantity])
Target Units:=([Hits]*0.03)
A KPI is then defined on the Actual Units measure, as shown in Figure 10.
Figure 10 Defining a KPI in a tabular data model
When the changes to the data model are complete, it is deployed to an SSAS server and fully processed with the latest data. It can then be used as a data source for reports and analytical tools. For example, Figure 11 shows that the dashboard report used to summarize business performance has been modified to include a scorecard for the hits to sales ratio KPI that was added to the data model.
Figure 11 A report showing a KPI from a corporate data model
Dashboards are a great way to present high-level views of information, and are often appreciated most by business managers because they provide an easy way to keep track of performance of multiple sectors across the organization. By including the ability to drill down into the information, you make the dashboard even more valuable to these types of users.
Additionally, because the data from HDInsight has now been integrated at the corporate data model level, it is more readily available to a wider range of users who may not have the necessary skills and experience to create their own data models or reports from an HDInsight source. For example, Figure 12 shows how a network administrator in the IT department can use the corporate data model as a source for a PivotChart in Excel showing data transfer information for the e-commerce site.
Figure 12 Using a corporate data model in Excel

Data Warehouse Integration

The IIS log entries include the query string passed to the web page, which for pages that display information about a product includes a productid parameter with a value that matches the relevant product code. Business analysts at Adventure Works would like to use this information to analyze page hits for individual products, and include this information in managed and self-service reports. Details of products are already stored in the data warehouse, but since product is implemented as a slowly changing dimension, the product records in the data warehouse are uniquely identified by a surrogate key and not by the original product code. Although the product code referenced in the query string is retained as an alternate key, it is not guaranteed to be unique if details of the product have changed over time.

The most problematic task in achieving integration with BI systems at the data warehouse level is typically matching the keys in the source data with the correct surrogate key in the data warehouse tables, where changes to the existing data over time prompt the use of an alternate key. Appendix A of this guide explores this issue in more detail.
The requirement to match the product code to the surrogate key for the appropriate version of the product makes integration at the report or corporate data model levels problematic. It is possible at any level to perform complex lookups to find the appropriate surrogate key for an alternate key at the time a specific page hit occurred (assuming both the surrogate and alternate keys for the products are included in the data model or report dataset). However, it is more practical to integrate the IIS log data into the dimensional model of the data warehouse so that the relationship with the product dimension (and the date dimension) is present throughout the entire enterprise BI stack.

Changes to the Data Warehouse Schema

To integrate the IIS log data into the dimensional model of the data warehouse, the BI developer creates a new fact table in the data warehouse for the IIS log data, as shown in the following Transact-SQL statement.

Transact-SQL
CREATE TABLE [dbo].[FactIISLog](
  [LogDateKey] [int] NOT NULL REFERENCES DimDate(DateKey),
  [ProductKey] [int] NOT NULL REFERENCES DimProduct(ProductKey),
  [BytesSent] decimal NULL,
  [BytesReceived] decimal NULL,
  [PageHits] int NULL);

The table will be loaded with new log data on a regular schedule as part of the ETL process for the data warehouse. In common with most data warehouse ETL processes, the solution at Adventure Works makes use of staging tables as an interim store for new data, making it easier to coordinate data loads into multiple tables and perform lookups for surrogate key values. A staging table is created in a separate staging schema using the following Transact-SQL statement:

Transact-SQL
CREATE TABLE staging.IISLog(
  [LogDate] nvarchar(50) NOT NULL,
  [ProductID] nvarchar(50) NOT NULL,
  [BytesSent] decimal NULL,
  [BytesReceived] decimal NULL,
  [PageHits] int NULL);
Good practice when regularly loading data into a data warehouse is to minimize the amount of data extracted from each data source, so that only data that has been inserted or modified since the last refresh cycle is included. This minimizes extraction and load times, and reduces the impact of the ETL process on network bandwidth and storage utilization. There are many common techniques you can use to restrict extractions to only modified data, and some data sources support change tracking or change data capture (CDC) capabilities to simplify this.

In the absence of support in Hive tables for restricting extractions to only modified data, the BI developers at Adventure Works have decided to use a common pattern that is often referred to as a high water mark technique. In this pattern the highest log date value that has been loaded into the data warehouse is recorded, and used as a filter boundary for the next extraction. To facilitate this, the following Transact-SQL statement is used to create an extraction log table and initialize it with a default value:

Transact-SQL
CREATE TABLE staging.highwater(
  [ExtractDate] datetime DEFAULT GETDATE(),
  [HighValue] nvarchar(200));
INSERT INTO staging.highwater (HighValue) VALUES ('0000-00-00');

Extracting Data from HDInsight

After the modifications to the data warehouse schema and staging area have been made, a solution to extract the IIS log data from HDInsight can be designed. There are many ways to transfer data from HDInsight to a SQL Server database, including:

- Use Sqoop to export the data from HDInsight and push it to SQL Server.
- Use PolyBase to synchronize data in an HDInsight cluster with a SQL Server PDW database.
- Use SSIS to extract the data from HDInsight and load it into SQL Server.
PolyBase is available only as part of the SQL Server Parallel Data Warehouse (PDW) appliance. If you are not using PDW you will need to investigate using Sqoop, SSIS, or an alternative solution.

In the case of the Adventure Works scenario, the data warehouse is hosted on an on-premises server that cannot be accessed from outside the corporate firewall. Since the HDInsight cluster is hosted externally in Windows Azure, the use of Sqoop to push the data to SQL Server is not a viable option. In addition, the SQL Server instance used to host the data warehouse is running SQL Server 2012 Enterprise Edition, not SQL Server Parallel Data Warehouse (PDW), so PolyBase cannot be used in this scenario either.

The most appropriate option, therefore, is to use SSIS to implement a package that transfers data from HDInsight to the staging table. Since SSIS is already used as the ETL platform for loading data from other business sources into the data warehouse, this option also reduces the development and management challenges of implementing an ETL solution to extract data from HDInsight. The control flow for an SSIS package to extract the IIS log data from HDInsight is shown in Figure 13.
Figure 13 SSIS control flow for HDInsight data extraction

This control flow performs the following sequence of tasks:
1. An Execute SQL task extracts the HighValue value from the staging.highwater table and stores it in a variable named HighWaterMark.
2. An Execute SQL task truncates the staging.IISLog table, removing any rows left over from the last time the extraction process was executed.
3. A Data Flow task transfers the data from HDInsight to the staging table.
4. An Execute SQL task updates the HighValue value in staging.highwater with the highest log date value that has been extracted.
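The Transact-SQL executed by the Execute SQL tasks in steps 1 and 4 is not shown here; it would look something like the following sketch. The exact statements used in the package are assumptions (for example, the package might update the existing row rather than insert a new one):

```sql
-- Step 1 (sketch): read the most recent high water mark into the
-- HighWaterMark SSIS variable.
SELECT TOP (1) HighValue
FROM staging.highwater
ORDER BY ExtractDate DESC;

-- Step 4 (sketch): record the highest log date that has now been staged;
-- ExtractDate defaults to GETDATE() via the table definition.
INSERT INTO staging.highwater (HighValue)
SELECT MAX(LogDate) FROM staging.IISLog;
```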
The Data Flow task in step 3 extracts the data from HDInsight and performs any necessary data type validation and conversion before loading it into the staging table, as shown in Figure 14.
Figure 14 SSIS data flow to extract data from HDInsight
The data flow consists of the following sequence of components:
1. An ODBC Source that extracts data from the iis_log Hive table in the HDInsight cluster.
2. A Data Conversion transformation that ensures data type compatibility and prevents field truncation issues.
3. An OLE DB Destination that loads the extracted data into the staging.IISLog table.
If there is a substantial volume of data in the Hive table the extraction might take a long time, so it might be necessary to set the timeout on the ODBC source component to a high value. Alternatively, the refresh of data into the data warehouse could be carried out less frequently, such as every week. You could create a separate Hive table for the extraction and automate running a HiveQL statement every day to copy the log entries for that day to the extraction table after they've been uploaded into the iis_page_hits table. At the end of the week you perform the extraction from the (smaller) extraction table, and then empty it ready for the next week's data.
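The daily copy into such an extraction table could be sketched in HiveQL as follows; the table name iis_extract and the literal date are hypothetical, and in practice the date filter would be supplied by whatever mechanism schedules the statement:

```sql
-- Daily (sketch): append the current day's log entries to a smaller
-- extraction table, so the weekly pull reads far less data.
INSERT INTO TABLE iis_extract
SELECT *
FROM iis_log
WHERE logdate = '2013-05-01';  -- replaced with the current date when scheduled
```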
Every SSIS Data Flow includes an Expressions property that can be used to dynamically assign values to properties of the data flow or the components it contains, as shown in Figure 15.
Figure 15 Assigning a property expression

In this data flow, the SqlCommand property of the Hive Table ODBC source component is assigned using the following expression.

SSIS Expression
"SELECT logdate, REGEXP_REPLACE(cs_uri_query, 'productid=', '') productid,
 SUM(sc_bytes) sc_bytes, SUM(cs_bytes) cs_bytes, COUNT(*) PageHits
 FROM iis_log WHERE logdate > '" + @[User::HighWaterMark] + "'
 GROUP BY logdate, REGEXP_REPLACE(cs_uri_query, 'productid=', '')"

This expression consists of a HiveQL query to extract the required data, combined with the value of the HighWaterMark SSIS variable to filter the data being extracted so that only rows with a logdate value greater than the highest one already in the data warehouse are included. Based on the log files for the Adventure Works e-commerce site, the cs_uri_query values in the web server log file contain either the value - (for requests with no query string) or productid=product-code (where product-code is the product code for the requested product). The HiveQL query includes a regular expression that parses the cs_uri_query value and removes the text productid=. The Hive query therefore generates a results set that includes a productid column, which contains either the product code value or -.
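Note that REGEXP_REPLACE works here only because the query string never contains more than the single productid parameter. For query strings with multiple parameters, Hive's regexp_extract function with a capture group is often more robust; a hypothetical sketch:

```sql
-- Hypothetical: pull out just the productid value from a multi-parameter
-- query string such as 'productid=BK-M82S-44&color=red'.
SELECT regexp_extract(cs_uri_query, 'productid=([^&]+)', 1) productid
FROM iis_log;
```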
The query string example in this scenario is deliberately simplistic in order to reduce complexity. In a real-world solution, parsing query strings in a web server log may require a significantly more complex expression, and may even require a user-defined Java function.

After the data has been extracted it flows to the Data Type Conversion transformation, which converts the logdate and productid values to 50-character Unicode strings. The rows then flow to the Staging Table destination, which loads them into the staging.IISLog table.
The document Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS) contains a wealth of useful information about using SSIS with HDInsight.
Loading Staged Data into a Fact Table

After the data has been staged, a second SSIS package is used to load it into the fact table in the data warehouse. This package consists of a control flow that contains a single Execute SQL task to execute the following Transact-SQL code.

Transact-SQL
INSERT INTO dbo.FactIISLog
  (LogDateKey, ProductKey, BytesSent, BytesReceived, PageHits)
SELECT d.DateKey, p.ProductKey, s.BytesSent, s.BytesReceived, s.PageHits
FROM staging.IISLog s
JOIN DimDate d ON s.LogDate = d.FullDateAlternateKey
JOIN DimProduct p ON s.productid = p.ProductAlternateKey
  AND (p.StartDate <= s.LogDate AND (p.EndDate IS NULL OR p.EndDate > s.LogDate))
ORDER BY d.DateKey;

The code inserts rows from the staging.IISLog table into the dbo.FactIISLog table, looking up the appropriate dimension keys for the date and product dimensions. The surrogate key for the date dimension is an integer value derived from the year, month, and day. The LogDate value extracted from HDInsight is a string in the format YYYY-MM-DD. SQL Server can implicitly convert values in this format to the Date data type, so a join can be made to the FullDateAlternateKey column in the DimDate table to find the appropriate surrogate DateKey value.

The DimProduct dimension table includes a row for None (with the ProductAlternateKey value -), and one or more rows for each product. Each product row has a unique ProductKey value (the surrogate key) and an alternate key that matches the product code extracted by the Hive query. However, because product is a slowly changing dimension there may be multiple rows with the same ProductAlternateKey value, each representing the same product at a different point in time.
When loading the product data, the appropriate surrogate key for the version of the product that was current when the web page was requested must be looked up based on the alternate key and the start and end date values associated with the dimension record, so the join for the DimProduct table in the Transact-SQL code includes a clause to check for a StartDate value that is before the log date, and an EndDate value that is either after the log date or null (for the record representing the current version of the product member).
For more details about the use of surrogate keys and slowly changing dimensions in a data warehouse, refer to Appendix B.

Using the Data

Now that the IIS log data has been summarized by HDInsight and loaded into a fact table in the data warehouse, it can be used in corporate data models and reports throughout the organization. The data has been conformed to the dimensional model of the data warehouse, and this deep integration enables business users to intuitively aggregate IIS activity across dates and products. For example, Figure 16 shows how a user can create a Power View visualization in Excel that includes sales and page view information for product categories and individual products.
Figure 16 A Power View report showing data that is integrated in the data warehouse
The inclusion of IIS server log data from HDInsight in the data warehouse enables it to be used easily throughout the entire BI ecosystem, in managed corporate reports and self-service BI scenarios. For example, a business user can use Report Builder to create a report that includes web site activity as well as sales revenue from a single dataset, as shown in Figure 17.
Figure 17 Creating a self-service report with Report Builder
When published to an SSRS report server, this self-service report provides an integrated view of business data, as shown in Figure 18.
Figure 18 A self-service report based on a data warehouse that contains data from HDInsight

Summary

This scenario explored some of the techniques available to integrate Big Data processing in HDInsight with an enterprise BI solution. The specific integration techniques that are appropriate for your organization may vary from those shown in this scenario, but in general you should seek to integrate Big Data into your enterprise BI infrastructure at the most appropriate layer in the BI architecture, based on your requirements for audience reach, integration into existing business processes, support for self-service reporting, and data manageability. For more guidance about integrating big data into enterprise BI, see Chapter 2 and Appendix B of this guide.

If you would like to work through the solution to this scenario yourself, you can download the files and a step-by-step guide. These are included in the related Hands-on Labs available from http://wag.codeplex.com/releases/view/103405.
Appendix A: Data Ingestion Patterns using the Twitter API

This appendix demonstrates the three main patterns described in Chapter 3 of the guide for loading data into an HDInsight cluster:

- Interactive data ingestion. This approach is typically used for small volumes of data and when experimenting with datasets.
- Automated batch upload. If you decide to integrate an HDInsight query process with your existing systems you will typically create an automated mechanism for uploading the data into HDInsight.
- Real-time data stream capture. If you need to process data that arrives as a stream you must implement a mechanism that loads this into HDInsight.
The examples in this appendix are related to the Interactive Exploration scenario described in Chapter 8 of the guide, and in the associated Hands-on Lab. They demonstrate how the BI team at the fictitious company Blue Yonder Airlines implemented solutions based on the three ingestion patterns to obtain lists of tweets by using the Twitter API, and load them into Windows Azure blob storage for use in HDInsight queries and for subsequent analysis.

About the Twitter API

Twitter provides a number of public APIs that you can use to obtain data for analysis. The REST-based Twitter search API provides a simple way to search for tweets that contain specific query terms. To return a collection of tweets that contain a search phrase, you simply specify a parameter named q with the appropriate search string, and optionally an rpp parameter to specify the number of tweets to be returned per page. For example, the following request returns up to 100 tweets containing the term @BlueYonderAirlines in Atom format:

http://search.twitter.com/search.atom?q=@BlueYonderAirlines&rpp=100

You can also request the results in RSS format, as shown by the following request:

http://search.twitter.com/search.rss?q=@BlueYonderAirlines&rpp=100

You can restrict the range of results by including since or until parameters in the query term. For example, the following request returns tweets from March 1st 2013 onward:

http://search.twitter.com/search.rss?q=@BlueYonderAirlines since:2013-03-01&rpp=100
For more sophisticated filtering you can specify a since_id parameter with the ID of a tweet (usually the ID of the most recent tweet provided in the Atom or RSS response from a previous request).
If you want to analyze customer sentiment, the Twitter API includes semantic logic to categorize tweets as positive or negative based on key terms used in the tweet content. You can filter the results of a search based on sentiment by including an emoticon in the search string. For example, the following request returns up to 100 tweets that mention @BlueYonderAirlines and are positive in sentiment:

http://search.twitter.com/search.rss?q=@BlueYonderAirlines :)&rpp=100

Similarly, the following request returns tweets that are negative in sentiment:

http://search.twitter.com/search.rss?q=@BlueYonderAirlines :(&rpp=100

In addition to the REST API, the Twitter Streaming API offers a way to consume an ongoing real-time stream of tweets based on filtering criteria. Streaming requests are of the form:

https://stream.twitter.com/1.1/statuses/filter.json?track=@BlueYonderAirlines
You will see how the streaming API can be used in the section Capturing Real-Time Data from Twitter with StreamInsight of this appendix.
For more information about Twitter APIs, see the developer documentation on the Twitter web site.
Using Interactive Data Ingestion

During the initial stage of the project, the team decided to create a proof of concept with a small volume of data. They obtained this interactively in Excel by sending requests to the search feature of Twitter, and manually uploaded the results to HDInsight through the interactive console.

Interactive Data Collection Using Excel

To use Excel to extract a collection of tweets from Twitter, the team used the Get External Data option on the Data tab of the ribbon, selecting From Web in the drop-down list. The New Web Query dialog provides a textbox where they specified a URL that corresponds to a REST-based search request, as shown in Figure 1.
Figure 1 Using Excel to extract search results from Twitter
The import process resulted in a table of data with a row for each tweet returned by the query, as shown in Figure 2.
Figure 2 Twitter data in Excel
After the data had been imported, the team at Blue Yonder Airlines removed unnecessary columns and made some formatting changes. To simplify analysis the team decided to include only the publication date, ID, title, and author of each tweet, and to format the publication date using the format DD-MM-YYYY. These changes are easy to make in Excel, and the worksheet was then saved as a tab-delimited text file. This process was repeated on a regular basis to generate sufficient data for validating the results of the initial analysis.
Interactive Data Upload Using the JavaScript Console

To upload the text files to the HDInsight cluster for analysis, the team used the fs.put() JavaScript command in the interactive console area of the HDInsight dashboard. This requires only the path and name of the files to upload, and the destination path and name in HDInsight storage, as shown in Figure 3.
Figure 3 Uploading a file to HDInsight
To view the contents of a folder the team used the #ls command in the HDInsight interactive console, as shown in Figure 4.
Figure 4 Browsing files in the HDInsight interactive console

To view the contents of a file after it had been uploaded the team used the #cat command.

Uploading Large Files

The interactive console is suitable for uploading small files, but it has an upper limit on the file size. To support uploads of large files, and to improve the efficiency of data uploads, the team at Blue Yonder Airlines explored using:

- The Windows Azure SDK features available for Visual Studio, which provide access to blob storage in Server Manager.
- A third party utility such as those listed in the page How to Upload Data to HDInsight on the HDInsight website.
- A custom command-line upload utility.
They chose to create a simple utility that they can run from the command line by using the following syntax:

Command Line
BlobManager.UploadConsoleClient.exe source blob_store key container [, autoclose]

The arguments for the BlobManager.UploadConsoleClient.exe utility are described in the following table:

Argument     Description
source       A local file or folder. If a folder is specified, all files in that folder are uploaded.
blob_store   The name of the Windows Azure Blob Store to which the data is to be uploaded.
key          The passkey used to authenticate the connection to the Windows Azure Blob Store.
container    The name of the container in the Windows Azure Blob Store to which the data should be uploaded. The uploaded files will be placed in a folder named data within the specified container.
autoclose    An optional argument. If the value --autoclose is passed, the console application closes automatically after the upload is complete. Otherwise, a prompt to press a key is displayed.

The upload utility splits the files into 4 MB blocks and uses 16 I/O threads to upload 16 blocks (64 MB) in parallel. After each block is uploaded it is marked as committed and the next block is uploaded. When all of the blocks have been uploaded they are reassembled into the original files in Windows Azure blob storage.
The source code for the Blob Manager utility that Blue Yonder used is included in the associated Hands-on Labs you can download for this guide. To download the Hands-on Labs go to http://wag.codeplex.com/releases/view/103405. The Blob Manager code is in the Lab 4 - Enterprise BI Integration\Labfiles\BlobUploader subfolder of the sample code.

Automating Data Ingestion with SSIS

Interactively downloading data from Twitter and uploading it to HDInsight is appropriate for initial proof of concept testing. However, to provide a meaningful dataset the team at Blue Yonder Airlines must obtain and upload data from Twitter regularly over a period of several weeks. If they determine that the data analysis can provide useful and valid information, they will want to consider automating the data ingestion process. After the initial investigation of the data it was clear that it would be useful to continue the analysis process to provide ongoing insights into the information it contains. To accomplish this, the team created a SQL Server Integration Services (SSIS) package that extracts the data from Twitter and uploads it to Windows Azure blob storage automatically every day. SSIS makes it easy to encapsulate the workflow that obtains and uploads the Twitter data in a package, which can then be deployed to an SSIS catalog in SQL Server and scheduled for automatic execution at a regular interval.
A set of reusable components for accessing Windows Azure blob storage with SSIS can be obtained from Azure Blob Storage Components for SSIS Sample on Codeplex.
The SSIS project includes the following project-level parameters:

- Query: The search term used to filter tweets from the Twitter search API.
- LocalFolder: The local file system folder in which the search results downloaded from Twitter will be stored.
- BlobstoreName: The name of the Windows Azure blob storage account associated with the HDInsight cluster.
- BlobstoreKey: The passkey required to connect to Windows Azure blob storage.
- BlobstoreContainer: The container in Windows Azure blob storage to which the data will be uploaded.

The project contains a package named Tweets.dtsx, which defines the control flow shown in Figure 5.
Figure 5 SSIS control flow for loading Twitter data to HDInsight
The SQL Server Integration Services Package

The control flow for the Tweets.dtsx package consists of the following tasks:

1. Empty Local Folder: a File System task that deletes any existing files in the folder defined by the LocalFolder parameter.
2. Get Tweets: a Data Flow task that extracts the data from Twitter and saves it as a local file in the folder defined by the LocalFolder parameter.
3. Upload Tweets to Azure Blob Store: an Execute Process task that uses the command line tool discussed previously to upload the local file to Windows Azure blob storage.
The Get Tweets data flow task is shown in Figure 6. It consists of a Script component named Twitter Search that implements a custom data source, and a Flat File destination named Tweet File.
Figure 6 Data flow to extract data from Twitter
The package uses the following connection managers:

- TwitterConnection: an HTTP connection manager with the connection string http://search.twitter.com/search.rss?rpp=100.
- TweetFileConnection: a File connection manager whose connection string is the expression @[$Project::LocalFolder] + @[System::ExecutionInstanceGUID] + ".txt". This expression uses the LocalFolder project parameter and the unique runtime execution ID to generate a filename such as C:\data\{12345678-ABCD-FGHI-JKLM-1234567890AB}.txt.

The Twitter Search script component is configured to use the TwitterConnection HTTP connection manager, and has the following output fields defined: TweetDate, TweetID, TweetAuthor, and TweetText. The Script component is granted read-only access to the Query parameter, and includes the following code.

C#
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Xml;
using System.ServiceModel.Syndication;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private SyndicationFeed tweets = null;
    private XmlReader tweetReader = null;

    public override void PreExecute()
    {
        base.PreExecute();
        // Combine the HTTP connection string and the Query parameter to build
        // a search URI that requests only tweets since the start of the current day.
        string uri = Connections.TwitterConnection.ConnectionString
                     + "&q=" + Variables.Query
                     + "&since=" + DateTime.Now.ToString("yyyy-MM-dd");
        tweetReader = XmlReader.Create(uri);
        tweets = SyndicationFeed.Load(tweetReader);
    }

    public override void PostExecute()
    {
        base.PostExecute();
    }

    public override void CreateNewOutputRows()
    {
        if (tweets != null)
        {
            foreach (var tweet in tweets.Items)
            {
                Output0Buffer.AddRow();
                Output0Buffer.TweetDate = tweet.PublishDate.ToString("d");
                Output0Buffer.TweetID = tweet.Id;
                Output0Buffer.TweetAuthor = tweet.Authors[0].Email;
                Output0Buffer.TweetText = tweet.Title.Text;
            }
            Output0Buffer.SetEndOfRowset();
        }
    }
}

Notice that the code combines the HTTP connection string and the parameterized query term to dynamically generate a REST URI for the Twitter search API that filters the results to include only tweets since the start of the current day. The code also includes a reference to the System.ServiceModel.Syndication namespace, which includes classes that make it easier to work with RSS feeds. The PublishDate, Id, first author Email, and Title fields from each tweet in the RSS results are assigned to the output fields and sent through the data flow to the Tweet File destination, which uses the TweetFileConnection connection manager to write the data to a tab-delimited file in the local file system.

After the data flow task has downloaded the data from Twitter, the Execute Process task named Upload Tweets to Azure Blob Store in the control flow uses the custom BlobManager.UploadConsoleClient.exe console application described earlier to upload the file to Windows Azure blob storage. The Execute Process task is configured to use the following expression for the Arguments property (the command line arguments that are passed to the executable):

Control Flow Task
@[$Project::LocalFolder] + " " + @[$Project::BlobstoreName] + " " + @[$Project::BlobstoreKey] + " " + @[$Project::BlobstoreContainer] + " --autoclose"
After the SSIS project is complete it is deployed to an SSIS catalog server, and the Tweets.dtsx package is scheduled to run each day at a specified time with the following parameter values:

- Query: @BlueYonderAirlines
- LocalFolder: c:\data
- BlobstoreName: BlueYonderBlobStore
- BlobstoreKey: 1234567890abcdefghijklmnopqrstuvwxyz&%$@1234567890abcdefghi==
- BlobstoreContainer: blueyonderdata

This automatically retrieves data from Twitter and uploads it to HDInsight at a scheduled time each day.

Capturing Real-Time Data from Twitter with StreamInsight

The SSIS-based solution for ingesting Twitter data into HDInsight provides a way to capture a sample of tweets each day. However, as the analysis of Twitter data is proving useful to the business, a refinement of the project is to capture all of the tweets related to Blue Yonder Airlines in real time and upload them to HDInsight in frequent batches for more detailed analysis. To accomplish this, the team at Blue Yonder Airlines investigated the Twitter streaming API and the use of Microsoft StreamInsight. The Twitter streaming API returns a stream of tweets in JSON format, which the StreamInsight application must deserialize into events that are captured and written to a file. The file can then be uploaded to Windows Azure blob storage when a suitable batch of tweets has been captured. To create the StreamInsight solution, the developers created the following classes:

Tweet.
An event class that represents a tweet event:

C#
[Serializable]
public class Tweet
{
    public long ID { get; set; }
    public DateTime CreatedAt { get; set; }
    public string Text { get; set; }
    public string Source { get; set; }
    public string UserLang { get; set; }
    public int UserFriendsCount { get; set; }
    public string UserDescription { get; set; }
    public string UserTimeZone { get; set; }
    public string UserURL { get; set; }
    public string UserName { get; set; }
    public long UserID { get; set; }
    public long DeleteStatusID { get; set; }
    public long DeleteUserID { get; set; }
}

TweetSource. A class that implements the IObservable<Tweet> interface, and conceptually acts as a wrapper around the Twitter streaming API. This class contains a Start method, which connects to the Twitter streaming API and submits a request for tweets containing the term @BlueYonderAirlines. Each time a JSON-format tweet is read from the response stream it is deserialized and sent to all observer classes that are registered with this observable class.

C#
[Serializable]
public class TweetSource : IObservable<Tweet>
{
    private const string TrackString = "@BlueYonderAirlines";
    private static List<IObserver<Tweet>> observers = new List<IObserver<Tweet>>();
    private static string userName;
    private static string password;
    private static object sync = new object();
    public IDisposable Subscribe(IObserver<Tweet> observer)
    {
        lock (sync)
        {
            if (observers.Count == 0)
            {
                Start();
            }
            observers.Add(observer);
        }
        // Note: this sample returns null rather than a subscription token.
        return null;
    }
    private static void Start()
    {
        ThreadPool.QueueUserWorkItem(
            o =>
            {
                WebRequest request = HttpWebRequest.Create(
                    "https://stream.twitter.com/1.1/statuses/filter.json?track="
                    + TrackString);
                request.Credentials = new NetworkCredential(userName, password);
                var response = request.GetResponse();
                using (StreamReader streamReader =
                    new StreamReader(response.GetResponseStream()))
                {
                    while (true)
                    {
                        string line = streamReader.ReadLine();
                        if (string.IsNullOrWhiteSpace(line))
                        {
                            continue;
                        }
                        JavaScriptSerializer serializer = new JavaScriptSerializer();
                        Tweet tweet = serializer.Deserialize<Tweet>(line);
                        foreach (var observer in observers)
                        {
                            observer.OnNext(tweet);
                        }
                    }
                }
            }, null);
    }
}

TwitterStreamProcessor. A class that encapsulates the StreamInsight complex event processing for the application. The Start method in this class accepts parameters for the user name and password to be used when connecting to Twitter, and an array of delegate methods that should be registered as observers and called each time a tweet is read from the stream. It then:

- Connects to the StreamInsight server and creates a CEP application named twitterApp.
- Creates a source based on the TweetSource class described previously, and deploys it to the StreamInsight server.
- Creates an observer class that implements the IObserver<Tweet> interface and wraps each of the observer delegate methods passed as parameters, and deploys it to the StreamInsight application.
- Defines a LINQ query to filter events as they flow through the StreamInsight engine, ensuring that only tweets that are not flagged for deletion are received by the observer class.

C#
public class TwitterStreamProcessor
{
    private IDisposable process;
    private Server streamInsightServer;
    private Application streamInsightApp;
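The observer side of this pattern can be sketched as follows. The DelegateObserver class is illustrative only (the guide's actual observer implementation is not shown in full here); it demonstrates how an IObserver<T> can wrap a delegate method so that each incoming tweet is passed to a callback such as one that writes to the screen or to a file.

```csharp
using System;

// Illustrative sketch (not the guide's actual implementation): an observer
// that wraps a delegate so each incoming event is passed to a callback.
public class DelegateObserver<T> : IObserver<T>
{
    private readonly Action<T> onNext;

    public DelegateObserver(Action<T> onNext)
    {
        this.onNext = onNext;
    }

    // Invoked once per event flowing out of the StreamInsight query.
    public void OnNext(T value) { onNext(value); }

    public void OnError(Exception error) { /* log and continue */ }

    public void OnCompleted() { /* nothing to clean up in this sketch */ }
}
```

Registering such an observer against the tweet stream causes its callback to run for every tweet that passes the filter query.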
After implementing the required classes, the Blue Yonder Airlines team created a simple Windows application that calls the Start method of an instance of the TwitterStreamProcessor class, passing the names of the functions to be registered as observers. In this case the application uses two observers; one to write tweets to the user interface in real time, and another to write tweet data to a text file.

C#
this.twitterStreamProcessor.Start(
    userNameTextBox.Text,
    passwordTextBox.Text,
    WriteTweetOnScreen,
    SaveToFile);

The tweets that are captured by StreamInsight are displayed in the application user interface as shown in Figure 7, and they are also saved in a text file.
Figure 7 Capturing real-time Twitter data with StreamInsight
When a sufficient batch of tweets has been captured, the application uses the BlobManager library from the custom application described earlier to upload the text file to Windows Azure blob storage.
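A minimal sketch of such an upload call follows. The BlobManager.Upload signature here is a hypothetical stand-in for the real library's API, which is not shown in this extract; the actual block-upload network calls are omitted and only the logical destination (the data folder within the target container, as described earlier) is modeled.

```csharp
using System;
using System.IO;

// Hypothetical sketch: the real Blob Manager library's API may differ.
public static class BlobManager
{
    // Returns the logical destination of the uploaded file. The real utility
    // performs the parallel 4 MB block upload described earlier; uploaded
    // files are placed in a folder named data within the specified container.
    public static string Upload(string sourcePath, string accountName,
                                string accountKey, string containerName)
    {
        string fileName = Path.GetFileName(sourcePath);
        // (Actual block upload to Windows Azure blob storage omitted.)
        return containerName + "/data/" + fileName;
    }
}
```

When a batch file is complete, the application would make a call such as BlobManager.Upload(tweetFilePath, "BlueYonderBlobStore", blobStoreKey, "blueyonderdata").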
To discover how the team at Blue Yonder Airlines uses the captured data, see the Iterative Exploration scenario in Chapter 8 of this guide, and the related example in the associated Hands-on Labs you can download from http://wag.codeplex.com/releases/view/103405. The Visual Studio project for capturing real-time data from Twitter is provided in the Labs\Iterative Exploration\Labfiles\TwitterStream folder of the Hands-on Labs examples so that you can experiment with it yourself.
Appendix B: An Overview of Enterprise Business Intelligence

Enterprise BI is a topic in itself, and a complete discussion of the design and implementation of a data warehouse and BI solution is beyond the scope of this guide. However, you do need to understand some of its specific features, such as dimension and fact tables, slowly changing dimensions, and data cleansing, if you intend to integrate a Big Data solution such as HDInsight with your data warehouse. This appendix provides an overview of the main factors that make an enterprise BI system different from simple data management scenarios.
Even if you are already familiar with SQL Server as a platform for data warehousing and BI, you may like to review the contents of this appendix before reading about integrating HDInsight with Enterprise BI in Chapter 3, and the Enterprise BI Integration scenario described in this guide.
An enterprise BI solution commonly includes the following key elements:

- Business data sources.
- An extract, transform, and load (ETL) process.
- A data warehouse or departmental data marts.
- Corporate data models.
- Reporting and analytics.
Reporting and analytics is discussed in depth in Chapter 5 and elsewhere in the guide, and so it is not included in this appendix.
Data flows from business applications and other external sources through an ETL process and into a data warehouse. Corporate data models are then used to provide shared analytical structures such as online analytical processing (OLAP) cubes or data mining models, which can be consumed by business users through analytical tools and reports. Some BI implementations also enable analysis and reporting directly from the data warehouse, enabling advanced users such as business analysts and data scientists to create their own personal analytical data models and reports.
Figure 1 shows the architecture of a typical enterprise BI solution.
Figure 1
Overview of a typical enterprise data warehouse and BI implementation

Business Data Sources

Most of the data in an enterprise BI solution originates in business applications as the result of some kind of business activity. For example, a retail business might use an e-commerce application to manage a product catalog and process sales transactions. In addition, most organizations use dedicated software applications for managing core business activities, such as a human resources and payroll system to manage employees, and a financial accounts application to record financial information such as revenue and profit. Most business applications use some kind of data store, often a relational database. Additionally, the organization may implement a master data management solution to maintain the integrity of data definitions for key business entities across multiple systems. Microsoft SQL Server 2012 provides an enterprise-scale platform for relational database management, and includes Master Data Services, a solution for master data management.

ETL

One of the core components in an enterprise BI infrastructure is the ETL solution that extracts data from the various business data sources, transforms the data to meet structure, format, and validation requirements, and loads the transformed data into the data warehouse. Typically, the ETL process performs an initial load of the data warehouse tables, and thereafter performs a regularly scheduled refresh cycle to synchronize the data warehouse with changes in the source systems. The ETL solution must therefore be able to detect changes in source systems in order to minimize the amount of time and processing overhead required to extract data, and intelligently insert or update rows in the data warehouse tables based on requirements to maintain historic values for analysis and reporting.
Figure 2
An SSIS data flow

In most enterprise scenarios, the ETL process extracts data from business data sources and stores it in an interim staging area before loading the data warehouse tables. This enables the ETL process to consolidate data from multiple sources before the data warehouse load operation, and to restrict access to data sources and the data warehouse to specific time periods that minimize the impact of the ETL processing overhead on normal business activity. SQL Server 2012 includes SQL Server Integration Services (SSIS), which provides a platform for building sophisticated ETL processes. An SSIS solution consists of one or more packages, each of which encapsulates a control flow for a specific part of the ETL process. Most control flows include one or more data flows, in which a pipeline-based, in-memory sequence of tasks is used to extract, transform, and load data. Figure 2 shows a simple SSIS data flow that extracts customer address data and loads it into a staging table. Notice that the data flow in Figure 2 includes a task named DQS Cleansing. This reflects the fact that an important requirement in an ETL solution is that the data being loaded into the data warehouse is as accurate and error-free as possible, so that the data used for analysis and reporting is reliable enough to base important business decisions on.
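The effect of such a cleansing step can be illustrated with a simple synonym map. The class below is a hypothetical stand-in for DQS itself (which is a full server-side service with its own tooling); it shows only the core leading-value idea: known errors and synonyms are corrected to a single canonical value.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of leading-value correction, the core idea
// behind a DQS domain: map known errors and synonyms to a leading value.
public class LeadingValueDomain
{
    private readonly Dictionary<string, string> corrections =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    public void AddSynonym(string variant, string leadingValue)
    {
        corrections[variant] = leadingValue;
    }

    // Returns the leading value for a known variant, or the input
    // unchanged when no correction is defined.
    public string Cleanse(string value)
    {
        string leading;
        return corrections.TryGetValue(value, out leading) ? leading : value;
    }
}
```

With synonyms such as "New Yok" and "NYC" mapped to "New York", every row flowing through the cleansing step arrives in the warehouse with a single consistent value.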
Figure 3
A DQS knowledge base

To understand the importance of this, consider an e-commerce site in which users enter their address details in a free-form text box. This can lead to data input errors (for example, mistyping New York as New Yok) and inconsistencies (for example, one user might enter United States, while another enters USA, and a third enters US). While these errors and inconsistencies may not materially affect the operations of the business application (it is unlikely that any of these issues would prevent the delivery of an order), they may cause inaccuracies in reports and analyses. For example, a report showing total sales by country would include three separate values for the United States, which could be misleading when displayed as a chart. To help resolve data quality issues, SQL Server 2012 includes Data Quality Services (DQS). Business users with a deep knowledge of the data, who are often referred to as data stewards, can create a DQS knowledge base that defines common errors and synonyms that must be corrected to the leading values. For example, Figure 3 shows the DQS knowledge base referenced by the data flow in Figure 2, in which New Yok, New York City, and NYC are all corrected to the leading value New York.

The Data Warehouse

A data warehouse is a central repository for the data extracted by the ETL processes, and provides a single, consolidated source of data for reporting and analysis. In some organizations the data warehouse is implemented as a single database containing data for all aspects of the business, while in others the data may be factored out into smaller departmental data marts.

Data Warehouse Schema Design

Regardless of the specific topology used, most data warehouses are based on a dimensional design in which numeric business measures such as revenue, cost, stock units, account balances, and so on are stored in business process-specific fact tables.
For example, measures related to sales orders in the retail business discussed previously might be stored in a fact table named FactInternetSales. The facts are stored at a specific level of granularity, known as the grain of the fact table, that reflects the lowest level at which you need to aggregate the data. In the example of the FactInternetSales table, you could store a record for each order, or you could enable aggregation at a lower level of granularity by storing a record for each line item within an order. Measures related to stock management might be stored in a FactStockKeeping table, and measures related to manufacturing production might be stored in a FactManufacturing table.

Records representing the business entities by which you want to aggregate the facts are stored in dimension tables. For example, records representing customers might be stored in a table named DimCustomer, and records representing products might be stored in a table named DimProduct. Additionally, most data warehouses include a dimension table to store time period values, for example a row for each day to which a fact record may be related.

The dimension tables are conformed across the fact tables that represent the business processes being analyzed. That is to say, the records in a dimension table (for example, DimProduct) represent an instance of a business entity (in this case, a product) that can be used to aggregate facts in any fact table that involves the same business entity. For example, the same DimProduct records can be used to represent products in the business processes represented by the FactInternetSales, FactStockKeeping, and FactManufacturing tables.

To optimize query performance, a dimensional model uses minimal joins between tables. This approach results in a denormalized database schema in which data redundancy is tolerated in order to reduce the number of tables and relationships.
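To see how this design supports aggregation, consider the following sketch, which joins fact rows to a product dimension and totals a measure by a dimension attribute, in the way a query against FactInternetSales and DimProduct would. The class and property names are illustrative fragments, not a complete schema.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative star-schema fragment (hypothetical column subset).
public class DimProduct
{
    public int ProductKey { get; set; }     // surrogate key
    public string Category { get; set; }
}

public class FactInternetSales
{
    public int ProductKey { get; set; }     // foreign key to DimProduct
    public decimal SalesAmount { get; set; }
}

public static class StarSchemaQuery
{
    // Aggregates the fact measure by a dimension attribute. The single
    // fact-to-dimension join is the shape that a star schema optimizes.
    public static Dictionary<string, decimal> TotalSalesByCategory(
        IEnumerable<FactInternetSales> facts, IEnumerable<DimProduct> products)
    {
        return facts
            .Join(products, f => f.ProductKey, p => p.ProductKey,
                  (f, p) => new { p.Category, f.SalesAmount })
            .GroupBy(x => x.Category)
            .ToDictionary(g => g.Key, g => g.Sum(x => x.SalesAmount));
    }
}
```

Because the same conformed DimProduct rows would join to FactStockKeeping or FactManufacturing in exactly the same way, analyses across different business processes aggregate by the same product definitions.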
The ideal dimensional model is a star schema in which each fact table is related directly to its dimension tables as shown in Figure 4.
Figure 4
A star schema for a data warehouse

In some cases there may be compelling reasons to partially normalize a dimension. One reason for this is to enable different fact tables to be related to attributes of the dimension at different levels of grain; for example, so that sales orders can be aggregated at the product level while marketing campaigns are aggregated at the product category level. Another reason is to enable the same attributes to be shared across multiple dimensions; for example, storing geographical fields such as country or region, state, and city in a separate geography dimension table that can be used by both a customer dimension table and a sales region dimension table. This partially normalized design for a data warehouse is known as a snowflake schema, as shown in Figure 5.
Figure 5
A snowflake schema for a data warehouse

Retaining Historical Dimension Values

Note that dimension records are uniquely identified by a key value. This key is usually a surrogate key that is specific to the data warehouse, and not the same as the key value used in the business data source where the dimension data originated. The original business key for the dimension records is usually retained in an alternate key column. The main reason for generating a surrogate key for dimension records is to enable the retention of historical versions of the dimension members.

For example, suppose a customer in New York registers with the retailer discussed previously and makes some purchases. After a few years, the customer may move to Los Angeles and then make some more purchases. In the e-commerce application used by the website, the customer's profile record is simply updated with the new address. However, changing the address in the customer record in the data warehouse would cause historical reporting of sales by city to become inaccurate. Sales that were originally made to a customer in New York would now be counted as sales in Los Angeles. To avoid this problem, a new record for the customer is created with a new, unique surrogate key value. The new record has the same alternate key value as the existing record, indicating that both records relate to the same customer, but at different points in time. To keep track of which record represents the active instance of the customer at a specific time, the dimension table usually includes a Boolean column to indicate the current record, and often includes start and end date columns that reflect the period to which the record relates. This is shown in the following table.

CustKey  AltKey  Name           City         Current  Start       End
1001     212     Jesper Aaberg  New York     False    2008-01-01  2012-08-01
4123     212     Jesper Aaberg  Los Angeles  True     2012-08-01  NULL

This approach for retaining historical versions of dimension records is known as a slowly changing dimension (SCD). When implementing an ETL process that loads dimension tables, and when integrating analytical data (such as data obtained from HDInsight) with dimension records in a data warehouse, it is important to understand that an SCD might contain multiple dimension records for the same business entity.
Strictly speaking, the example shown in the table is a Type 2 slowly changing dimension. Some dimension attributes can be changed without retaining historical versions of the records, and these are known as Type 1 changes.
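The Type 2 handling described above can be sketched in code. This is a simplified in-memory illustration with hypothetical types, not production ETL logic: when a tracked attribute changes, the current record is expired and a new record with a fresh surrogate key becomes the current one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified illustration of a Type 2 slowly changing dimension row.
public class CustomerDim
{
    public int CustKey { get; set; }     // surrogate key, warehouse-specific
    public int AltKey { get; set; }      // business key from the source system
    public string Name { get; set; }
    public string City { get; set; }
    public bool Current { get; set; }
    public DateTime Start { get; set; }
    public DateTime? End { get; set; }   // null while the record is current
}

public static class ScdType2
{
    // Expires the current record for the business key and inserts a new
    // current record. (Max+1 surrogate key generation is a simplification;
    // a real warehouse would use an identity column or key sequence.)
    public static void ApplyCityChange(List<CustomerDim> dim, int altKey,
                                       string newCity, DateTime changeDate)
    {
        var current = dim.Single(r => r.AltKey == altKey && r.Current);
        if (current.City == newCity) return;   // no Type 2 change to record

        current.Current = false;
        current.End = changeDate;

        dim.Add(new CustomerDim
        {
            CustKey = dim.Max(r => r.CustKey) + 1,
            AltKey = altKey,           // same business key as the old record
            Name = current.Name,
            City = newCity,
            Current = true,
            Start = changeDate,
            End = null
        });
    }
}
```

After applying the change from the example table, the dimension holds two rows for alternate key 212: the expired New York row and a new current Los Angeles row.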
Implementing a Data Warehouse with SQL Server

SQL Server 2012 Enterprise Edition includes a number of features that make it a good choice for data warehousing. In particular, the SQL Server query engine can identify star schema joins and optimize query execution plans accordingly. Additionally, SQL Server 2012 introduces support for columnstore indexes, which use an in-memory compression technique to dramatically improve the performance of typical data warehouse queries. For extremely large enterprise data warehouses that need to store huge volumes of data, Microsoft has partnered with hardware vendors to offer an enterprise data warehouse appliance. The appliance uses a special edition of SQL Server called Parallel Data Warehouse (PDW), in which queries are distributed in parallel across multiple processing and storage nodes in a shared-nothing architecture. This technique is known as massively parallel processing (MPP), and offers extremely high query performance and scalability.
For more information about SQL Server Parallel Data Warehouse, see Appliance: Parallel Data Warehouse (PDW) on the SQL Server website.

Corporate Data Models

A data warehouse is designed to structure business data in a schema that makes it easy to write queries that aggregate data, and many organizations perform analysis and reporting directly from the data warehouse tables. However, most enterprise BI solutions include analytical data models as a layer of abstraction over the data warehouse to improve the performance of complex analytical queries, and to define additional analytical structures such as hierarchies of dimension attributes. For example, a data model may aggregate sales by customer at the country or region, state, city, and individual customer levels, as well as exposing key performance indicators (KPIs) such as a score that compares actual sales revenue to targeted sales revenue. These data models are usually created by BI developers and used as a data source for corporate reporting and analytics. Analytical data models are generically known as cubes, because they represent a multidimensional view of the data warehouse tables in which data can be sliced and diced to show aggregated measure values across multiple dimension members. SQL Server 2012 Analysis Services (SSAS) supports the following two data model technologies:

Multidimensional databases. Multidimensional SSAS databases support traditional OLAP cubes that are stored on disk, and are compatible with SSAS databases created with earlier releases of SQL Server Analysis Services. Multidimensional data models support a wide range of analytical features. A multidimensional database can also host data mining models, which apply statistical algorithms to datasets in order to perform predictive analysis.

Tabular databases. Tabular databases were introduced in SQL Server 2012 and use an in-memory approach to analytical data modeling.
Tabular data models are based on relationships defined between tables, and are generally easier to create than multidimensional models while offering comparable performance and functionality. Tabular models support most (but not all) of the capabilities of multidimensional models. Additionally, they provide a scale-up path for personal data models created with PowerPivot in Microsoft Excel.
When you install SSAS, you must choose between multidimensional and tabular modes. An instance of SSAS installed in multidimensional mode cannot host tabular databases, and conversely a tabular instance cannot host multidimensional databases. You can however install two instances of SSAS (one in each mode) on the same server.
Figure 6 shows a multidimensional data model project in which a cube contains a Date dimension. The Date dimension includes two hierarchies, one for fiscal periods and another for calendar periods.
Figure 6 A multidimensional data model project
Figure 7 shows a tabular data model that includes the same dimensions and hierarchies as the multidimensional model in Figure 6.
Figure 7 A tabular data model project
This appendix has described only the key aspects of data warehousing and BI that are necessary to understand how to integrate HDInsight into a BI solution. If you want to learn more about building BI solutions with Microsoft SQL Server and other components of the Microsoft BI platform, consider referring to The Microsoft Data Warehouse Toolkit (Wiley) or attending Microsoft Official Curriculum courses 10777 (Implementing a Data Warehouse with Microsoft SQL Server 2012), 10778 (Implementing Data Models and Reports with Microsoft SQL Server 2012), and 20467 (Designing Business Intelligence Solutions with Microsoft SQL Server 2012). You can find details of all Microsoft training courses at Training and certification for Microsoft products and technologies on the Microsoft.com Learning website.