
Big Data on Microsoft Platform

_______________________________________

Prepared by GJ Srinivas, Corporate TEG, Microsoft


Contents
1. What is Big Data?
2. Characteristics of Big Data
3. Enter Hadoop
4. Microsoft Big Data Solutions
5. Microsoft Big Data Solutions for Hadoop
6. Hadoop Data Import/Export on Microsoft Platform
7. Conclusion


What is Big Data?


Data is growing over time, and the growth is unprecedented. There was a time when enterprise-level organizations dealt with a few GBs of data, which was itself huge and challenging to manage. Times have changed, and the need for managing data has moved from terabytes to petabytes. Big Data is data that is too big to manage by conventional means: the term is applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a certain elapsed time.

Big Data is in some cases new, and in some cases it has been there all along. It is all about and around the data explosion and large volumes, but it has other facets too. When we talk about Big Data, we mean both structured and unstructured data. In many cases it is unstructured or semi-structured: for example, weblogs; data coming from machines, plants, or energy sensors; RFID; social data; call detail records; astronomy; atmospheric science; genomics; biogeochemical, biological, and medical records; photography or video archives; and so on. These are the types of data that people want to analyze and make decisions on. Gartner says of the world of data: "By 2012, organizations that build a modern information management system will outperform their peers financially by 20%."

Characteristics of Big Data


Big Data has multiple facets, typically called the 4 Vs of Big Data. These characterize Big Data:

Volume: exceeds the physical limits of vertical scalability.
Velocity: the decision window is small compared to the rate of data change.
Variety: many different formats make integration expensive.
Variability: not the same as Variety; many options or variable interpretations confound analysis.

A typical Big Data pattern starts with holding data in a digital-box-like store, where all potentially valuable data is retained and analyzed later. Once the data is stored, it is mined for insight and fed to downstream systems. This information production works like a factory for consumable feeds: the data is discovered, enriched, and published. Finally, the loop is optimized by analyzing ambient data, monitoring the system, and tuning its behavior.

Enter Hadoop
Hadoop is the most visible face of Big Data. It is a technology developed in the open-source Apache community that enables analysis of structured and semi-structured data distributed across commodity servers within a cluster.


Hadoop has two major pillars: storage and processing.

Storage (the Hadoop Distributed File System layer): HDFS is all about storage. It is the infrastructure Hadoop implements to store data across multiple nodes, and it handles failure scenarios so the system is resilient to server, node, or job failures.

Processing (the MapReduce layer): Every query in the Hadoop system is a job, and each job creates a set of tasks. These tasks process the data, shuffle the intermediate results, reduce and aggregate them, and then bring the results back to the client application. The MapReduce layer consists of the JobTracker and TaskTrackers, which work in tandem with HDFS. The beauty of Hadoop is that it distributes the data, computes locally where the data lives, and then brings the results back.

In short: Hadoop = HDFS + MapReduce.
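To make the two layers concrete, here is a minimal sketch of the canonical word-count job written against the Hadoop Java MapReduce API. The input and output paths are illustrative and would point at HDFS directories on the cluster.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs locally on the nodes holding each block of the input.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    // Reduce phase: receives all counts for a given word after the shuffle.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The map tasks run locally on the nodes that hold the input blocks, the framework shuffles the intermediate (word, 1) pairs, and the reduce tasks aggregate them, which is exactly the distribute-the-data-and-compute-locally pattern described above.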

Microsoft Big Data Solutions


Microsoft provides a range of solutions to address Big Data challenges. Its data warehouse offerings, SQL Server 2008 R2/2012, Fast Track Data Warehouse, Business Data Warehouse, and Parallel Data Warehouse (PDW), offer a robust and scalable platform for storing and analyzing data in a traditional data warehouse. Among these, PDW offers customers enterprise-class performance that handles massive volumes of over 600 TB. Microsoft also provides LINQ to HPC, which enables the development and deployment of data-intensive applications that use technologies such as .NET and Language Integrated Query (LINQ) to express their data analytics algorithms. LINQ to HPC can integrate with SQL Server 2008 R2 and the portfolio of BI offerings from Microsoft such as SSRS, SSAS, PowerPivot, and Excel.

Microsoft Big Data Solutions for Hadoop


Hadoop is not a Microsoft solution. In order to work with Apache Hadoop, Microsoft has come up with three Hadoop deployment models on the Windows platform:

1. On premise
2. On cloud
3. Elastic MapReduce


Figure 1: Microsoft Big Data Solution

On-premise distribution: This is a dedicated on-premise cluster. If you want to run Hadoop on standalone or distributed Windows servers located in your own infrastructure, the Hadoop services for on-premise distribution provide that option. This distribution offers integration with your Active Directory and System Center, which makes it an enterprise-manageable setup.

On-cloud distribution: This is a dedicated cloud cluster. The Hadoop services are hosted in the cloud and do not go away, and the distribution provides some elasticity based on how you set up storage and partitioning. This is essentially about running Hadoop on Azure. Hadoop-based services for Windows Azure are available by invitation; more on this can be found at www.hadooponazure.com.

Elastic MapReduce (EMR): Setting up Apache Hadoop in your own infrastructure is not simple. EMR makes this easy, and the Elastic MapReduce model is the most elastic: you get a Hadoop cluster up and running very fast, run the calculations, pull back the results, and then let the cluster go. It is a simple deployment and management process: the cluster is self-managed on Azure, while you manage the actual processing.

Hadoop Data Import/Export on Microsoft Platform


Data can be imported into or exported from the Hadoop storage system in various ways from the Microsoft platform. Before transferring data, the FTPS and ODBC server ports must be opened: the FTPS port 2226 is required if you want to remote into the Hadoop system and work on it, and the ODBC server port is for connecting from end-user tools like Power View, PowerPivot, and Excel. Once the ports are open, data can be brought into the cluster through various interfaces. For example, from the Manage Cluster interface, data can be brought in from Windows Azure Blob storage and data markets. A couple of other ways are the JavaScript console and the Sqoop connectors.

Interacting with Hadoop

The traditional way to interact with Hadoop is to write your own jobs in Java. If you are not interested in understanding at that level how the MapReduce job is executed, you can use JavaScript: the JavaScript console is an interactive development environment with charting and graphing. The Hive and Pig interfaces also help in interacting with Hadoop, and the Sqoop connectors help in transferring data between relational and Hadoop storage systems.

The Hive interface will interest data professionals because it uses a SQL-like language called HiveQL. We can create structures over the data, which builds a metadata layer. For example, weblogs can be structured in table format, and you can then interact with them through common tools like Excel, Power View, or PowerPivot. Hive, in other words, is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. More details on the Hive project, with examples, documentation, and use cases, can be found at http://hive.apache.org
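As a sketch of what structuring weblogs with HiveQL and querying them can look like, the snippet below drives Hive from Java through JDBC. It assumes the HiveServer1-era driver class and connection URL; the host name, table schema, and HDFS path are hypothetical placeholders, not taken from any Microsoft documentation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWeblogSummary {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer1-era JDBC driver (assumed class name for this era).
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Host name is a placeholder; Hive's Thrift server defaults to port 10000.
        Connection con = DriverManager.getConnection(
                "jdbc:hive://headnode:10000/default", "", "");
        Statement stmt = con.createStatement();

        // Lay a table structure (the metadata layer) over raw, tab-delimited
        // weblog files already sitting in HDFS. Schema and path are hypothetical.
        stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS weblogs "
                + "(ip STRING, ts STRING, url STRING) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
                + "LOCATION '/data/weblogs'");

        // Ad-hoc summarization: hits per URL. Hive compiles this to MapReduce.
        ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(1) AS hits FROM weblogs GROUP BY url");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}
```

The CREATE EXTERNAL TABLE statement is the metadata layer mentioned above: the raw files stay where they are in HDFS, and Hive compiles the SELECT into MapReduce jobs at query time.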

Pig is another data flow language, more like a scripting language, for transforming and analyzing large data sets on Hadoop. It is an abstraction layer for interacting with Hadoop: you write scripts that get converted into MapReduce jobs. Those who are familiar with SQL tend to love and embrace Hive first; those who are familiar with scripting languages tend to love and embrace Pig first. Pig can be extended through user-defined functions and methods. More can be learned from http://pig.apache.org/. You can also use other languages, like Perl, to get at the data and work on it.

Sqoop helps in pulling data from SQL Server into Hadoop or pulling data from Hadoop into SQL Server; every Sqoop connector is bi-directional. It executes MapReduce jobs to transfer data in parallel, with fault tolerance, into and out of HDFS. These connectors can be downloaded from http://bit.ly/rulsiX.
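The following is a minimal sketch of the Pig flow described above, driven from Java through Pig's embedded PigServer API. The input path, schema, and filter expression are hypothetical; the Pig Latin itself is passed as plain strings.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWeblogErrors {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster in MapReduce mode (ExecType.LOCAL tests locally).
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each registerQuery line is plain Pig Latin; PigServer builds the plan.
        pig.registerQuery("logs = LOAD '/data/weblogs' USING PigStorage('\\t') "
                + "AS (ip:chararray, ts:chararray, url:chararray);");
        pig.registerQuery("errors = FILTER logs BY url MATCHES '.*error.*';");

        // Storing an alias triggers compilation of the script into MapReduce jobs.
        pig.store("errors", "/data/weblog_errors");
    }
}
```

Each registerQuery call only builds up the logical plan; it is the store call that triggers the conversion into MapReduce jobs on the cluster, as described above.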

Conclusion
The three flavors of Microsoft Hadoop distributions fulfill various analysis and decision-support needs on unstructured and semi-structured data. Integrating semi-structured and unstructured data with the RDBMS through the Sqoop connectors extends the scope of analysis. With the various connectors for Microsoft SQL Server and Parallel Data Warehouse, along with the end-user ODBC connectors, Microsoft has made Hadoop accessible to a broader class of end users, developers, and IT professionals.


References

Figure 1 source: http://download.microsoft.com/download/1/8/B/18BE3550-D04C-4B3F-9310-F8BC1B62D397/MicrosoftBigDataSolutionSheet.pdf

