Contents

1. What is Big Data?
2. Characteristics of Big Data
3. Enter Hadoop
4. Microsoft Big Data Solutions
5. Microsoft Big Data Solutions for Hadoop
6. Hadoop Data Import/Export on Microsoft Platform
7. Conclusion
Enter Hadoop
Much of Big Data is built on Hadoop; it is the most visible face of Big Data. Hadoop is a technology developed in the open-source Apache community that enables analysis of structured and semi-structured data distributed across commodity servers within a cluster.
Hadoop has two major components: storage and processing.

Storage (Hadoop Distributed File System layer): HDFS is all about storage. It is the infrastructure Hadoop uses to store data across multiple nodes. HDFS is also designed to be resilient to server, node, and job failures.

Processing (MapReduce layer): Every query in the Hadoop system is a job, and each job creates a set of tasks. These tasks process the data, shuffle the intermediate results, reduce and aggregate them, and then return the results to the client application. The MapReduce layer consists of the JobTracker and TaskTrackers, which work in tandem with HDFS. The beauty of Hadoop is that it distributes the data, computes locally where the data lives, and then brings the results back.

Hadoop = HDFS + MapReduce
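The map → shuffle → reduce flow described above can be sketched in miniature with plain Python. This is a single-process simulation of the idea, not the actual Hadoop API; the function names and sample data are illustrative only.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the grouped values for one key."""
    return (key, sum(values))

# In real Hadoop, each line would be processed on the node that stores it;
# here everything runs in one process to show the data flow.
lines = ["big data on hadoop", "hadoop is hdfs plus mapreduce"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
results = dict(reduce_phase(k, v) for k, v in grouped.items())
print(results["hadoop"])  # -> 2 ("hadoop" appears twice across the input)
```

The same three stages scale out in Hadoop because map and reduce each operate on independent slices of the data.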
On-premise distribution: This is a dedicated on-premise cluster. If you want to run Hadoop on standalone or distributed Windows servers located in your own infrastructure, the on-premise distribution of Hadoop services provides that option. This distribution offers integration with your Active Directory and System Center, which makes it an enterprise-manageable setup.

On-cloud distribution: This is a dedicated cloud cluster. The Hadoop services are hosted in the cloud and persist; they do not go away. This distribution provides some elasticity depending on how you set up storage and partitioning. In short, this is about running Hadoop on Azure. Hadoop-based services for Windows Azure are available by invitation; more on this can be found at www.hadooponazure.com.

Elastic MapReduce (EMR): Setting up Apache Hadoop in your own infrastructure is not simple. EMR makes this easy, and the Elastic MapReduce model is fully elastic: you bring a Hadoop cluster up quickly, run the calculations, pull back the results, and then let the cluster go. It is a simple deployment model; the infrastructure is self-managed on Azure, while you manage the actual processing.
Port 2226 is required if you want to remote desktop into the Hadoop system and work on it. The ODBC Server port is for connecting from end-user tools such as Power View, PowerPivot, and Excel. Once the ports are open, data can be brought into the cluster through various interfaces. For example, from the Manage Cluster interface, data can be brought in from Windows Azure blob storage and data markets. A couple of other ways are the JavaScript console and SQOOP connectors.

Interacting with Hadoop

The traditional way to interact with Hadoop was to write your own jobs in Java. If you are not interested in understanding at that level how a MapReduce job is executed, you can use JavaScript: the JavaScript console is an interactive development environment that also provides charting and graphing. The HIVE and PIG interfaces also help in interacting with Hadoop, and SQOOP connectors help in transferring data between relational and Hadoop storage systems. The HIVE interface will interest data professionals because it uses a language called HiveQL. We can create structures over the data, which builds a metadata layer. For example, weblogs can be structured in table format, and you can then interact with them using common tools such as Excel, Power View, or PowerPivot. HIVE, in other words, is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. More details on the Hive project, along with examples, documentation, and use cases, can be found at http://hive.apache.org
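Hive's core idea, projecting a table structure over raw log data and then querying it with SQL, can be sketched in miniature with Python's built-in sqlite3 module. This is an analogy rather than Hive itself; the weblog fields and the query are illustrative assumptions, not a real HiveQL session.

```python
import sqlite3

# Raw weblog lines: the "files sitting in HDFS" of this miniature analogy.
raw_logs = [
    "2012-03-01 /index.html 200",
    "2012-03-01 /missing.html 404",
    "2012-03-02 /index.html 200",
]

conn = sqlite3.connect(":memory:")
# CREATE TABLE plays the role of Hive's metadata layer: a schema
# projected over otherwise unstructured text.
conn.execute("CREATE TABLE weblogs (day TEXT, page TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO weblogs VALUES (?, ?, ?)",
    (line.split() for line in raw_logs),
)

# An ad-hoc summarization query, much like one you would write in HiveQL.
rows = conn.execute(
    "SELECT page, COUNT(*) FROM weblogs GROUP BY page ORDER BY page"
).fetchall()
print(rows)
```

In Hive the table definition describes files in HDFS and the query compiles down to MapReduce jobs, but the workflow for the analyst looks much like this: define a schema once, then ask SQL-style questions.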
PIG is another data-flow language, more like a scripting language, used to transform and analyze large data sets on Hadoop. It is an abstraction layer for interacting with Hadoop: you write scripts that get converted into MapReduce jobs. Those who are familiar with SQL tend to love and embrace HIVE first; those who are familiar with scripting languages tend to love and embrace PIG first. PIG can be extended through user-defined functions and methods. More can be learned from http://pig.apache.org/ You can also use other languages, such as Perl, to get at the data and work on it. SQOOP helps in pulling data from SQL into Hadoop, or pulling data from Hadoop into SQL; every SQOOP connector is bi-directional. It executes MapReduce jobs to transfer data in parallel, with fault tolerance, into HDFS. These connectors can be downloaded from http://bit.ly/rulsiX .
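The Pig style of writing a script as a chain of named transformations (load, filter, group, count) can be approximated in plain Python. This is an analogy only, not Pig Latin syntax, and the sample records are illustrative; the comments name the Pig operator each step corresponds to.

```python
from itertools import groupby
from operator import itemgetter

# LOAD: records of (user, page), as Pig would load tuples from HDFS.
records = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/cart"),
    ("carol", "/home"),
]

# FILTER: keep only the visits to /home.
home_visits = [r for r in records if r[1] == "/home"]

# GROUP BY page, then COUNT each group, like Pig's GROUP + COUNT.
home_visits.sort(key=itemgetter(1))
counts = {page: sum(1 for _ in group)
          for page, group in groupby(home_visits, key=itemgetter(1))}
print(counts)  # -> {'/home': 3}
```

In real Pig each intermediate relation (the filtered set, the grouped set) is a named step in the script, and the whole chain compiles to MapReduce jobs rather than running in one process.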
Conclusion
The three flavors of Microsoft Hadoop distributions fulfill a variety of analysis and decision-support needs on unstructured and semi-structured data. Integrating semi-structured and unstructured data with the RDBMS through SQOOP connectors extends the scope of that analysis. With the various connectors for Microsoft SQL Server and Parallel Data Warehouse, along with end-user ODBC connectors, Microsoft has made Hadoop accessible to a broader class of end users, developers, and IT professionals.