
Module 9: Processing Distributed Data with Apache Spark

www.edureka.co/big-data-and-hadoop
Course Topics
 Module 1: Understanding Big Data and Hadoop
 Module 2: Hadoop Architecture and HDFS
 Module 3: Hadoop MapReduce Framework
 Module 4: Advanced MapReduce
 Module 5: PIG
 Module 6: HIVE
 Module 7: Advanced HIVE and HBase
 Module 8: Advanced HBase
 Module 9: Processing Distributed Data with Apache Spark
 Module 10: Oozie and Hadoop Project

Objectives
At the end of this module, you will be able to understand:

 What is Apache Spark?

 Spark Ecosystem

 Spark Components

 History of Spark and Spark Versions/Releases

 Spark – A Polyglot

 What is Scala?

 Why Scala?

 SparkContext

 RDD

What is Apache Spark?
 An in-memory cluster computing platform

 Runs lightning fast by performing computations in memory

 An extension of MapReduce, one of the most widely used data processing frameworks

 A unified stack

 Provides a rich set of tools:
   i) Spark SQL
   ii) Spark Streaming
   iii) GraphX
   iv) MLlib

 Runs on Windows and UNIX systems

Why Apache Spark?
 Ability to run fast analytics on large data, in memory, across a cluster of machines

 As a unified stack, it makes it easy to adopt new components alongside existing ones

 One of the most powerful features of Spark is that it integrates easily with Hadoop clusters and Cassandra

 Applications can be built and run that combine different models

 Provides support for different data formats such as JSON, Parquet, Hive tables, etc.

 Provides an interactive shell

 Spark Streaming integrates easily with high-throughput frameworks like Kafka

Spark – Ecosystem

 Libraries built on top of Spark Core: Spark SQL, Spark Streaming, GraphX, MLlib and SparkR

 Cluster managers underneath Spark Core: Spark Standalone Cluster, Hadoop Cluster (YARN) and Apache Mesos
Spark Components
Spark Core

» The execution engine of the Spark platform; all other Spark functionality is built on top of it
» Provides the APIs that define Resilient Distributed Datasets (RDDs)
» Supports Java, Scala, Python and R APIs for developing a wide range of applications

Spark SQL

» Spark package on top of Spark Core for working with structured data
» Provides interactive SQL queries to explore data
» Defines DataFrames – distributed collections of data grouped into named columns (see the sketch below)
» Also supports the Apache Hive SQL dialect (HQL)
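As an illustration of the DataFrame and SQL points above, here is a minimal sketch of the kind of code you could type into the Spark shell (Spark 1.x, where sc and sqlContext are already provided). The input file people.json is a hypothetical example, assumed to contain one JSON object per line.

  // Load JSON into a DataFrame: a distributed collection of rows grouped into named columns
  val people = sqlContext.read.json("people.json")   // hypothetical input file
  people.printSchema()

  // Register the DataFrame as a temporary table and query it with SQL
  people.registerTempTable("people")
  val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
  adults.show()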

Spark Components
Spark GraphX

» Graph computation engine built on top of Spark
» Provides an efficient Spark API for building graphs
» Extends Spark RDDs for graph-parallel computation
» Extensive set of built-in algorithms that ease graph analytics tasks

Spark Streaming

» Built on top of Spark to process and analyze streams of data in real time
» Discretized stream processing: a stream of data is represented as a sequence of RDDs – a DStream
» Ingests data from sources such as Kafka, Flume, HDFS/S3, Kinesis and Twitter
» Integrates easily with high-throughput, low-latency frameworks like Kafka and Flume (see the sketch below)
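A minimal Spark Streaming sketch in Scala, for illustration only: it assumes text is being pushed to port 9999 on localhost (for example with nc -lk 9999) and that it runs in the Spark shell, where sc already exists.

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(5))         // micro-batches every 5 seconds
  val lines = ssc.socketTextStream("localhost", 9999)    // a DStream: a sequence of RDDs
  val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  counts.print()                                         // print a few counts per batch

  ssc.start()                                            // start receiving and processing
  ssc.awaitTermination()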

Spark Components
Spark MLlib

» A scalable machine learning library built on top of Spark
» Uses Breeze, the ScalaNLP library for scientific computing, machine learning and natural language processing
» Ships extensive algorithms, including regression, clustering and many others, in the spark.mllib and spark.ml packages (see the sketch below)

SparkR

» An R interface built on top of Spark Core, integrated with Spark since version 1.4
» Exposes components such as DataFrames and RDDs to R
» Provides a light-weight front end
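As a small illustration of the RDD-based spark.mllib API mentioned above, the following sketch clusters points with K-means in the Spark shell. The input file points.txt is a hypothetical example with space-separated numeric features per line.

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  val data   = sc.textFile("points.txt")   // hypothetical input
  val parsed = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

  val model = KMeans.train(parsed, 3, 20)  // k = 3 clusters, 20 iterations
  println("Cluster centers: " + model.clusterCenters.mkString(", "))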

Spark – A Polyglot
Apache Spark applications can be developed in several languages: Scala, Java, Python and R

Spark – History
 Developed at UC Berkeley's AMPLab in 2009

 Started by Matei Zaharia and open sourced in 2010

 Written in the functional programming language Scala and uses Akka

 Developed along with Mesos, a cluster management framework

 Transferred to the Apache Software Foundation in 2013

 Spark went on to become a top-level Apache project in February 2014

 Spark was developed targeting interactive, iterative computations such as machine learning

Spark Versions & Releases

Spark 0.6.0 – 5 October 2012
» Major release with a focus on performance enhancements
» Addition of a standalone deploy mode, extended Java and Python support, etc.

Spark 0.8.0 – 25 September 2013
» First release as part of Apache
» Introduction of MLlib and support for running Spark in IPython

Spark 1.0.0 – 30 May 2014
» API stability, integration with YARN security
» Spark SQL; MLlib, GraphX and Streaming improvements

Spark 1.2.0 – 18 December 2014
» Performance and stability improvements
» Introduction of a Python API for Spark Streaming, and high-availability support through a write-ahead log (WAL)

Spark 1.4.0 – 11 June 2015
» Inclusion of SparkR, based on Spark DataFrames
» Enhanced support for Kafka and Kinesis

Spark 1.5.0 – 9 September 2015
» Sixth release in the 1.x line, with support for YARN cluster mode in R and the introduction of new feature transformers

Annie’s Question

Question: Apache Spark is not compatible with existing Hadoop infrastructure.

a) True
b) False

Annie’s Answer

Answer: False

Annie’s Question

Question: What are the different cluster managers supported by Apache Spark?

a) Apache Mesos
b) Spark Standalone Cluster
c) YARN
d) All of them

Annie’s Answer

Answer: All of them

Demo
Download & Configure Spark

Demo
Invoking Spark Shell

What is Scala?
 A general-purpose, scalable programming language

 Designed to express common programming patterns in a concise, elegant and type-safe way

 Supports both object-oriented and functional programming styles, thus helping programmers be more productive

 Publicly released in January 2004 on the JVM platform, and a few months later on the .NET platform

 Martin Odersky and his team started developing Scala in 2001

Why Scala?
 Type inference

 Runs on the JVM, which means Scala is completely interoperable with Java

 Allows growing new types and control constructs

 Both functional and object-oriented

 Allows you to define implicit conversions (see the snippets below)

 Has very powerful libraries and APIs

 Many powerful frameworks, such as Spark and Kafka, are written in Scala
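A few small, illustrative snippets of these points (type inference, Java interoperability and implicit conversions); the conversion shown is a made-up example, not part of any library.

  import scala.language.implicitConversions

  val n = 42                                  // the type Int is inferred, no annotation needed
  val names = List("spark", "kafka")          // inferred as List[String]

  // Java interop: any Java class on the classpath can be used directly
  val sb = new java.lang.StringBuilder("hello").append(", scala")

  // A user-defined implicit conversion (hypothetical)
  implicit def intToString(i: Int): String = i.toString
  val s: String = 7                           // the compiler inserts intToString(7)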

Functional Programming
 Apart from being a pure object-oriented language, Scala is also a functional programming language

 Functional programming is driven mainly by two ideas (see the sketch after this list):

» The first is that functions are first-class values: they are treated just like any other type, say String or Int, so functions can be passed as arguments and defined inside other functions

» The second is that the operations of a program should map input values to output values rather than change data in place, which leads to immutable data structures

 Scala supports both immutable and mutable data structures; however, the preferred choice is the immutable ones
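A small Scala sketch of the two ideas above; the names are illustrative only.

  // 1) Functions are first-class values: they can be stored in vals and passed as arguments
  val double: Int => Int = x => x * 2
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
  applyTwice(double, 5)                      // 20

  // 2) Map inputs to outputs instead of changing data in place
  val xs = List(1, 2, 3)
  val ys = xs.map(_ + 1)                     // List(2, 3, 4); xs itself is unchanged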

Scala REPL

 REPL: Read – Evaluate – Print – Loop

 The easiest way to get started with Scala; it acts as an interactive shell

 Even though it behaves like an interpreter, all typed code is compiled to bytecode and executed on the JVM

 Invoked by typing scala at the command prompt

Scala REPL Explained

 After you type an expression, such as 10 + 2, and hit enter:

scala> 10 + 2

 The interpreter will print:

res0: Int = 12

 This line includes:

» An automatically generated or user-defined name to refer to the computed value (res0, which means result 0)
» A colon (:), followed by the type of the expression (Int)
» An equals sign (=)
» The value resulting from evaluating the expression (12)
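A couple more lines typed into the REPL, for illustration, showing both the automatically generated names (res0, res1, …) and a user-defined name:

scala> 10 + 2
res0: Int = 12

scala> res0 * 2
res1: Int = 24

scala> val greeting = "Hello, Spark"
greeting: String = Hello, Spark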

Annie’s Question

Question: Even though Scala is a functional language, it does not support the object-oriented paradigm.

i) True
ii) False

Annie’s Answer

Answer: False

Demo
Execute Scala Code in Spark Shell

SparkContext
 Every Spark program starts by creating a SparkContext object, which tells Spark how to access the cluster

 Represented by the sc variable in the Spark shell

 The master parameter of a SparkContext determines which cluster to use (see the sketch after the table):

Master               Description
local                Run Spark locally with one worker thread (no parallelism)
local[K]             Run Spark locally with K worker threads (ideally set to the number of cores)
spark://HOST:PORT    Connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://HOST:PORT    Connect to a Mesos cluster; PORT depends on config (5050 by default)
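In the Spark shell the sc variable is created for you; in a standalone Scala program you create it yourself, roughly as in the sketch below. The application name and the local[2] master URL are assumptions for illustration; substitute a spark:// or mesos:// URL as needed.

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("my-spark-app")        // hypothetical application name
    .setMaster("local[2]")             // run locally with 2 worker threads
  val sc = new SparkContext(conf)

  // ... define and act on RDDs here ...

  sc.stop()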

SparkContext - Master

RDD (Resilient Distributed Datasets)
 Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner

 RDDs can be created from many data sources, e.g. a Scala collection, the local file system, Hadoop, Amazon S3, an HBase table, etc.

 Spark supports text files, SequenceFiles and any other Hadoop InputFormat, and can also take a directory or a glob (e.g. /ip/2014*)

 Even when an RDD is defined, it does not yet contain any data

 The computation that creates the data in an RDD is only done when the data is referenced, e.g. when caching results or writing out the RDD

 This is a very important feature for the development process, as code can be written, compiled and even run on a development environment without actually loading the original data (see the sketch below)
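A small sketch of this lazy behaviour in the Spark shell (where sc already exists); data.txt is a hypothetical input file.

  val lines  = sc.textFile("data.txt")             // RDD defined, but no data is read yet
  val errors = lines.filter(_.contains("ERROR"))   // still only a recipe (a transformation)

  errors.cache()                                   // keep the result in memory once computed
  val n = errors.count()                           // an action: only now is data.txt actually read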

Annie’s Question

Question: Creation of a SparkContext object is optional in a Spark program.

i) True
ii) False

Annie’s Answer

Answer: False

Demo
Explore the Spark cluster

Demo
Execute word count on a local Spark cluster (see the sketch below)
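The demo code itself is not reproduced in these slides; a typical word count in the Spark shell looks like the following sketch. The input and output paths are assumptions; for the HDFS variant later in the demos, an hdfs://... path would be used instead.

  val text   = sc.textFile("file:///home/edureka/input.txt")   // hypothetical input path
  val counts = text.flatMap(_.split("\\s+"))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)

  counts.take(10).foreach(println)                             // inspect a few results
  counts.saveAsTextFile("file:///home/edureka/wc-output")      // or an hdfs:// path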

Demo
Execute word count on a Spark standalone cluster

Demo
Execute word count on a Spark standalone cluster with HDFS as the file system

Demo
Compare the performance of word count on MapReduce and Spark

Questions

Assignment
Execute all the code discussed in the class

Explore the Spark cluster

Execute word count with multiple workers

Further Reading
http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases

http://www.edureka.co/blog/insights-on-hbase-architecture/

http://www.edureka.co/blog/sample-hbase-poc/

Pre-work
Go through the blogs mentioned in the Further Reading slide

Agenda for Next Class
 Demo on Flume and Sqoop
 Oozie
 Discussion on a Project

Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us to make
the course better!

Please spare a few seconds to take the survey after the webinar.

