
Module 9: Processing Distributed Data with Apache Spark

www.edureka.co/big-data-and-hadoop
Course Topics
 Module 1: Understanding Big Data and Hadoop
 Module 2: Hadoop Architecture and HDFS
 Module 3: Hadoop MapReduce Framework
 Module 4: Advanced MapReduce
 Module 5: PIG
 Module 6: HIVE
 Module 7: Advanced HIVE and HBase
 Module 8: Advanced HBase
 Module 9: Processing Distributed Data with Apache Spark
 Module 10: Oozie and Hadoop Project

Objectives
At the end of this module, you will be able to understand:

 What is Apache Spark?

 Spark Ecosystem

 Spark Components

 History of Spark and Spark Versions/Releases

 Spark – A Polyglot

 What is Scala?

 Why Scala?

 SparkContext

 RDD

What is Apache Spark?
 An in-memory cluster computing platform

 Runs lightning fast by performing computations in memory

 An extension of MapReduce, one of the most widely used data processing frameworks

 A unified stack

 Provides a rich set of tools:
   i) Spark SQL
   ii) Spark Streaming
   iii) GraphX
   iv) MLlib

 Runs on Windows and UNIX systems

Why Apache Spark?
 Ability to run fast analytics on large data, in memory, across a cluster of machines

 As a unified stack, it makes it easy to adopt new components alongside existing ones

 One of the most powerful features of Spark is that it integrates easily with Hadoop clusters and Cassandra

 Applications can be built and run that combine different models

 Provides support for different data formats such as JSON, Parquet, Hive tables, etc.

 Provides an interactive shell

 Spark Streaming integrates easily with high-throughput frameworks like Kafka

Spark – Ecosystem

 Libraries built on top of Spark Core: Spark SQL, Spark Streaming, GraphX, MLlib and SparkR

 Cluster managers underneath Spark Core: Spark Standalone Cluster, Hadoop Cluster (YARN) and Apache Mesos
Spark Components
Spark Core

» The execution engine of the Spark platform; all other Spark functionality is built on top of it
» Provides the APIs that define Resilient Distributed Datasets (RDDs)
» Supports Java, Scala, Python and R APIs for developing a wide range of applications

Spark SQL

» Spark package on top of Spark Core for working with structured data
» Provides interactive SQL queries to explore data
» Defines DataFrames – distributed collections of data grouped into named columns (see the sketch below)
» Also supports the Apache Hive SQL dialect (HQL)
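As an illustration of the DataFrame and SQL points above, here is a minimal sketch of the kind of code you could type into the Spark shell (Spark 1.x, where sc and sqlContext are already provided). The input file people.json is a hypothetical example, assumed to contain one JSON object per line.

  // Load JSON into a DataFrame: a distributed collection of rows grouped into named columns
  val people = sqlContext.read.json("people.json")   // hypothetical input file
  people.printSchema()

  // Register the DataFrame as a temporary table and query it with SQL
  people.registerTempTable("people")
  val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
  adults.show()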

Spark Components
Spark GraphX

» Graph computation engine built on top of Spark
» Provides an efficient Spark API for building graphs
» Extends Spark RDDs for graph-parallel computation
» Extensive set of built-in algorithms that ease graph analytics tasks

Spark Streaming

» Built on top of Spark to process and analyze streams of data in real time
» Discretized stream processing: a stream of data is represented as a sequence of RDDs – a DStream
» Ingests data from sources such as Kafka, Flume, HDFS/S3, Kinesis and Twitter
» Integrates easily with high-throughput, low-latency frameworks like Kafka and Flume (see the sketch below)
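A minimal Spark Streaming sketch in Scala, for illustration only: it assumes text is being pushed to port 9999 on localhost (for example with nc -lk 9999) and that it runs in the Spark shell, where sc already exists.

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(5))         // micro-batches every 5 seconds
  val lines = ssc.socketTextStream("localhost", 9999)    // a DStream: a sequence of RDDs
  val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  counts.print()                                         // print a few counts per batch

  ssc.start()                                            // start receiving and processing
  ssc.awaitTermination()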

Spark Components
Spark MLlib

» A scalable machine learning library built on top of Spark
» Uses Breeze, the ScalaNLP library for scientific computing, machine learning and natural language processing
» Ships extensive algorithms, including regression, clustering and many others, in the spark.mllib and spark.ml packages (see the sketch below)

SparkR

» An R interface built on top of Spark Core, integrated with Spark since version 1.4
» Exposes components such as DataFrames and RDDs to R
» Provides a light-weight front end
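As a small illustration of the RDD-based spark.mllib API mentioned above, the following sketch clusters points with K-means in the Spark shell. The input file points.txt is a hypothetical example with space-separated numeric features per line.

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  val data   = sc.textFile("points.txt")   // hypothetical input
  val parsed = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

  val model = KMeans.train(parsed, 3, 20)  // k = 3 clusters, 20 iterations
  println("Cluster centers: " + model.clusterCenters.mkString(", "))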

Spark – A Polyglot
Apache Spark applications can be developed in several languages: Scala, Java, Python and R

Spark – History
 Developed at UC Berkeley's AMPLab in 2009

 Started by Matei Zaharia and open sourced in 2010

 Written in the functional programming language Scala and uses Akka

 Developed along with Mesos, a cluster management framework

 Transferred to the Apache Software Foundation in 2013

 Spark went on to become a top-level Apache project in February 2014

 Spark was developed targeting interactive, iterative computations such as machine learning

Spark Versions & Releases

Spark 0.6.0 – 5 October 2012
» Major release with a focus on performance enhancements
» Addition of a standalone deploy mode, extended Java and Python support, etc.

Spark 0.8.0 – 25 September 2013
» First release as part of Apache
» Introduction of MLlib and support for running Spark in IPython

Spark 1.0.0 – 30 May 2014
» API stability, integration with YARN security
» Spark SQL; MLlib, GraphX and Streaming improvements

Spark 1.2.0 – 18 December 2014
» Performance and stability improvements
» Introduction of a Python API for Spark Streaming, and high-availability support through a write-ahead log (WAL)

Spark 1.4.0 – 11 June 2015
» Inclusion of SparkR, based on Spark DataFrames
» Enhanced support for Kafka and Kinesis

Spark 1.5.0 – 9 September 2015
» Sixth release in the 1.x line, with support for YARN cluster mode in R and the introduction of new feature transformers

Annie’s Question

Question: Apache Spark is not compatible with existing Hadoop infrastructure.

a) True
b) False

Annie’s Answer

Answer: False

Annie’s Question

Question: What are the different cluster managers supported by Apache Spark?

a) Apache Mesos
b) Spark Standalone Cluster
c) YARN
d) All of them

Annie’s Answer

Answer: All of them

Demo
Download & Configure Spark

Demo
Invoking Spark Shell

What is Scala?
 A general-purpose, scalable programming language

 Designed to express common programming patterns in a concise, elegant and type-safe way

 Supports both object-oriented and functional programming styles, thus helping programmers be more productive

 Publicly released in January 2004 on the JVM platform, and a few months later on the .NET platform

 Martin Odersky and his team started developing Scala in 2001

Why Scala?
 Type inference

 Runs on the JVM, which means Scala is completely interoperable with Java

 Allows growing new types and control constructs

 Both functional and object-oriented

 Allows you to define implicit conversions (see the snippets below)

 Has very powerful libraries and APIs

 Many powerful frameworks, such as Spark and Kafka, are written in Scala
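A few small, illustrative snippets of these points (type inference, Java interoperability and implicit conversions); the conversion shown is a made-up example, not part of any library.

  import scala.language.implicitConversions

  val n = 42                                  // the type Int is inferred, no annotation needed
  val names = List("spark", "kafka")          // inferred as List[String]

  // Java interop: any Java class on the classpath can be used directly
  val sb = new java.lang.StringBuilder("hello").append(", scala")

  // A user-defined implicit conversion (hypothetical)
  implicit def intToString(i: Int): String = i.toString
  val s: String = 7                           // the compiler inserts intToString(7)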

Functional Programming
 Apart from being a pure object-oriented language, Scala is also a functional programming language

 Functional programming is driven mainly by two ideas (see the sketch after this list):

» The first is that functions are first-class values: they are treated just like any other type, say String or Int, so functions can be passed as arguments and defined inside other functions

» The second is that the operations of a program should map input values to output values rather than change data in place, which leads to immutable data structures

 Scala supports both immutable and mutable data structures; however, the preferred choice is the immutable ones
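A small Scala sketch of the two ideas above; the names are illustrative only.

  // 1) Functions are first-class values: they can be stored in vals and passed as arguments
  val double: Int => Int = x => x * 2
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
  applyTwice(double, 5)                      // 20

  // 2) Map inputs to outputs instead of changing data in place
  val xs = List(1, 2, 3)
  val ys = xs.map(_ + 1)                     // List(2, 3, 4); xs itself is unchanged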

Scala REPL

 REPL: Read – Evaluate – Print – Loop

 The easiest way to get started with Scala; it acts as an interactive shell

 Even though it behaves like an interpreter, all typed code is compiled to bytecode and executed on the JVM

 Invoked by typing scala at the command prompt

Scala REPL Explained

 After you type an expression, such as 10 + 2, and hit enter:

scala> 10 + 2

 The interpreter will print:

res0: Int = 12

 This line includes:

» An automatically generated or user-defined name to refer to the computed value (res0, which means result 0)
» A colon (:), followed by the type of the expression (Int)
» An equals sign (=)
» The value resulting from evaluating the expression (12)
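A couple more lines typed into the REPL, for illustration, showing both the automatically generated names (res0, res1, …) and a user-defined name:

scala> 10 + 2
res0: Int = 12

scala> res0 * 2
res1: Int = 24

scala> val greeting = "Hello, Spark"
greeting: String = Hello, Spark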

Annie’s Question

Question: Even though Scala is a functional language, it does not support the object-oriented paradigm.

i) True
ii) False

Annie’s Answer

Answer: False

Demo
Execute Scala Code in Spark Shell

SparkContext
 Every Spark program starts by creating a SparkContext object, which tells Spark how to access the cluster

 Represented by the sc variable in the Spark shell

 The master parameter of a SparkContext determines which cluster to use (see the sketch after the table):

Master               Description
local                Run Spark locally with one worker thread (no parallelism)
local[K]             Run Spark locally with K worker threads (ideally set to the number of cores)
spark://HOST:PORT    Connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://HOST:PORT    Connect to a Mesos cluster; PORT depends on config (5050 by default)
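In the Spark shell the sc variable is created for you; in a standalone Scala program you create it yourself, roughly as in the sketch below. The application name and the local[2] master URL are assumptions for illustration; substitute a spark:// or mesos:// URL as needed.

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("my-spark-app")        // hypothetical application name
    .setMaster("local[2]")             // run locally with 2 worker threads
  val sc = new SparkContext(conf)

  // ... define and act on RDDs here ...

  sc.stop()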

SparkContext - Master

RDD (Resilient Distributed Datasets)
 Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner

 RDDs can be created from many data sources, e.g. a Scala collection, the local file system, Hadoop, Amazon S3, an HBase table, etc.

 Spark supports text files, SequenceFiles and any other Hadoop InputFormat, and can also take a directory or a glob (e.g. /ip/2014*)

 Even when an RDD is defined, it does not yet contain any data

 The computation that creates the data in an RDD is only done when the data is referenced, e.g. when caching results or writing out the RDD

 This is a very important feature for the development process, as code can be written, compiled and even run on a development environment without actually loading the original data (see the sketch below)
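A small sketch of this lazy behaviour in the Spark shell (where sc already exists); data.txt is a hypothetical input file.

  val lines  = sc.textFile("data.txt")             // RDD defined, but no data is read yet
  val errors = lines.filter(_.contains("ERROR"))   // still only a recipe (a transformation)

  errors.cache()                                   // keep the result in memory once computed
  val n = errors.count()                           // an action: only now is data.txt actually read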

Annie’s Question

Question: Creation of a SparkContext object is optional in a Spark program.

i) True
ii) False

Annie’s Answer

Answer: False

Demo
Explore the Spark cluster

Demo
Execute word count on a local Spark cluster (see the sketch below)
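The demo code itself is not reproduced in these slides; a typical word count in the Spark shell looks like the following sketch. The input and output paths are assumptions; for the HDFS variant later in the demos, an hdfs://... path would be used instead.

  val text   = sc.textFile("file:///home/edureka/input.txt")   // hypothetical input path
  val counts = text.flatMap(_.split("\\s+"))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)

  counts.take(10).foreach(println)                             // inspect a few results
  counts.saveAsTextFile("file:///home/edureka/wc-output")      // or an hdfs:// path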

Demo
Execute word count on a Spark standalone cluster

Demo
Execute word count on a Spark standalone cluster with HDFS as the file system

Demo
Compare the performance of word count on MapReduce and Spark

Questions

Assignment
Execute all the code discussed in the class

Explore the Spark cluster

Execute word count with multiple workers

Further Reading
http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases

http://www.edureka.co/blog/insights-on-hbase-architecture/

http://www.edureka.co/blog/sample-hbase-poc/

Pre-work
Go through the blogs mentioned in the Further Reading slide

Agenda for Next Class
 Demo on Flume and Sqoop
 Oozie
 Discussion on a Project

Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us to make
the course better!

Please spare a few seconds to take the survey after the webinar.

