www.edureka.co/big-data-and-hadoop
Course Topics
Module 1
» Understanding Big Data and Hadoop
Module 2
» Hadoop Architecture and HDFS
Module 3
» Hadoop MapReduce Framework
Module 4
» Advanced MapReduce
Module 5
» PIG
Module 6
» HIVE
Module 7
» Advanced HIVE and HBase
Module 8
» Advanced HBase
Module 9
» Processing Distributed Data with Apache Spark
Module 10
» Oozie and Hadoop Project
Objectives
By the end of this module, you will understand:
Spark Ecosystem
Spark Components
Spark as a Polyglot
What is Scala?
Why Scala?
SparkContext
RDD
What is Apache Spark?
An in-memory cluster computing platform with the following components on top of Spark Core:
i) Spark SQL
ii) Spark Streaming
iii) GraphX
iv) MLlib
Why Apache Spark?
Runs analytics on large data sets faster by keeping data in memory across a cluster of machines
As a unified stack, makes it easy to adopt new components alongside existing ones
One of Spark's most powerful features is that it integrates easily with Hadoop clusters and Cassandra
Supports different data formats and sources such as JSON, Parquet, and Hive tables
Spark Streaming integrates easily with high-throughput frameworks such as Kafka
Spark – Ecosystem
[Figure: Spark ecosystem diagram; surviving label: Spark Core]
Spark Components
Spark Core
Spark SQL
Data Frame
» A Spark package built on top of Spark Core for working with structured data
» Provides interactive SQL queries to explore data
» Defines DataFrames: distributed collections of data grouped into named columns
» Also supports Apache Hive SQL (HQL)
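The DataFrame and SQL points above can be sketched roughly as follows. This is a minimal illustration, not the course's demo code: it assumes Spark 2.0 or later with the spark-sql module on the classpath, and the object name, the `people` view, and the sample rows are all made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameSketch {
  // Returns the names of people aged 21 or over, queried through Spark SQL.
  def adultNames(spark: SparkSession): Seq[String] = {
    import spark.implicits._ // enables .toDF on local Scala collections

    // A DataFrame: rows grouped into the named columns "name" and "age".
    // The sample data is hypothetical.
    val people = Seq(("Ana", 34), ("Ben", 19), ("Cy", 45)).toDF("name", "age")

    // Register the DataFrame as a temporary view so SQL can refer to it.
    people.createOrReplaceTempView("people")

    spark.sql("SELECT name FROM people WHERE age >= 21 ORDER BY name")
      .collect()
      .map(_.getString(0))
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameSketch") // hypothetical application name
      .master("local[2]")         // run locally with 2 worker threads
      .getOrCreate()
    try println(adultNames(spark).mkString(", ")) // prints "Ana, Cy"
    finally spark.stop()
  }
}
```

The same query could also be written with DataFrame methods such as `filter` and `select`; the SQL form is used here because the slide emphasizes interactive SQL.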
Spark Components
Spark GraphX
Spark Components
Spark MLlib
» A scalable machine learning library built on top of Spark
» Uses Breeze, a ScalaNLP (Scientific Computing, Machine Learning, and Natural Language Processing) library
» Provides an extensive set of algorithms, including regression, clustering, and many others
[Figure: component diagram; labels: MLlib, spark.mllib, spark.ml, SparkR, Spark Core]
Spark – A Polyglot
Apache Spark applications can be developed in several languages, including Scala, Java, Python, and R
Spark – History
Developed at UC Berkeley's AMPLab in 2009
Spark Version & Releases
[Timeline figure: version numbers not recoverable; release notes listed in slide order]
» Major release of Spark with a focus on performance enhancements
» Addition of a standalone deploy mode
» First release as part of Apache
» Introduction of MLlib and support for running Spark in IPython
» API stability, integration with YARN security, Spark SQL, MLlib, GraphX and Streaming improvements, extended Java and Python support, etc.
Spark Version & Releases
[Timeline figure: version numbers not recoverable; release notes listed in slide order]
» Performance improvement and stability
» Enhanced support for Kafka and Kinesis
» Inclusion of SparkR based on Spark DataFrame
» Introduction of Python API to Spark Streaming and support through write-ahead log (WAL)
» Sixth release in 1.x of Spark with support for YARN cluster mode in R, and introduction of new feature transformers
Annie’s Question
Annie’s Answer
Answer: False
Annie’s Question
Annie’s Answer
Answer: All of them
Demo
Download & Configure Spark
Demo
Invoking Spark Shell
What is Scala?
A general-purpose scalable programming language
Designed to express common programming patterns in a concise, elegant, and type-safe way
Supports both object-oriented and functional programming styles, thus helping programmers to be more productive
Publicly released in January 2004 on the JVM platform and a few months later on the .NET platform
Why Scala?
Type Inference
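Type inference means the Scala compiler works out the types of values and results that are not annotated. A small sketch, using only the standard library (the object and value names are invented for illustration):

```scala
object TypeInferenceSketch {
  // No type annotations on these values: the compiler infers them.
  val count = 10                     // inferred as Int
  val ratio = 2.5                    // inferred as Double
  val words = List("spark", "scala") // inferred as List[String]

  // The return type (Int) is inferred from the method body.
  def totalLength(items: List[String]) = items.map(_.length).sum

  def main(args: Array[String]): Unit = {
    // These explicit annotations compile only because they match the inferred types.
    val c: Int = count
    val r: Double = ratio
    val n: Int = totalLength(words)
    println(s"$c $r $n") // prints "10 2.5 10"
  }
}
```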
Functional Programming
Apart from being a pure object-oriented language, Scala is also a functional programming language
» The first main idea is that functions are first-class values. They are treated just like any other type, such as String or Integer, so functions can be passed as arguments and can be defined inside other functions
» The second main idea is that the operations of a program should map input values to output values rather than change data in place. This leads to immutable data structures
Scala supports both immutable and mutable data structures. However, immutable ones are preferred
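Both ideas can be seen in a few lines of plain Scala. The sketch below is illustrative only; the object and function names are invented for the example.

```scala
object FunctionalSketch {
  // A function stored in a value, just like a String or an Int would be.
  val double: Int => Int = x => x * 2

  // A function that takes another function as an argument.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // A function defined inside another function.
  def sumOfSquares(xs: List[Int]): Int = {
    def square(n: Int): Int = n * n
    xs.map(square).sum
  }

  def main(args: Array[String]): Unit = {
    val original = List(1, 2, 3)       // List is immutable
    val doubled = original.map(double) // builds a new list; original is unchanged
    println(applyTwice(double, 5))     // prints 20
    println(sumOfSquares(original))    // prints 14
    println(original)                  // prints List(1, 2, 3)
    println(doubled)                   // prints List(2, 4, 6)
  }
}
```

Because `map` returns a new list instead of modifying `original`, the input can be reused safely, which is the same style Spark uses when transforming distributed data.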
Scala REPL
REPL: Read-Evaluate-Print Loop
The easiest way to get started with Scala; it acts as an interactive shell interpreter
Although it looks like an interpreter, all typed code is compiled to bytecode and then executed
Scala REPL Explained
After you type an expression, such as 10 + 2, and press Enter:
scala> 10 + 2
Annie’s Question
i) True
ii) False
Annie’s Answer
Answer: False
Demo
Execute Scala Code in Spark Shell
SparkContext
Every Spark program starts by creating a SparkContext object, which tells Spark how to access the cluster
Master               Description
local                Run Spark locally with one worker thread (no parallelism)
local[K]             Run Spark locally with K worker threads (ideally set to the number of cores)
spark://HOST:PORT    Connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://HOST:PORT    Connect to a Mesos cluster; PORT depends on config (5050 by default)
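A minimal sketch of creating a SparkContext with one of the master values above. It assumes spark-core is on the classpath and is meant as a standalone program (inside spark-shell a context named `sc` already exists). The object name, application name, and the small sum job are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextSketch {
  // Sums the integers 1..n on whatever cluster the context points at.
  def sumTo(sc: SparkContext, n: Int): Int =
    sc.parallelize(1 to n).reduce(_ + _)

  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark in-process with 2 worker threads.
    // For a standalone cluster the master would look like "spark://HOST:7077",
    // where HOST is a placeholder for the real master host.
    val conf = new SparkConf()
      .setAppName("SparkContextSketch") // hypothetical application name
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(sumTo(sc, 10)) // prints 55
    finally sc.stop()          // release the context when done
  }
}
```

In practice the master is often left out of the code and supplied when the job is submitted, so the same program can run locally or on a cluster.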
SparkContext - Master
RDD (Resilient Distributed Datasets)
Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner
RDDs can be created from any data source, e.g. a Scala collection, the local file system, Hadoop, Amazon S3, or an HBase table
Spark supports text files, SequenceFiles, and any other Hadoop InputFormat, and can also take a directory or a glob (e.g. /ip/2014*)
Defining an RDD does not load any data into it
The computation that produces an RDD's data runs only when an action needs that data, e.g. collecting results or writing out the RDD
This is very useful during development, as code can be written, compiled, and even run in a development environment without actually loading the original data!
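The deferred computation described above can be observed with a counter. The sketch below assumes Spark 2.0 or later (for `longAccumulator`) with spark-core on the classpath; the object name and sample words are invented. It counts how often the `map` function has run before and after an action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRddSketch {
  // Returns how many times the map function had run (before, after) the action.
  def callsBeforeAndAfter(sc: SparkContext): (Long, Long) = {
    val calls = sc.longAccumulator("mapCalls")

    // Defining the RDD and its transformation does not process any element yet.
    val lengths = sc.parallelize(Seq("spark", "scala", "rdd"))
      .map { word => calls.add(1); word.length }
    val before = calls.sum

    // count() is an action, so it triggers the actual computation.
    lengths.count()
    val after = calls.sum

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LazyRddSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(callsBeforeAndAfter(sc)) // (0,3) in a clean local run with no task retries
    finally sc.stop()
  }
}
```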
Annie’s Question
i) True
ii) False
Annie’s Answer
Answer: False
Demo
Explore the Spark cluster
Demo
Execute word count on a local Spark cluster
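The demo code itself is not part of these slides. As a rough idea of what a word count in local mode can look like, here is a sketch that assumes spark-core on the classpath; the object name and the in-memory sample lines are invented, and a real run would more likely read its input with `sc.textFile`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  // Counts how often each lower-cased word occurs in the given lines.
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)
      .flatMap(_.toLowerCase.split("\\s+")) // split each line into words
      .filter(_.nonEmpty)                   // drop empty tokens
      .map(word => (word, 1))               // pair each word with a count of 1
      .reduceByKey(_ + _)                   // add up the counts per word
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample input.
      val counts = wordCounts(sc, Seq("to be or not to be", "to do"))
      counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w: $n") }
    } finally sc.stop()
  }
}
```

Changing only the master value (for example to a `spark://HOST:PORT` URL) would point the same job at a standalone cluster.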
Demo
Execute word count on a Spark standalone cluster
Demo
Execute word count on a Spark standalone cluster with HDFS as the file system
Demo
Compare the performance of word count in MapReduce and Spark
Questions
Assignment
Run all the code discussed in class
Further Reading
http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases
http://www.edureka.co/blog/insights-on-hbase-architecture/
http://www.edureka.co/blog/sample-hbase-poc/
Pre-work
Go through the blogs mentioned in the Further Reading slide
Agenda for Next Class
Demo on Flume and Sqoop
Oozie
Discussion on a Project
Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us to make
the course better!
Please spare a few seconds to take the survey after the webinar.