

Set up Apache Spark on a Multi-Node Cluster

rahul nayak
Mar 8, 2018 · 4 min read

This blog explains how to install Apache Spark on a multi-node cluster,
with step-by-step instructions to deploy and configure Spark on a real
multi-node cluster.

Spark is a fast and powerful framework that provides an API to perform
massive distributed processing over resilient distributed datasets.

Recommended Platform


OS - Linux is supported as both a development and deployment platform;
Windows is supported only as a development platform.

Installing Apache Spark on a multi-node cluster requires multiple
nodes; for that you can use multiple physical machines or AWS
instances.

Spark Architecture
Apache Spark follows a master/slave architecture with two main
daemons and a cluster manager:

• Master Daemon (Master/Driver Process)

• Worker Daemon (Slave Process)

• Cluster Manager

[Figure: Spark architecture]

A Spark cluster has a single master and any number of slaves/workers.
The driver and the executors run as individual Java processes; users
can run them on the same machines (a horizontal Spark cluster), on
separate machines (a vertical Spark cluster), or in a mixed machine
configuration.
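
For example, once the cluster built in this guide is running, the
standard spark-submit tool illustrates this split: the driver runs
where you choose via --deploy-mode, while executors run on the
workers. A sketch, assuming the Spark 2.3.0 package is installed under
/usr/local/spark as described below:

$ spark-submit --master spark://<MASTER-IP>:7077 \
    --deploy-mode client \
    --class org.apache.spark.examples.SparkPi \
    /usr/local/spark/examples/jars/spark-examples_2.11-2.3.0.jar 100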

Prerequisites
Create a user with the same name on the master and all the slaves to
make your tasks easier during SSH, and switch to that user on the
master.
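
A minimal sketch (the user name sparkuser is just an illustration;
use whatever name you prefer):

$ sudo adduser sparkuser   # run on the master and on every slave
$ su - sparkuser           # then switch to that user on the master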


Add entries to the hosts file (master and slaves)

Edit the hosts file.

$ sudo vim /etc/hosts

Now add the entries for the master and slaves to the hosts file.

<MASTER-IP> master
<SLAVE01-IP> slave01
<SLAVE02-IP> slave02

Install Java 8 (master and slaves)

Note: Spark 2.3.x requires Java 8 or later, so install Java 8.

$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

To check if Java is installed, run the following command.

$ java -version

Install Scala (master and slaves)

$ sudo apt-get install scala

To check if Scala is installed, run the following command.


$ scala -version

Configure SSH (only master)

Install the OpenSSH server and client.

$ sudo apt-get install openssh-server openssh-client

Generate an RSA key pair with an empty passphrase.

$ ssh-keygen -t rsa -P ""

Configure passwordless SSH

Copy the content of .ssh/id_rsa.pub (on the master) to
.ssh/authorized_keys (on all the slaves, as well as on the master
itself).
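
A convenient way to do this is ssh-copy-id (a sketch; replace
sparkuser with the user created earlier):

$ ssh-copy-id sparkuser@slave01
$ ssh-copy-id sparkuser@slave02
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # for the master itself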

Check SSH access to all the slaves

$ ssh slave01
$ ssh slave02

Install Spark
Download the latest version of Spark
Use the following command to download the latest version of Apache
Spark (2.3.0 at the time of writing).

$ wget http://www-us.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz


Extract the Spark tar

Use the following command to extract the Spark tar file.

$ tar xvf spark-2.3.0-bin-hadoop2.7.tgz

Move Spark software files

Use the following command to move the Spark software files to their
destination directory (/usr/local/spark).

$ sudo mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark

Set up the environment for Spark

Edit the bashrc file.

$ sudo vim ~/.bashrc

Add the following line to the ~/.bashrc file. It appends the location
of the Spark binaries to the PATH variable. Note that there must be no
spaces around the = sign.

export PATH=$PATH:/usr/local/spark/bin
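
Optionally, many setups also export SPARK_HOME and add the sbin
directory to PATH, so the cluster scripts used later can be run from
anywhere. A common convention, not required by this guide:

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin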

Use the following command to source the ~/.bashrc file.

$ source ~/.bashrc

Note: The whole Spark installation procedure must be carried out on
the master as well as on all the slaves.


Spark Master Configuration

Perform the following procedures only on the master.

Edit spark-env.sh
Move to the Spark conf folder, make a copy of the spark-env.sh
template, and rename it.

$ cd /usr/local/spark/conf
$ cp spark-env.sh.template spark-env.sh

Now edit the configuration file spark-env.sh.

$ sudo vim spark-env.sh

And set the following parameters.

export SPARK_MASTER_HOST='<MASTER-IP>'

export JAVA_HOME=<Path_of_JAVA_installation>
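
You can also cap the resources each worker offers via spark-env.sh
(optional; the values below are illustrative, adjust them to your
hardware):

export SPARK_WORKER_CORES=2      # cores each worker may use
export SPARK_WORKER_MEMORY=2g    # memory each worker may use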

Add Workers
Edit the configuration file slaves (in /usr/local/spark/conf).

$ sudo vim slaves

And add the following entries.

master
slave01
slave02


Start Spark Cluster


To start the Spark cluster, run the following command on the master.

$ cd /usr/local/spark
$ ./sbin/start-all.sh

To stop the Spark cluster, run the following command on the master.

$ cd /usr/local/spark
$ ./sbin/stop-all.sh
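
Alternatively, the standalone scripts that ship with Spark let you
start the daemons individually (a sketch; run each command on the
appropriate node):

$ ./sbin/start-master.sh                           # on the master
$ ./sbin/start-slave.sh spark://<MASTER-IP>:7077   # on each slave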

Check whether the services have been started

To check the daemons on the master and slaves, use the following
command.

$ jps
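
On the master you should see a Master process (and a Worker, since the
master is also listed in the slaves file); on each slave, a Worker. A
sketch of the expected output (PIDs will differ):

$ jps
4416 Master
4573 Worker
4674 Jps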

Spark Web UI
Browse the Spark UI to see details about worker nodes, running
applications, and cluster resources.

Spark Master UI

http://<MASTER-IP>:8080/


Spark Application UI

http://<MASTER_IP>:4040/

You can proceed further with Spark shell commands to play with Spark.
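
For example, to open a Spark shell connected to the cluster and run a
quick smoke test (a sketch; the sum is just an arbitrary computation):

$ spark-shell --master spark://<MASTER-IP>:7077

scala> sc.parallelize(1 to 1000).sum()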

. . .

You can use the following link to learn how to use PySpark (the Spark
Python API) in a Jupyter Notebook:

Get Started with PySpark and Jupyter Notebook in 3 Minutes
(blog.sicara.com)

