Getting Started with PySpark for Big Data Analytics using Jupyter Notebooks and Jupyter Docker Stacks

Gary A. Stafford
Nov 22, 2018 · 16 min read
https://medium.com/@GaryStafford/getting-start...

Introduction
There is little question that big data analytics, data science, artificial
intelligence (AI), and machine learning (ML), a subcategory of AI,
have all experienced a tremendous surge in popularity over the last
few years. Behind the hype curves and marketing buzz, these
technologies are having a significant influence on all aspects of our
modern lives. Due to their popularity and potential benefits,
academic institutions and commercial enterprises are rushing to train
large numbers of Data Scientists and ML and AI Engineers.

Search results courtesy GoogleTrends (https://trends.google.com)

Learning popular programming languages and platforms, such as Python, Scala, R, Apache Hadoop, Apache Spark, and Apache Kafka, requires the use of multiple complex technologies. Installing, configuring, and managing these technologies often demands an advanced level of familiarity with Linux, distributed systems, cloud- and container-based platforms, databases, and data-streaming applications. These barriers may prove a deterrent to students, mathematicians, statisticians, and data scientists.

Search results courtesy GoogleTrends (https://trends.google.com)

Driven by the explosive growth of these technologies and the need to train individuals, many commercial enterprises are lowering the barriers to entry, making it easier to get started. The three major cloud providers, AWS, Azure, and Google Cloud, all have multiple Big Data-, AI-, and ML-as-a-Service offerings.

Similarly, many open-source projects are also lowering the barriers to entry into these technologies. An excellent example of an open-source project working on this challenge is Project Jupyter. Similar to the Spark Notebook and Apache Zeppelin projects, Jupyter Notebooks enable data-driven, interactive, and collaborative data analytics with Julia, Scala, Python, R, and SQL.

This post will demonstrate the creation of a containerized development environment, using Jupyter Docker Stacks. The environment will be suited for learning and developing applications for Apache Spark, using the Python, Scala, and R programming languages. This post is not intended to be a tutorial on Spark, PySpark, or Jupyter Notebooks.


Featured Technologies
The following technologies are featured prominently in this post.

Jupyter Notebooks
According to Project Jupyter, the Jupyter Notebook, formerly known
as the IPython Notebook, is an open-source web application that
allows users to create and share documents that contain live code,
equations, visualizations, and narrative text. Uses include data
cleaning and transformation, numerical simulation, statistical
modeling, data visualization, machine learning, and much more. The
word Jupyter is a loose acronym for Julia, Python, and R, but today Jupyter supports many programming languages. Interest in
Jupyter Notebooks has grown dramatically.

Search results courtesy GoogleTrends (https://trends.google.com)


Jupyter Docker Stacks


To enable quick and easy access to Jupyter Notebooks, Project Jupyter has created Jupyter Docker Stacks. The stacks are ready-to-run Docker images containing Jupyter applications, along with accompanying technologies. Currently, there are eight different Jupyter Docker Stacks, each focused on a particular area of practice. They include SciPy (Python-based mathematics, science, and engineering), TensorFlow, R Project for statistical computing, Data Science with Julia, and the main subject of this post, PySpark. The stacks also include a rich variety of well-known packages to extend their functionality, such as scikit-learn, pandas, Matplotlib, Bokeh, ipywidgets (interactive HTML widgets), and Facets.

Apache Spark
According to Apache, Spark is a unified analytics engine for large-
scale data processing, used by well-known, modern enterprises, such
as Netflix, Yahoo, and eBay. With speeds up to 100x faster than
Hadoop, Apache Spark achieves high performance for static, batch,
and streaming data, using a state-of-the-art DAG (Directed Acyclic
Graph) scheduler, a query optimizer, and a physical execution engine.

Spark’s polyglot programming model allows users to write applications quickly in Scala, Java, Python, R, and SQL. Spark includes libraries for Spark SQL (DataFrames and Datasets), MLlib (Machine Learning), GraphX (Graph Processing), and DStreams (Spark Streaming). You can run Spark using its standalone cluster mode, on Amazon EC2, Apache Hadoop YARN, Mesos, or Kubernetes.

PySpark
The Spark Python API, PySpark, exposes the Spark programming
model to Python. PySpark is built on top of Spark’s Java API. Data is
processed in Python and cached and shuffled in the JVM. According to Apache, Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a JVM.

Docker
According to Docker, their technology gives developers and IT the freedom to build, manage, and secure business-critical applications without the fear of technology or infrastructure lock-in. Although Kubernetes is now the leading open-source container orchestration platform, Docker is still the predominant underlying container engine technology. For this post, I am using Docker Desktop Community version for macOS.

Docker Swarm
Current versions of Docker include both a Kubernetes and Swarm
orchestrator for deploying and managing containers. We will choose
Swarm for this demonstration. According to Docker, the cluster management and orchestration features embedded in the Docker Engine are built using SwarmKit, a separate project which implements Docker’s orchestration layer and is used directly within Docker.

PostgreSQL
PostgreSQL is a powerful, open-source object-relational database system. According to their website, PostgreSQL comes with many features aimed at helping developers build applications, helping administrators protect data integrity and build fault-tolerant environments, and helping users manage data no matter how big or small the dataset.

Demonstration
To show the capabilities of the Jupyter development environment, I will demonstrate a few typical use cases, such as executing Python scripts, submitting PySpark jobs, working with Jupyter Notebooks, and reading and writing data, both in different file formats and to and from a database. We will be using the jupyter/all-spark-notebook Docker image. This image includes Python, R, and Scala support for Apache Spark, using Apache Toree.

Architecture
As shown below, we will stand up a Docker stack, consisting of
Jupyter All-Spark-Notebook, PostgreSQL 10.5, and Adminer
containers. The Docker stack will have local directories bind-mounted
into the containers. Files from our GitHub project will be shared with
the Jupyter application container through a bind-mounted directory.
Our PostgreSQL data will also be persisted through a bind-mounted
directory. This allows us to persist data external to the ephemeral
containers.


Source Code
All open-sourced code for this post can be found on GitHub. Use the
following command to clone the project. The post and project code were updated on 6/7/2019.

git clone \
--branch master --single-branch --depth 1 --no-tags \
https://github.com/garystafford/pyspark-setup-demo.git

Source code samples are displayed as GitHub Gists, which may not
display correctly on some mobile and social media browsers.

Deploy Docker Stack


To start, create the $HOME/data/postgres directory to store PostgreSQL data files. This directory will be bind mounted into the PostgreSQL container on line 36 of the stack.yml file, $HOME/data/postgres:/var/lib/postgresql/data . The HOME environment variable assumes you are working on Linux or macOS and is equivalent to HOMEPATH on Windows.

The Jupyter container’s working directory is set on line 10 of the stack.yml file, working_dir: /home/$USER/work . The local bind-mounted working directory is $PWD/work . This path is bind-mounted to the working directory in the Jupyter container, on line 24 of the stack.yml file, $PWD/work:/home/$USER/work . The PWD environment variable assumes you are working on Linux or macOS ( CD on Windows).

By default, the user within the Jupyter container is jovyan . Optionally, I have chosen to override that user with my own local host’s user account, as shown on line 16 of the stack.yml file, NB_USER: $USER . I have used the macOS host’s USER environment variable value (equivalent to USERNAME on Windows). There are many options for configuring the Jupyter container, detailed here. Several of those options are shown on lines 12-18 of the stack.yml file (gist).


1  version: "3.7"
2  services:
3    pyspark:
4      image: jupyter/all-spark-notebook:latest
5      ports:
6        - "8888:8888/tcp"
7        - "4040:4040/tcp"
8      networks:
9        - pyspark-net
10     working_dir: /home/$USER/work
11     environment:
12       CHOWN_HOME: "yes"
13       GRANT_SUDO: "yes"
14       NB_UID: 1000
15       NB_GID: 100
16       NB_USER: $USER
17       NB_GROUP: staff
18     user: root
19     deploy:
20       replicas: 1
21       restart_policy:
22         condition: on-failure
23     volumes:
24       - $PWD/work:/home/$USER/work
25   postgres:
26     image: postgres:11.3
27     environment:
28       POSTGRES_USER: postgres
29       POSTGRES_PASSWORD: postgres1234
30       POSTGRES_DB: demo
31     ports:
32       - "5432:5432/tcp"
33     networks:
34       - pyspark-net
35     volumes:
36       - $HOME/data/postgres:/var/lib/postgresql/data
37     deploy:
38       restart_policy:
39         condition: on-failure
40   adminer:
41     image: adminer:latest
42     ports:
43       - "8080:8080/tcp"
44     networks:
45       - pyspark-net
46     deploy:
47       restart_policy:
48         condition: on-failure

Assuming you have a recent version of Docker installed on your local development machine, and running in swarm mode, standing up the stack is as easy as running the following command from the root directory of the project:

docker stack deploy -c stack.yml pyspark

The Docker stack consists of a new overlay network, pyspark-net , and the three containers. To confirm the stack deployed successfully, you can run the following command:

docker stack ps pyspark --no-trunc

Note the jupyter/all-spark-notebook container is quite large. Depending on your Internet connection, if this is the first time you have pulled this Docker image, the stack may take several minutes to enter a running state.

To access the Jupyter Notebook application, you need to obtain the Jupyter URL and access token (read more here). This information is output in the Jupyter container log, which can be accessed with the following command:


docker logs $(docker ps | grep pyspark_pyspark | awk '{print $NF}')

Using the URL and token shown in the log output, you will be able to access the Jupyter web-based user interface on localhost port 8888. Once there, from the Jupyter dashboard landing page, you should see all the files in the project’s work/ directory.

Also shown below, note the types of files you are able to create from the dashboard, including Python 3, R, Scala (using Toree or spylon-kernel), and text. You can also open a Jupyter Terminal or create a new Folder.


Running Python Scripts


Instead of worrying about installing and maintaining the latest version of Python and packages on your own development machine, we can run our Python scripts from the Jupyter container. At the time of this post, the latest jupyter/all-spark-notebook Docker image runs Python 3.7.3 and Conda 4.6.14. Let’s start with a simple example of the Jupyter container’s capabilities by running a Python script. I’ve included a sample Python script, 01_simple_script.py.

#!/usr/bin/python

import random

technologies = ['PySpark', 'Python', 'Spark', 'Scala', 'JVM',
                'Project Jupyter', 'PostgreSQL']
print("Technologies: %s" % technologies)

technologies.sort()
print("Sorted: %s" % technologies)

print("I'm interested in learning %s." % random.choice(technologies))

Run the script from within the Jupyter container, from a Jupyter
Terminal window:


python ./01_simple_script.py

You should observe the following output.

Kaggle Datasets
To explore the features of the Jupyter Notebook container and PySpark, we will use a publicly available dataset from Kaggle. Kaggle is a fantastic resource for open datasets used for big-data and ML applications. Their tagline is ‘Kaggle is the place to do data science projects’.

For this demonstration, I chose the ‘Transactions from a Bakery’ dataset from Kaggle.


The dataset contains 21,294 rows, each with four columns of data.
Although certainly nowhere near ‘big data’, the dataset is large
enough to test out the Jupyter container functionality (gist).

Date Time Transaction Item
2016-10-30 09:58:11 1 Bread
2016-10-30 10:05:34 2 Scandinavian
2016-10-30 10:05:34 2 Scandinavian
2016-10-30 10:07:57 3 Hot chocolate
2016-10-30 10:07:57 3 Jam
2016-10-30 10:07:57 3 Cookies
2016-10-30 10:08:41 4 Muffin
2016-10-30 10:13:03 5 Coffee
2016-10-30 10:13:03 5 Pastry
2016-10-30 10:13:03 5 Bread

(bakery_data.csv)
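As a quick sanity check before loading the file into Spark, the same sample rows can be tallied with plain Python. This is a hedged sketch; the inline rows below mirror the sample above rather than reading the full Kaggle file:

```python
from collections import Counter

# The first few rows of the bakery dataset, mirroring the sample above;
# in practice these would be read from the CSV file with the csv module.
rows = [
    ('2016-10-30', '09:58:11', 1, 'Bread'),
    ('2016-10-30', '10:05:34', 2, 'Scandinavian'),
    ('2016-10-30', '10:05:34', 2, 'Scandinavian'),
    ('2016-10-30', '10:07:57', 3, 'Hot chocolate'),
    ('2016-10-30', '10:07:57', 3, 'Jam'),
    ('2016-10-30', '10:07:57', 3, 'Cookies'),
    ('2016-10-30', '10:08:41', 4, 'Muffin'),
    ('2016-10-30', '10:13:03', 5, 'Coffee'),
    ('2016-10-30', '10:13:03', 5, 'Pastry'),
    ('2016-10-30', '10:13:03', 5, 'Bread'),
]

# Count line items per transaction, the same kind of aggregation we will
# later perform with Spark SQL over the full dataset.
items_per_transaction = Counter(row[2] for row in rows)
print(items_per_transaction)  # transactions 3 and 5 each have three items
```

The same tally over all 21,294 rows is what a Spark groupBy on the transaction column would produce.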

Submitting Spark Jobs


We are not limited to Jupyter Notebooks to interact with Spark; we can also submit scripts directly to Spark from a Jupyter Terminal, or from our IDE. I have included a simple Python script, 02_bakery_dataframes.py. The script loads the Kaggle Bakery dataset from the CSV file into a Spark DataFrame. The script then prints out the top ten rows of data, along with a count of the total number of rows in the DataFrame.

#!/usr/bin/python

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession \
    .builder \
    .appName('pyspark_demo_app') \
    .config('spark.driver.extraClassPath',
            'postgresql-42.2.5.jar') \
    .getOrCreate()

sc = spark.sparkContext

bakery_schema = StructType([
    StructField('date', StringType(), True),
    StructField('time', StringType(), True),
    StructField('transaction', IntegerType(), True),
    StructField('item', StringType(), True)
])

df3 = spark.read \
    .format('csv') \
    .option('header', 'true') \
    .load('BreadBasket_DMS.csv', schema=bakery_schema)

df3.show(10)
print('Total rows: %d' % df3.count())

Run the script directly from a Jupyter Terminal window:

python ./02_bakery_dataframes.py

An example of the output of the Spark job is shown below. At the time
of this post, the latest jupyter/all-spark-notebook Docker Image
runs Spark 2.4.3, Scala 2.11.12, and Java 1.8.0_191 using the
OpenJDK.


More typically, you would submit the Spark job using the spark-submit command. Use a Jupyter Terminal window to run the following command:

$SPARK_HOME/bin/spark-submit 02_bakery_dataframes.py

Below, we see the beginning of the output from Spark, using the
spark-submit command.


Below, we see the scheduled tasks executing and the output of the
print statement, displaying the top 10 rows of bakery data.

Interacting with Databases


Often with Spark, you are loading data from one or more data sources (input). After performing operations and transformations on the data, the data is persisted or conveyed to another system for further processing (output).

To demonstrate the flexibility of the Jupyter Docker Stacks to work with databases, I have added PostgreSQL to the Docker stack. We can read and write data from the Jupyter container to the PostgreSQL instance, running in a separate container.

To begin, we will run a Python script that executes a set of SQL statements to create our database schema and some test data in a new database table. To do so, we will need to install the psycopg2 package into our Jupyter container. You can use the docker exec command from your terminal. Alternatively, since our user has administrative access, we can install Python packages within the Jupyter container using the Jupyter Terminal window. Both pip and conda are available to install packages; see details here.


Run the following command to install psycopg2 :

# using pip
docker exec -it \
$(docker ps | grep pyspark_pyspark | awk '{print $NF}') \
pip install psycopg2-binary

This package gives Python the ability to interact with PostgreSQL. The included Python script, 03_load_sql.py, will execute a set of SQL statements, contained in a SQL file, bakery_sample.sql, against the PostgreSQL container instance.

#!/usr/bin/python

import psycopg2

# source: https://stackoverflow.com/questions/45805871/python3-psycopg2-execute-sql-file

connect_str = 'host=postgres port=5432 dbname=demo user=postgres password=postgres1234'
conn = psycopg2.connect(connect_str)
conn.autocommit = True
cursor = conn.cursor()

sql_file = open('bakery_sample.sql', 'r')
sqlFile = sql_file.read()
sql_file.close()
sqlCommands = sqlFile.split(';')
for command in sqlCommands:
    print(command)
    if command.strip() != '':
        cursor.execute(command)
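The script’s split-and-execute pattern can be exercised without a running PostgreSQL instance, using the standard library’s in-memory SQLite database. This is a sketch with hypothetical table and column names, and it assumes, as the script above does, that there are no semicolons inside statement bodies:

```python
import sqlite3

# A small SQL batch, analogous in spirit to bakery_sample.sql
# (these statements are hypothetical, for illustration only).
sql_script = """
CREATE TABLE bakery_basket (transaction_id INTEGER, item TEXT);
INSERT INTO bakery_basket VALUES (1, 'Bread');
INSERT INTO bakery_basket VALUES (2, 'Scandinavian');
"""

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Same approach as 03_load_sql.py: split on ';' and run each statement.
for command in sql_script.split(';'):
    if command.strip() != '':
        cursor.execute(command)

cursor.execute('SELECT COUNT(*) FROM bakery_basket')
print(cursor.fetchone()[0])  # 2
```

The naive split on ';' is fine for simple DDL/DML batches like this one, but would break on statements containing literal semicolons.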

To execute the script, run the following command:

python ./03_load_sql.py

This should result in the following output, if successful.


To confirm the SQL script’s success, I have included Adminer. Adminer (formerly phpMinAdmin) is a full-featured database management tool written in PHP. Adminer natively recognizes PostgreSQL, MySQL, SQLite, and MongoDB, among other database engines.

Adminer should be available on localhost port 8080. The password credentials, shown below, are available in the stack.yml file. The server name, postgres , is the name of the PostgreSQL container. This is the domain name the Jupyter container will use to communicate with the PostgreSQL container.


Connecting to the demo database with Adminer, we should see the bakery_basket table. The table should contain three rows of data, as shown below.

Developing Jupyter Notebooks

The true power of the Jupyter Docker Stacks containers is Jupyter Notebooks. According to the Jupyter Project, the notebook extends the console-based approach to interactive computing in a qualitatively new direction, providing a web-based application suitable for capturing the whole computation process: developing, documenting, and executing code, as well as communicating the results. Notebook documents contain the inputs and outputs of an interactive session as well as additional text that accompanies the code but is not meant for execution.

To see the power of Jupyter Notebooks, I have written a basic notebook document, 04_pyspark_demo_notebook.ipynb. The document performs some typical PySpark functions, such as loading data from a CSV file and from the PostgreSQL database, performing some basic data analytics with Spark SQL, graphing the data using BokehJS, and finally, saving data back to the database, as well as to the popular Apache Parquet file format. Below we see the notebook document, using the Jupyter Notebook user interface.

PostgreSQL Driver
The only notebook document dependency, not natively part of the Jupyter image, is the PostgreSQL JDBC driver. The driver, postgresql-42.2.5.jar , is included in the project and referenced in the configuration of the notebook’s Spark Session. The JAR is added to the spark.driver.extraClassPath runtime environment property. This ensures the JAR is available to Spark (written in Scala) when the job is run.
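A minimal sketch of that configuration, with the JDBC URL and credentials taken from stack.yml. The spark.read.jdbc call is shown commented out, since it requires the running stack and the pyspark package:

```python
# Connection details for the PostgreSQL container; the host name 'postgres'
# is the service name from stack.yml, resolvable on the pyspark-net network.
jdbc_url = 'jdbc:postgresql://postgres:5432/demo'
connection_properties = {
    'user': 'postgres',
    'password': 'postgres1234',
    'driver': 'org.postgresql.Driver',
}

# Inside the notebook, the Spark Session is built with the driver JAR on the
# classpath, and the table is then read via JDBC:
#
# spark = SparkSession.builder \
#     .appName('pyspark_demo_app') \
#     .config('spark.driver.extraClassPath', 'postgresql-42.2.5.jar') \
#     .getOrCreate()
# df = spark.read.jdbc(url=jdbc_url, table='bakery_basket',
#                      properties=connection_properties)

print(jdbc_url)
```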

PyCharm
Since the working directory for the project is shared with the
container, you can also edit files, including notebook documents, in
your favorite IDE, such as JetBrains PyCharm. PyCharm has built-in
language support for Jupyter Notebooks, as shown below.

As mentioned earlier, a key feature of Jupyter Notebooks is their ability to save the output from each cell as part of the notebook document. Below, we see the notebook document on GitHub. The output is saved as part of the notebook document. Not only can you distribute the notebook document, but you can also preserve and share the output from each cell.


Using Additional Packages


As mentioned in the Introduction, the Jupyter Docker Stacks come ready-to-run, with a rich variety of Python packages to extend their functionality. To demonstrate the use of these packages, I have created a second Jupyter notebook document, 05_pyspark_demo_notebook.ipynb. This notebook document uses SciPy (Python-based mathematics, science, and engineering), NumPy (Python-based scientific computing), and the Plotly Python Graphing Library. While NumPy and SciPy are included on the Jupyter Docker image, the notebook used pip to install Plotly. Similar to Bokeh, shown previously, we can combine these libraries to create rich interactive data visualizations. To use Plotly, you will need to sign up for a free account and obtain a username and API key.

Shown below, we use Plotly to construct a bar chart of daily bakery items sold for the year 2017 based on the Kaggle dataset. The chart uses SciPy and NumPy to construct a linear fit (regression) and plot a line of best fit to the bakery data. The chart also uses SciPy’s Savitzky-Golay Filter to plot a second line, illustrating a smoothing of our bakery data.
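The linear fit underlying that chart is ordinary least squares; the closed-form slope and intercept can be sketched in plain Python. The daily-sales numbers here are made-up stand-ins, not the Kaggle data; the notebook itself uses NumPy/SciPy:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit: returns (slope, intercept) for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - m * mean_x
    return m, b

# Hypothetical daily item counts (day index, items sold) -- illustration only.
days = [0, 1, 2, 3, 4]
sales = [110, 125, 118, 140, 152]
m, b = linear_fit(days, sales)
print('slope=%.1f intercept=%.1f' % (m, b))  # slope=9.9 intercept=109.2
```

With NumPy the same fit is a one-liner, numpy.polyfit(days, sales, 1), which is closer to what the notebook does.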


Plotly also provides Chart Studio Online Chart Maker. Plotly describes
Chart Studio as the world’s most sophisticated editor for creating
D3.js and WebGL charts. Shown below, we have the ability to
enhance, stylize, and share our bakery data visualization using the
free version of Chart Studio Cloud.

nbviewer
Notebooks can also be viewed using Jupyter nbviewer, hosted on Rackspace. Below, we see the output of a cell from this project’s notebook document, showing a BokehJS chart, using nbviewer. You can view this project’s actual notebook document, using nbviewer, here.

Monitoring Spark Jobs


The Jupyter Docker container exposes Spark’s monitoring and instrumentation web user interface. We can observe each Spark job in great detail.


We can review details of each stage of the Spark job, including a visualization of the DAG, which Spark constructs as part of the job execution plan, using the DAG Scheduler.

We can also review the timing of each event, occurring as part of the
stages of the Spark job.


We can also use the Spark interface to review and confirm the
runtime environment, including versions of Java, Scala, and Spark, as
well as packages available on the Java classpath.

Spark Performance
Spark, running on a single node within the Jupyter container on your development system, is not a substitute for a full Spark cluster, running on bare metal or robust virtualized hardware, with YARN, Mesos, or Kubernetes. In my opinion, you should adjust Docker to support an acceptable performance profile for the stack, running only a modest workload. You are not trying to replace the need to run real jobs on a production Spark cluster.

We can use the docker stats command to examine the container’s CPU and memory metrics:

docker stats \
  --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

Below, we see the stats from the stack’s three containers immediately after being deployed, showing little or no activity. Here, Docker has been allocated 2 CPUs, 3 GB of RAM, and 2 GB of swap space from the host machine.


Compare the stats above with the same three containers, while the
example notebook document is running on Spark. The CPU shows a
spike, but memory usage appears to be within acceptable ranges.

Linux top and htop


Another option to examine container performance metrics is with top. We can use the docker exec command to execute the top command within the Jupyter container, sorting processes by CPU usage:

docker exec -it \
  $(docker ps | grep _pyspark | awk '{print $NF}') \
  top -o %CPU


With top , we can observe the individual performance of each process running in the Jupyter container.

Lastly, htop, an interactive process viewer for Unix, can be installed into the container and run with the following set of bash commands, from a Jupyter Terminal window or using docker exec :

docker exec -it \
  $(docker ps | grep _pyspark | awk '{print $NF}') \
  sh -c "apt-get update && apt-get install htop && htop --sort-key PERCENT_CPU"

With htop , we can observe individual CPU activity. The two CPUs at the top left of the htop window are the two CPUs assigned to Docker. We get insight into the way Docker is using each CPU, as well as other basic performance metrics, like memory and swap.


Assuming your development machine host has them available, it is easy to allocate more compute resources to Docker if required. However, in my opinion, this stack is optimized for development and learning, using reasonably sized datasets for data analysis and ML. It should not be necessary to allocate excessive resources to Docker, possibly starving your host machine of its own compute capabilities.


Conclusion
In this brief post, we have seen how easy it is to get started learning
and developing applications for big data analytics, using Python,
Spark, and PySpark, thanks to the Jupyter Docker Stacks. We could
use the same stack to learn and develop for machine learning, using
Python, Scala, and R. Extending the stack’s capabilities is as simple as
swapping out this Jupyter image for another, with a different set of
tools, as well as adding additional containers to the stack, such as
Apache Kafka or Apache Cassandra.

Originally published at programmaticponderings.com on November 20, 2018. The post and project code were updated on June 7, 2019.

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.


Apache Spark Pyspark Python Docker Jupyter Notebook
