Deployment of Standalone Spark Cluster on Kubernetes
Spark is a well-known engine for processing big data. As a first step in learning Spark, I will deploy a Spark cluster on Kubernetes on my local machine.
A Standalone Spark cluster consists of a master node and several worker nodes. There are several ways to deploy a Spark cluster. I prefer Kubernetes because it is a super convenient way to deploy and manage containerized applications.
In this post, I will deploy a Standalone Spark cluster on a single-node Kubernetes cluster in Minikube. The Spark master and workers are containerized applications in Kubernetes. However, in this case, the cluster manager is not Kubernetes. It is Standalone, a simple cluster manager included with Spark that makes it easy to set up a cluster.
1. Spark cluster overview
Apache Spark is a unified analytics engine for large-scale data processing. It achieves high performance for both batch and streaming data and offers high-level operations that can be used interactively from Scala, Python, R and SQL.
This part gives a short overview of how Spark runs on a cluster. Several components are involved when a Spark application is launched.
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for the application. Next, it sends the application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
More detail is available at: https://spark.apache.org/docs/latest/cluster-overview.html
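For a concrete picture of this flow, here is a minimal pyspark sketch of a driver connecting to a standalone cluster manager. The master URL spark://spark-master:7077 is a placeholder that happens to match the service created later in this post.
from pyspark import SparkConf, SparkContext

# Point the driver at the standalone cluster manager (placeholder URL)
conf = SparkConf() \
    .setAppName("cluster-overview-demo") \
    .setMaster("spark://spark-master:7077")
sc = SparkContext(conf=conf)

# The cluster manager allocates executors, which run the tasks of this job
print(sc.parallelize(range(100)).sum())
sc.stop()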
2. Spark on Kubernetes
In this post, the Spark master and workers run as containerized applications in Kubernetes. They are deployed in Pods and accessed via Service objects.
2.1. Prerequisites
The Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs.
Minikube: a tool that runs a single-node Kubernetes cluster in a virtual machine on your personal computer.
Docker: a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and deploy it as one package.
Minikube can be installed by following the instructions here. The Spark documentation notes that the default minikube configuration is not enough for running Spark applications and recommends 3 CPUs and 4 GB of memory to be able to start a simple Spark application with a single executor.
Start minikube with the memory and CPU options
$ minikube start --driver=virtualbox --memory 8192 --cpus 4
2.2. Some concepts
Pods are the smallest deployable units of computing that can be created and managed in Kubernetes. A Pod (as in a pod of whales or pea pod) is a group of one or more containers (such as Docker containers), with shared storage/network, and a specification for how to run the containers.
A ReplicationController ensures that a specified number of pod replicas are running at any one time. In other words, a ReplicationController makes sure that a pod or a homogeneous set of pods is always up and available.
Service is an abstraction which defines a logical set of Pods and a policy by which to access them (sometimes this pattern is called a micro-service).
3. Docker images
All the source code is at: https://github.com/KienMN/Standalone-Spark-on-Kubernetes
3.1. Spark container
The first step is to build a Docker image for the Spark master and workers. The source code along with the Dockerfile is here: https://github.com/KienMN/Standalone-Spark-on-Kubernetes/tree/master/images/spark-standalone
We build the image and push it to Docker Hub (or any Docker registry):
$ docker build . -t {dockerhub-username}/{image-name}:{image-tag}
$ docker push {dockerhub-username}/{image-name}:{image-tag}
3.2. Spark UI proxy
Spark comes with its own web UI. However, accessing the web UI of the Spark master, workers or drivers can be burdensome due to the complexity of the network configuration behind Kubernetes.
Spark UI Proxy is a solution that reduces the burden of accessing the web UIs of Spark components running in different pods.
The Dockerfile is available here: https://github.com/KienMN/Standalone-Spark-on-Kubernetes/tree/master/images/spark-ui-proxy
Use the same commands as above to build and push the image to Docker Hub (or any Docker registry).
4. Deploy Spark cluster
4.1. Deploy Spark master
Deploy the Spark master with the controller.yaml file.
kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-master-controller
spec:
  replicas: 1
  selector:
    component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      hostname: spark-master-hostname
      subdomain: spark-master-headless
      containers:
        - name: spark-master
          image: kienmn97/test_spark_kubernetes:2.4.5
          imagePullPolicy: Always
          command: ["/start-master"]
          ports:
            - containerPort: 7077
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
I will deploy 1 pod for the Spark master and expose port 7077 (for the service to listen on) and port 8080 (for the web UI). I also specify a selector to be used by the Service.
$ kubectl create -f spark/spark-master/controller.yaml
For the Spark master node to be discoverable by the Spark worker nodes, we'll also need to create a headless service. The configuration is in the service.yaml file.
kind: Service
apiVersion: v1
metadata:
  name: spark-master-headless
spec:
  ports:
  clusterIP: None
  selector:
    component: spark-master
---
kind: Service
apiVersion: v1
metadata:
  name: spark-master
spec:
  ports:
    - port: 7077
      targetPort: 7077
      name: spark
    - port: 8080
      targetPort: 8080
      name: http
  selector:
    component: spark-master
Create these services.
$ kubectl create -f spark/spark-master/service.yaml
4.2. Deploy Spark worker
Deploy the Spark worker with the configuration in the controller.yaml file.
$ kubectl create -f spark/spark-worker/controller.yaml
4.3. Start Spark UI proxy
Finally, create a deployment and a service for the Spark UI Proxy to allow easy access to the Spark web UI.
$ kubectl create -f spark/spark-ui-proxy/deployment.yaml
$ kubectl create -f spark/spark-ui-proxy/service.yaml
5. Installation confirmation
5.1. Kubernetes management
Check the pods and services via kubectl commands:
$ kubectl get pod
NAME                              READY   STATUS    RESTARTS   AGE
spark-master-controller-ggzvf     1/1     Running   0          67s
spark-ui-proxy-controller-7fs7w   1/1     Running   0          13s
spark-worker-controller-lzb59     1/1     Running   0          28s
spark-worker-controller-tdj9k     1/1     Running   0          28s

$ kubectl get svc
NAME                    TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
spark-master            ClusterIP      10.109.128.189   <none>        7077/TCP,8080/TCP   5m2s
spark-master-headless   ClusterIP      None             <none>        <none>              5m2s
spark-ui-proxy          LoadBalancer   10.97.241.116    <pending>     80:31436/TCP        4m13s
5.2. Web UI
Check the IP address of minikube with the command:
$ minikube ip
192.168.99.100
Open a web browser and access the address 192.168.99.100:31436, in which 31436 is the NodePort of the Spark UI Proxy service.
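Alternatively, minikube can print the URL of a service directly; this assumes the service is named spark-ui-proxy as above.
$ minikube service spark-ui-proxy --url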
5.3. Test a program in the pyspark command line
Access the master pod and start pyspark with these commands.
$ kubectl exec -it spark-master-controller-ggzvf /bin/bash
root@spark-master-hostname:/# pyspark
However, you may get this error:
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
...
An easy solution is to set SPARK_DIST_CLASSPATH using Hadoop's classpath command.
root@spark-master-hostname:/# export SPARK_DIST_CLASSPATH=$(hadoop classpath)
root@spark-master-hostname:/# pyspark
Try a simple word-count program:
# Input text, split across two lines for readability
words = 'the quick brown fox jumps over the ' \
        'lazy dog the quick brown fox jumps over the lazy dog'
seq = words.split()

# sc is the SparkContext created automatically by pyspark
data = sc.parallelize(seq)

# Count the occurrences of each word
counts = data.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()
dict(counts)
sc.stop()
The following result will be shown:
{'lazy': 2, 'fox': 2, 'jumps': 2, 'over': 2, 'the': 4, 'brown': 2, 'quick': 2, 'dog': 2}
5.4. Test with Jupyter
If there is a JupyterHub or notebook server in the Kubernetes cluster, open a notebook and start coding. You will need to connect to the Spark master and set the driver host to the notebook pod's address so that the application can run properly. Note that the environment variables SPARK_MASTER_SERVICE_HOST and SPARK_MASTER_SERVICE_PORT are created by Kubernetes for the spark-master service.
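As a rough sketch (the exact setup depends on your notebook image), the connection code in a notebook could look like the following; obtaining the driver address from the pod via socket is an assumption, not part of the original setup.
import os
import socket
from pyspark import SparkConf, SparkContext

# Address of the spark-master service, injected by Kubernetes as env variables
master_url = "spark://{}:{}".format(
    os.environ["SPARK_MASTER_SERVICE_HOST"],
    os.environ["SPARK_MASTER_SERVICE_PORT"])

# The driver runs inside the notebook pod, so executors must be able to reach it;
# here the pod's own IP address is used as the driver host (assumption)
conf = SparkConf() \
    .setAppName("notebook-test") \
    .setMaster(master_url) \
    .set("spark.driver.host", socket.gethostbyname(socket.gethostname()))

sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000)).count())
sc.stop()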
You can click the name of the application in the master's web UI to see the Spark application UI.
References
- Spark Cluster mode overview: https://spark.apache.org/docs/latest/cluster-overview.html
- Running Spark on Kubernetes: https://spark.apache.org/docs/latest/running-on-kubernetes.html
- Deploying Spark on Kubernetes: https://testdriven.io/blog/deploying-spark-on-kubernetes/
- Complete guide to deploy Spark on Kubernetes: https://developer.sh/posts/spark-kubernetes-guide
- Error to start pre-built spark-master when slf4j is not installed: https://stackoverflow.com/questions/32547832/error-to-start-pre-built-spark-master-when-slf4j-is-not-installed
- Spark UI Proxy: https://github.com/aseigneurin/spark-ui-proxy