Deployment of Standalone Spark Cluster on Kubernetes
Spark is a well-known engine for processing big data. As a first step in learning Spark, I will deploy a Spark cluster on Kubernetes on my local machine.
A Standalone Spark cluster consists of a master node and several worker nodes. There are several ways to deploy a Spark cluster. I prefer Kubernetes because it is a super convenient way to deploy and manage containerized applications.
In this post, I will deploy a Standalone Spark cluster on a single-node Kubernetes cluster in Minikube. The Spark master and workers are containerized applications in Kubernetes. However, in this case, the cluster manager is not Kubernetes. It is Standalone, a simple cluster manager included with Spark that makes it easy to set up a cluster.
1. Spark cluster overview
Apache Spark is a unified analytics engine for large-scale data processing. It achieves high performance for both batch and streaming data and offers high-level operations that can be used interactively from Scala, Python, R and SQL.
This part gives a short overview of how Spark runs on a cluster. Several components are involved when a Spark application is launched.
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for the application. Next, it sends the application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
More detail is available at: https://spark.apache.org/docs/latest/cluster-overview.html
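For a concrete picture of this flow, here is a minimal pyspark sketch of a driver connecting to a standalone cluster manager. The master URL spark://spark-master:7077 is a placeholder that happens to match the service created later in this post.
from pyspark import SparkConf, SparkContext

# Point the driver at the standalone cluster manager (placeholder URL)
conf = SparkConf() \
    .setAppName("cluster-overview-demo") \
    .setMaster("spark://spark-master:7077")
sc = SparkContext(conf=conf)

# The cluster manager allocates executors, which run the tasks of this job
print(sc.parallelize(range(100)).sum())
sc.stop()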
2. Spark on Kubernetes
In this post, the Spark master and workers run as containerized applications in Kubernetes. They are deployed in Pods and accessed via Service objects.
2.1. Prerequisites
The Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs.
Minikube: a tool that runs a single-node Kubernetes cluster in a virtual machine on your personal computer.
Docker: a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and deploy it as one package.
Minikube can be installed by following the instructions here. The Spark documentation notes that the default minikube configuration is not enough for running Spark applications and recommends 3 CPUs and 4 GB of memory to be able to start a simple Spark application with a single executor.
Start minikube with the memory and CPU options
$ minikube start --driver=virtualbox --memory 8192 --cpus 4
2.2. Some concepts
Pods are the smallest deployable units of computing that can be created and managed in Kubernetes. A Pod (as in a pod of whales or pea pod) is a group of one or more containers (such as Docker containers), with shared storage/network, and a specification for how to run the containers.
A ReplicationController ensures that a specified number of pod replicas are running at any one time. In other words, a ReplicationController makes sure that a pod or a homogeneous set of pods is always up and available.
Service is an abstraction which defines a logical set of Pods and a policy by which to access them (sometimes this pattern is called a micro-service).
3. Docker images
All the source code is at: https://github.com/KienMN/Standalone-Spark-on-Kubernetes
3.1. Spark container
The first step is to build a Docker image for the Spark master and workers. The source code along with the Dockerfile is here: https://github.com/KienMN/Standalone-Spark-on-Kubernetes/tree/master/images/spark-standalone
We build the image and push it to Docker Hub (or any Docker registry):
$ docker build . -t {dockerhub-username}/{image-name}:{image-tag}
$ docker push {dockerhub-username}/{image-name}:{image-tag}
3.2. Spark UI proxy
Spark comes with its own web UI. However, accessing the web UI of the Spark master, workers or drivers can be burdensome due to the complexity of the network configuration behind Kubernetes.
Spark UI Proxy is a solution that reduces the burden of accessing the web UIs of Spark components running in different pods.
The Dockerfile is available here: https://github.com/KienMN/Standalone-Spark-on-Kubernetes/tree/master/images/spark-ui-proxy
Use the same commands as above to build and push the image to Docker Hub (or any Docker registry).
4. Deploy Spark cluster
4.1. Deploy Spark master
Deploy the Spark master with the controller.yaml file.
kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-master-controller
spec:
  replicas: 1
  selector:
    component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      hostname: spark-master-hostname
      subdomain: spark-master-headless
      containers:
        - name: spark-master
          image: kienmn97/test_spark_kubernetes:2.4.5
          imagePullPolicy: Always
          command: ["/start-master"]
          ports:
            - containerPort: 7077
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
I will deploy 1 pod for the Spark master and expose port 7077 (for the service to listen on) and port 8080 (for the web UI). I also specify a selector to be used by the Service.
$ kubectl create -f spark/spark-master/controller.yaml
For the Spark master node to be discoverable by the Spark worker nodes, we'll also need to create a headless service. The configuration is in the service.yaml file.
kind: Service
apiVersion: v1
metadata:
  name: spark-master-headless
spec:
  ports:
  clusterIP: None
  selector:
    component: spark-master
---
kind: Service
apiVersion: v1
metadata:
  name: spark-master
spec:
  ports:
    - port: 7077
      targetPort: 7077
      name: spark
    - port: 8080
      targetPort: 8080
      name: http
  selector:
    component: spark-master
Create these services.
$ kubectl create -f spark/spark-master/service.yaml
4.2. Deploy Spark worker
Deploy the Spark worker with the configuration in the controller.yaml file.
$ kubectl create -f spark/spark-worker/controller.yaml
4.3. Start Spark UI proxy
Finally, create a deployment and a service for the Spark UI Proxy to allow easy access to the Spark web UI.
$ kubectl create -f spark/spark-ui-proxy/deployment.yaml
$ kubectl create -f spark/spark-ui-proxy/service.yaml
5. Installation confirmation
5.1. Kubernetes management
Check the pods and services via kubectl commands:
$ kubectl get pod
NAME                              READY   STATUS    RESTARTS   AGE
spark-master-controller-ggzvf     1/1     Running   0          67s
spark-ui-proxy-controller-7fs7w   1/1     Running   0          13s
spark-worker-controller-lzb59     1/1     Running   0          28s
spark-worker-controller-tdj9k     1/1     Running   0          28s

$ kubectl get svc
NAME                    TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
spark-master            ClusterIP      10.109.128.189   <none>        7077/TCP,8080/TCP   5m2s
spark-master-headless   ClusterIP      None             <none>        <none>              5m2s
spark-ui-proxy          LoadBalancer   10.97.241.116    <pending>     80:31436/TCP        4m13s
5.2. Web UI
Check the IP address of minikube with the command:
$ minikube ip
192.168.99.100
Open a web browser and access the address 192.168.99.100:31436, in which 31436 is the NodePort of the Spark UI Proxy service.
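Alternatively, minikube can print the URL of a service directly; this assumes the service is named spark-ui-proxy as above.
$ minikube service spark-ui-proxy --url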
5.3. Test a program in the pyspark command line
Access the master pod and start pyspark with these commands.
$ kubectl exec -it spark-master-controller-ggzvf /bin/bash
root@spark-master-hostname:/# pyspark
However, you may get this error:
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
...
An easy solution is to set SPARK_DIST_CLASSPATH using Hadoop's classpath command.
root@spark-master-hostname:/# export SPARK_DIST_CLASSPATH=$(hadoop classpath)
root@spark-master-hostname:/# pyspark
Try a simple word-count program:
# Input text, split across two lines for readability
words = 'the quick brown fox jumps over the ' \
        'lazy dog the quick brown fox jumps over the lazy dog'
seq = words.split()

# sc is the SparkContext created automatically by pyspark
data = sc.parallelize(seq)

# Count the occurrences of each word
counts = data.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()
dict(counts)
sc.stop()
The following result will be shown:
{'lazy': 2, 'fox': 2, 'jumps': 2, 'over': 2, 'the': 4, 'brown': 2, 'quick': 2, 'dog': 2}
5.4. Test with Jupyter
If there is a JupyterHub or notebook server in the Kubernetes cluster, open a notebook and start coding. You will need to connect to the Spark master and set the driver host to the notebook pod's address so that the application can run properly. Note that the environment variables SPARK_MASTER_SERVICE_HOST and SPARK_MASTER_SERVICE_PORT are created by Kubernetes for the spark-master service.
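As a rough sketch (the exact setup depends on your notebook image), the connection code in a notebook could look like the following; obtaining the driver address from the pod via socket is an assumption, not part of the original setup.
import os
import socket
from pyspark import SparkConf, SparkContext

# Address of the spark-master service, injected by Kubernetes as env variables
master_url = "spark://{}:{}".format(
    os.environ["SPARK_MASTER_SERVICE_HOST"],
    os.environ["SPARK_MASTER_SERVICE_PORT"])

# The driver runs inside the notebook pod, so executors must be able to reach it;
# here the pod's own IP address is used as the driver host (assumption)
conf = SparkConf() \
    .setAppName("notebook-test") \
    .setMaster(master_url) \
    .set("spark.driver.host", socket.gethostbyname(socket.gethostname()))

sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000)).count())
sc.stop()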
You can click the name of the application in the master's web UI to see the Spark application UI.
References
- Spark Cluster mode overview: https://spark.apache.org/docs/latest/cluster-overview.html
- Running Spark on Kubernetes: https://spark.apache.org/docs/latest/running-on-kubernetes.html
- Deploying Spark on Kubernetes: https://testdriven.io/blog/deploying-spark-on-kubernetes/
- Complete guide to deploy Spark on Kubernetes: https://developer.sh/posts/spark-kubernetes-guide
- Error to start pre-built spark-master when slf4j is not installed: https://stackoverflow.com/questions/32547832/error-to-start-pre-built-spark-master-when-slf4j-is-not-installed
- Spark UI Proxy: https://github.com/aseigneurin/spark-ui-proxy