Handling errors while deploying a Kubernetes cluster on a VM cluster with Calico network
My labmate (a senior colleague) and I are installing Kubernetes for research, and we decided to install it first on a cluster of VirtualBox machines.
VirtualBox machine cluster
First of all, we created a cluster in the VirtualBox application. It includes a master node and 2 slave nodes, all running Ubuntu 18.04. Each machine uses one network interface for the local connection between nodes and another to access the Internet. My labmate helped me create the cluster, so I do not know the details well.
Install kubeadm toolbox
The official Kubernetes website provides a comprehensive guide to installing kubeadm (see the first entry in References). Hence, I will only mention some of the steps and the errors I encountered.
To eliminate as many errors as possible, I recommend carefully checking the requirements listed on the website. Also, verify the connections among the nodes, the connection between each node and the internet, and the certificates.
Install kubeadm, kubelet and kubectl
We will install these packages on all of the machines:
- kubeadm: the command to bootstrap the cluster.
- kubelet: the component that runs on all of the machines in your cluster and does things like starting pods and containers.
- kubectl: the command line util to talk to your cluster.
- kubernetes-cni: Kubernetes uses CNI (Container Network Interface) as an interface between network providers and Kubernetes networking.
No valid OpenPGP data found
We start by adding the repository signing key with the command:
$ curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
Sometimes, this command shows the error:
gpg: no valid OpenPGP data found
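In that case, we can use wget with --no-check-certificate to download the apt-key.gpg file and then add the key file separately. Note that --no-check-certificate disables TLS certificate verification, so treat it as a workaround only.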
$ wget --no-check-certificate https://packages.cloud.google.com/apt/doc/apt-key.gpg
$ sudo apt-key add apt-key.gpg
Update repositories issue
We need to add software repositories (sources) to download and install necessary packages for the computers.
$ cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
$ sudo apt-get update
$ sudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni
However, because the address is redirected, errors can occur when running the update command. For example:
W: The repository 'https://apt.kubernetes.io kubernetes-xenial Release' does not have a Release file
Instead, we can specify the exact address as follows.
$ cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb http://packages.cloud.google.com/apt/ kubernetes-xenial main
EOF
$ sudo apt-get update
$ sudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni
Start the cluster
Now we can use the kubeadm command to initialize our cluster:
$ sudo kubeadm init
The preflight checks may report some errors. Some people recommend skipping them by adding the flag --ignore-preflight-errors=<list-of-errors>. However, I strongly recommend not ignoring preflight errors: any one of them can eventually keep the Kubernetes cluster from working properly, and fixing them up front makes later debugging easier.
Permission denied while trying to connect to the Docker daemon socket
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/json: dial unix /var/run/docker.sock: connect: permission denied
The solution is to add the user to the docker group: https://docs.docker.com/engine/install/linux-postinstall/
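As a small sketch of that fix, the snippet below checks whether the current user is already in the docker group and, if not, prints the commands from the Docker post-install guide. The has_docker_group helper is a hypothetical name introduced here for illustration.

```shell
# Return 0 if the space-separated group list in $1 contains "docker".
# (has_docker_group is a hypothetical helper, not part of Docker.)
has_docker_group() {
  case " $1 " in
    *" docker "*) return 0 ;;
    *) return 1 ;;
  esac
}

# id -nG prints the names of all groups the current user belongs to.
if has_docker_group "$(id -nG)"; then
  echo "Current user is already in the docker group."
else
  echo "Not in the docker group yet. Per the Docker post-install guide, run:"
  echo "  sudo groupadd docker            # create the group if it is missing"
  echo "  sudo usermod -aG docker \$USER   # add the current user to it"
  echo "  newgrp docker                   # or log out and back in"
fi
```

The group change only takes effect in a new login session (or after newgrp docker), so re-run kubeadm from a fresh shell.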
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
Just follow the guide linked in the warning to switch Docker to the systemd cgroup driver.
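For Docker, the change amounts to setting native.cgroupdriver=systemd in the daemon configuration. The sketch below writes a minimal config to a scratch file for review first. It assumes you have no existing /etc/docker/daemon.json to merge with, and write_docker_daemon_json is a hypothetical helper name.

```shell
# Write a minimal Docker daemon config that selects the systemd cgroup driver.
# NOTE: this overwrites the target file; merge by hand if you already have one.
# (write_docker_daemon_json is a hypothetical helper name.)
write_docker_daemon_json() {
  target="$1"
  cat > "$target" <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
}

# Generate the file in a scratch location so it can be reviewed before use.
tmpfile="$(mktemp)"
write_docker_daemon_json "$tmpfile"
cat "$tmpfile"
```

After reviewing the file, copy it into place with sudo cp "$tmpfile" /etc/docker/daemon.json and restart Docker with sudo systemctl restart docker. Running docker info should then report the systemd cgroup driver.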
Error Swap: running with swap on is not supported
[ERROR Swap]: running with swap on is not supported. Please disable swap
Use this command to turn off swap
$ sudo swapoff -a
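Note that swapoff -a only lasts until the next reboot. If you prefer to keep swap off permanently, one option is to comment out the swap entries in /etc/fstab. The sketch below does this with sed and demonstrates it on a scratch file; comment_out_swap is a hypothetical helper and the fstab entries are made-up examples.

```shell
# Comment out swap entries in an fstab-style file (defaults to /etc/fstab).
# A backup of the original is kept next to it with a .bak suffix.
# (comment_out_swap is a hypothetical helper name.)
comment_out_swap() {
  fstab="${1:-/etc/fstab}"
  # Prefix with '#' every uncommented line that has "swap" as a field.
  sed -i.bak -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "$fstab"
}

# Demonstration on a scratch file; the two entries are made-up examples.
demo="$(mktemp)"
printf '%s\n' \
  '/dev/sda1 / ext4 defaults 0 1' \
  '/swapfile none swap sw 0 0' > "$demo"
comment_out_swap "$demo"
cat "$demo"
```

On a real machine the same sed command has to be run with sudo against /etc/fstab; check the .bak copy before rebooting.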
x509: certificate signed by unknown authority
Sometimes, you may get this error
Get https://k8s.gcr.io/v2/: x509: certificate signed by unknown authority
A temporary workaround is to let Docker treat the registry as insecure: https://docs.docker.com/registry/insecure/
Other problems come from the network itself. Hence, check the network connections and the certificates, and make sure that all nodes can reach the internet.
Deployment cluster with Calico network
Applying Calico pod network
When kubeadm init finishes properly, the output looks like this:
$ sudo kubeadm init
...
Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 10.244.0.4:6443 --token 1l9bdg.yseidooa0w6q5h66 \
    --discovery-token-ca-cert-hash sha256:36cdd42f3bac72c9a22e2a3d40983af10ddd99e5d3b9a13104ed7dbcd9503da2
Just follow the instructions. After creating the .kube folder and copying the config, we need to select a pod network for the cluster. In this post, I select Calico. To use Calico, we should initialize the cluster with the flag --pod-network-cidr, as in the following example. You can find a quickstart guide on the Calico website.
$ sudo kubeadm init --pod-network-cidr=192.168.0.0/16
...
$ kubectl apply -f calico.yaml
...
Then, SSH into the slave nodes and join them to the cluster:
kube@slave:$ sudo kubeadm join <control-plane-host>:<control-plane-port> --token <token> --discovery-token-ca-cert-hash sha256:<hash>
If everything works fine, running kubectl get nodes on the master node should list every node with the status Ready.
Debugging deployment errors
Sometimes errors occur. Kubernetes provides commands to describe a resource and to view its logs, which can be used to figure out what went wrong.
$ kubectl describe pod <pod-name> -n kube-system
$ kubectl logs <pod-name> -n kube-system
In my case, the calico node could not be created properly and I got this error:
Calico node 'node-name' is already using the IPv4 address X.X.X.X
The reason is that the nodes have several network interfaces, so I have to define the interface on which the nodes can find each other. This can be done by specifying the network interface as the value of the environment variable IP_AUTODETECTION_METHOD in the yaml file of the pod network.
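As a sketch of the same setting applied to a running cluster, the helper below prints a kubectl set env command that pins IP_AUTODETECTION_METHOD to one interface. It assumes Calico was installed from calico.yaml, so that the calico-node daemonset lives in kube-system. The interface name enp0s8 is only an example, and calico_autodetect_cmd is a hypothetical helper name.

```shell
# Build the kubectl command that pins Calico's IPv4 autodetection to one
# interface. (calico_autodetect_cmd is a hypothetical helper; the interface
# name passed to it is an assumption about your VMs.)
calico_autodetect_cmd() {
  iface="$1"
  printf 'kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=%s\n' "$iface"
}

# Print the command for review; replace enp0s8 with the interface that
# connects your nodes to each other.
calico_autodetect_cmd enp0s8
```

Run the printed command on the master node to apply it, or add the same name and value to the env list of the calico-node container in calico.yaml before applying the file.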
Now, the complete commands to initialize the cluster are as follows:
$ sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --apiserver-advertise-address=<address-of-master-node-on-the-interface-defined-on-IP-AUTODETECTION-METHOD>
...
$ kubectl apply -f calico.yaml
...
Finally, follow the instructions (in reference 2 or 4) to complete the deployment.
Reset the cluster
To reset the cluster completely, we need to run kubeadm reset and then clean up some leftover files:
$ sudo kubeadm reset
...
$ # Remove $HOME/.kube directory
...
$ # Remove CNI plugin in /opt/cni/bin and /etc/cni/net.d
...
Restart the cluster
Every time the (physical or virtual) machines restart, the Kubernetes cluster goes down. However, we only need to turn off swap again and restart kubelet to bring the Kubernetes cluster back up.
$ # Turn off swap
$ sudo swapoff -a
$ # Restart kubelet
$ sudo systemctl restart kubelet
References
- Install kubeadm guide: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/
- Creating a single control-plane cluster with kubeadm: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/
- Redirection issue: https://askubuntu.com/questions/1100800/kubernetes-installation-failing-ubuntu-16-04
- Quickstart for Calico on Kubernetes: https://docs.projectcalico.org/getting-started/kubernetes/quickstart
- Calico-node on worker nodes with ‘CrashLoopBackOff’: https://github.com/projectcalico/calico/issues/2720
- Restart issue: https://stackoverflow.com/questions/51375940/kubernetes-master-node-is-down-after-restarting-host-machine