How to use NVIDIA GPUs

In this tutorial you will learn how to set up your STACKIT Kubernetes Engine (SKE) cluster to run NVIDIA GPU workloads. For that, we will use the NVIDIA GPU Operator to install the GPU driver.

Currently, the GPU Operator only supports node pools that use Ubuntu. During the creation of a cluster, choose a node pool using the Ubuntu image and select a machine flavor that supports GPUs. See the list of virtual machine flavors for specific flavor names.

If you already have a cluster, add a new node pool using the Ubuntu image and a GPU machine flavor. Make sure that all GPU node pools use the same Ubuntu image. Once the cluster is created, set up access as described in access a Kubernetes cluster.
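
To verify that access works and that the nodes of your GPU node pool have joined the cluster, you can list the nodes (this assumes kubectl already uses the kubeconfig of your cluster):

Terminal window
kubectl get nodes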

To install the latest version of Helm (read more: Helm Install Documentation), run the following commands:

Terminal window
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
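
You can verify that the Helm client was installed successfully by printing its version:

Terminal window
helm version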

Set up the GPU Operator Helm repository:

Terminal window
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

Install the GPU Operator in the cluster. Possible customization options are listed in the GPU Operator documentation.

Terminal window
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator

To customize the GPU driver version, add --set driver.version=<driver-version> to the above command. See the GPU Operator Component Matrix for a list of supported NVIDIA GPU driver versions.
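
For example, to pin the driver to a specific release, the install command could look like the following sketch (the version number is only illustrative; pick one that the Component Matrix lists for your GPU Operator release):

Terminal window
# 535.129.03 is only an example value; see the GPU Operator Component Matrix
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  --set driver.version=535.129.03 \
  nvidia/gpu-operator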

Wait until the driver installation is complete and all GPU Operator DaemonSets are ready. This will take a few minutes.

Terminal window
kubectl get daemonset -n gpu-operator
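
You can also check the individual pods in the gpu-operator namespace; the installation is finished once all pods are either Running or Completed:

Terminal window
kubectl get pods -n gpu-operator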

To test the driver installation, you can run the following CUDA test container:

Terminal window
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: vectoradd
  name: vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

The nvidia.com/gpu resource limit is required to run containers that use GPUs. Such containers are scheduled to GPU nodes and are granted access to the GPUs.
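
To confirm that the GPUs are advertised to the Kubernetes scheduler, you can inspect the allocatable resources of your nodes, for example with a custom-columns query (one possible way to display the GPU count per node):

Terminal window
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'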

You can check the logs of the vectoradd Pod:

Terminal window
kubectl logs vectoradd

For further tests, you can, for example, deploy the Jupyter Notebook example from NVIDIA.

By default, Kubernetes schedules pods to all available nodes. This means that non-GPU workloads can also be scheduled to GPU nodes. To reserve GPU nodes exclusively for GPU workloads, use the node pool editing screen to add an nvidia.com/gpu taint with effect NoSchedule to the GPU node pool. When using such a taint, make sure that at least one other node pool without a taint exists.
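
Once the node pool has been updated, you can check that the taint is present on the GPU nodes, for example with:

Terminal window
kubectl describe nodes | grep -E "^Name:|^Taints:"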

Then add a corresponding toleration to the pod specification of your GPU workloads:

Terminal window
spec:
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
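
Putting both pieces together, a GPU workload then needs the toleration and the nvidia.com/gpu resource limit in its pod specification. The following is a minimal sketch based on the vectoradd example above (the pod name vectoradd-tolerated is just a placeholder):

Terminal window
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: vectoradd-tolerated   # placeholder name
spec:
  restartPolicy: OnFailure
  # Toleration for the nvidia.com/gpu:NoSchedule taint of the GPU node pool
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1   # required to get access to a GPU
EOF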

During an operating system update, the nodes in a node pool are iteratively replaced with new nodes running an up-to-date operating system. The GPU Operator automatically installs the driver on the new nodes. This takes a few minutes, after which the new nodes are available for GPU workloads.

To minimize disruptions for your workload, you can set a PodDisruptionBudget for your application. This requires that the application runs at least two instances. When using the maxUnavailable setting, you must specify a value of at least 1. A value of 0 is ignored and results in a forceful upgrade.
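
A minimal PodDisruptionBudget could look like the following sketch (the name my-gpu-app-pdb and the app: my-gpu-app label are placeholders; use the labels of your own workload):

Terminal window
cat <<EOF | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-gpu-app-pdb        # placeholder name
spec:
  maxUnavailable: 1           # must be at least 1, see above
  selector:
    matchLabels:
      app: my-gpu-app         # placeholder, match the labels of your workload
EOF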

There is a small risk that the driver installation might fail after an operating system update. Make sure to configure the maintenance window of your cluster such that you are able to monitor updates.