How to use NVIDIA GPUs
In this tutorial you will learn how to set up your STACKIT Kubernetes Engine (SKE) cluster to run NVIDIA GPU workloads. We will use the NVIDIA GPU Operator to install the GPU driver.
Cluster setup
Currently, the GPU Operator only supports node pools that use Ubuntu. During the creation of a cluster, choose a node pool using the Ubuntu image and select a machine flavor that supports GPUs. See the list of virtual machine flavors for specific flavor names.
If you already have a cluster, add a new node pool using the Ubuntu image and a GPU machine flavor. Make sure that all GPU node pools use the same Ubuntu image. Once the cluster is ready, set up access as described in access a Kubernetes cluster.
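Once access is set up, you can verify that the GPU nodes have joined the cluster and are in the Ready state, for example:
kubectl get nodes -o wide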
Operator Installation
To install the latest version of Helm (read more: Helm Install Documentation), run the following commands:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
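To confirm that Helm is available, you can check its version:
helm version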
Set up the GPU Operator repository:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
Install the GPU Operator in the cluster. Possible customization options are listed in the GPU Operator documentation.
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
To customize the GPU driver version, add --set driver.version=<driver-version> to the above command. See the GPU Operator Component Matrix for a list of supported NVIDIA GPU driver versions.
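You can confirm that the release was deployed by listing the Helm releases in the gpu-operator namespace:
helm list -n gpu-operator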
Wait until the driver installation is complete and all GPU Operator DaemonSets are ready. This will take a few minutes.
kubectl get daemonset -n gpu-operator
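Once the DaemonSets are ready, the GPU nodes should advertise the nvidia.com/gpu resource. As a quick check, you can filter the node descriptions for it:
kubectl describe nodes | grep -i "nvidia.com/gpu"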
Verify driver installation
To test the driver installation, you can run the following CUDA test container:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: vectoradd
  name: vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
The nvidia.com/gpu resource limit is required to run containers that use GPUs. Such containers are scheduled to GPU nodes and are granted access to the GPUs.
You can check the logs of the vectoradd Pod:
kubectl logs vectoradd
For further tests, you can, for example, deploy the Jupyter Notebook example from NVIDIA.
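When you are done testing, you can delete the test Pod again:
kubectl delete pod vectoradd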
Optimize Pod scheduling
By default, Kubernetes schedules pods to all available nodes. This means that non-GPU workloads can also be scheduled to GPU nodes. To exclusively use GPU nodes for GPU workloads, use the node pool editing screen to add an nvidia.com/gpu taint with effect NoSchedule to the GPU node pool. When using such a taint, make sure that at least one other node pool without a taint exists.
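To verify that the taint is in place, you can inspect one of the GPU nodes; the node name below is a placeholder:
kubectl describe node <gpu-node-name> | grep Taints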
Then add a corresponding toleration to the Pod specification of the GPU workloads:
spec:
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
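For reference, a complete Pod specification that combines the toleration with the nvidia.com/gpu resource limit could look like the following sketch; the Pod name is a placeholder and the image is simply reused from the verification step above:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload          # placeholder name
spec:
  restartPolicy: OnFailure
  tolerations:
  - effect: NoSchedule        # matches the taint set on the GPU node pool
    key: nvidia.com/gpu
    operator: Exists
  containers:
  - name: cuda-test
    # placeholder image; reuse the CUDA sample from the verification step
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1     # required so the Pod is scheduled to a GPU node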
Operating system updates
During an operating system update, the nodes in a node pool are iteratively replaced with new nodes running an up-to-date operating system. The GPU Operator automatically installs the driver on the new nodes. This takes a few minutes, after which the new nodes are available for GPU workloads.
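If you want to follow the node replacement during an update, you can watch the nodes:
kubectl get nodes -w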
To minimize disruptions for your workload, you can set a PodDisruptionBudget for your application. This requires that the application runs at least two instances. When using the maxUnavailable setting, you must specify a value of at least 1. A value of 0 is ignored and results in a forceful upgrade.
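A minimal PodDisruptionBudget could look like the following sketch; the name and label selector are placeholders and must match your application:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gpu-app-pdb           # placeholder name
spec:
  maxUnavailable: 1           # must be at least 1 (a value of 0 is ignored)
  selector:
    matchLabels:
      app: gpu-app            # placeholder label; must match your workload's Pods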
There is a small risk that the driver installation might fail after an operating system update. Make sure to configure the maintenance window of your cluster such that you are able to monitor updates.