This one is not going to be crazy technical, but I haven’t written in a while and thought it might be helpful to many out there.
After containerizing the GPU workloads in my home lab, I noticed that my GPU-enabled containers would, seemingly at random, start throwing the following error (e.g. when running nvidia-smi):
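For reference, the message usually reported for this issue is "Failed to initialize NVML: Unknown Error". Here is a minimal sketch of a check for it. The helper names and the "gpu-test" container name are made up for illustration.

```shell
# Succeeds if the text on stdin contains the NVML initialization failure.
is_nvml_error() {
  grep -q "Failed to initialize NVML"
}

# Run nvidia-smi inside a container and report whether the GPU is still usable.
# $1: container name (hypothetical)
check_gpu_container() {
  local container="$1" out
  # Capture stdout and stderr; do not abort if nvidia-smi exits non-zero.
  out="$(docker exec "$container" nvidia-smi 2>&1)" || true
  if printf '%s\n' "$out" | is_nvml_error; then
    echo "GPU lost in $container: $out"
    return 1
  fi
  echo "GPU OK in $container"
}

# Usage (assumes a running GPU container named gpu-test):
#   check_gpu_container gpu-test
```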
As it turns out, dozens of people have been reporting the issue over the past couple of years, but there was never a solid solution. It was pointed out that it wasn’t “random” after all: running systemctl daemon-reload instantly triggers the issue, and the container(s) then have to be restarted before the GPU(s) can be used within them again. It all started with Docker 20.10, which added support for cgroup v2 (aka the unified cgroup hierarchy) and, when cgroup v2 is enabled on the operating system, defaults to the systemd cgroup driver instead of cgroupfs. The problem then spread as distros started enabling cgroup v2 by default (e.g. Ubuntu 22.04).
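To check whether a host is in the affected configuration and to reproduce the failure on demand, something along these lines should do. It is only a sketch: the function names and the "gpu-test" container are hypothetical, and it assumes Docker 20.10 or later, where docker info exposes the cgroup driver and version.

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup version.
# $1: output of `stat -fc %T /sys/fs/cgroup`
cgroup_version() {
  case "$1" in
    cgroup2fs) echo "v2" ;;      # unified hierarchy
    tmpfs)     echo "v1" ;;      # legacy hierarchy
    *)         echo "unknown" ;;
  esac
}

# Print the cgroup driver and version Docker is using.
docker_cgroup_info() {
  docker info --format '{{.CgroupDriver}} {{.CgroupVersion}}'
}

# Reproduce the issue against a running GPU container.
# $1: container name (hypothetical)
reproduce_gpu_loss() {
  local container="$1"
  docker exec "$container" nvidia-smi      # should work before the reload
  sudo systemctl daemon-reload             # makes systemd re-evaluate cgroup rules
  docker exec "$container" nvidia-smi \
    || echo "nvidia-smi failed after daemon-reload"
}

# Usage:
#   cgroup_version "$(stat -fc %T /sys/fs/cgroup)"
#   docker_cgroup_info
#   reproduce_gpu_loss gpu-test
```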
Under the hood, the nvidia-container-runtime runc prestart hook injects the GPUs’ userland drivers, char devices and associated cgroup rules through libnvidia-container.
Unfortunately, the hook does all of this “behind the back” of runc, which relies on /dev/char symlinks to resolve the devices, and a systemd reload reverts that work while re-evaluating all of the cgroup rules. Until Docker adds support for the Container Device Interface (CDI) (already available in CRI-O and containerd) and until the nvidia-container-runtime becomes CDI-aware, we need to work around the issue by setting up permanent links to the char devices.
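To see which links are missing on a given host, here is a rough sketch that derives the expected /dev/char/MAJOR:MINOR path for each Nvidia device node. It assumes GNU stat on Linux, and the function names are mine.

```shell
# Print the /dev/char path expected to point at a character device node.
# $1: device node, e.g. /dev/nvidia0
dev_char_path() {
  local node="$1" major minor
  # GNU stat prints the device major (%t) and minor (%T) numbers in hex.
  major=$(printf '%d' "0x$(stat -c '%t' "$node")")
  minor=$(printf '%d' "0x$(stat -c '%T' "$node")")
  echo "/dev/char/${major}:${minor}"
}

# List Nvidia device nodes that have no matching /dev/char entry.
list_missing_nvidia_links() {
  local node link
  for node in /dev/nvidia*; do
    [ -c "$node" ] || continue             # skip directories and unmatched globs
    link="$(dev_char_path "$node")"
    [ -e "$link" ] || echo "missing: $link -> $node"
  done
}

# Usage:
#   list_missing_nvidia_links
```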
On Kubernetes with the Nvidia GPU Operator
The Nvidia GPU Operator v22.9.2, released on Jan 30, 2023, now creates the symlinks from its validator pod. If you are not using the operator, refer to the workaround below.
On standalone Docker with the nvidia-ctk
The Nvidia Container Toolkit v1.12.0, released on Feb 6, 2023, provides a new utility to create the char symlinks automatically:
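The utility in question is the create-dev-char-symlinks subcommand of nvidia-ctk. Below is a small wrapper around it, as a sketch. The wrapper name is hypothetical, and it assumes nvidia-ctk is on the PATH and that sudo is available.

```shell
# Create /dev/char symlinks for all Nvidia device nodes, if nvidia-ctk is installed.
create_nvidia_dev_char_symlinks() {
  if ! command -v nvidia-ctk >/dev/null 2>&1; then
    echo "nvidia-ctk not found; is the Nvidia Container Toolkit (v1.12.0+) installed?" >&2
    return 1
  fi
  sudo nvidia-ctk system create-dev-char-symlinks --create-all
}

# Usage:
#   create_nvidia_dev_char_symlinks
```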
It can be conveniently integrated as a udev rule for persistence across reboots:
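A udev rule for this could look like the sketch below, which runs the symlink command whenever the nvidia driver is added. The rule file name and the /usr/bin/nvidia-ctk path are assumptions, so adjust them to your system.

```shell
# Print a udev rule that recreates the /dev/char symlinks when the
# nvidia driver is loaded. The nvidia-ctk path is an assumption.
nvidia_udev_rule() {
  cat <<'EOF'
# Create /dev/char symlinks to all Nvidia device nodes
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
EOF
}

# Usage (the file name is an assumption):
#   nvidia_udev_rule | sudo tee /etc/udev/rules.d/71-nvidia-dev-char.rules
#   sudo udevadm control --reload-rules
```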
If you’re using the Nvidia GPU Driver Container, you must also specify the --driver-root= option pointing to the directory where the driver and device nodes are created.
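As a sketch, this helper builds the nvidia-ctk arguments with an optional driver root. The helper name is hypothetical, and /run/nvidia/driver in the usage line is only an assumed location, so use whatever path your driver container actually mounts.

```shell
# Print the nvidia-ctk arguments, adding --driver-root only when a root is given.
# $1: optional driver root directory
ctk_symlink_args() {
  local driver_root="${1:-}"
  local args="system create-dev-char-symlinks --create-all"
  if [ -n "$driver_root" ]; then
    args="$args --driver-root=$driver_root"
  fi
  echo "$args"
}

# Usage (driver root path is an assumption):
#   sudo nvidia-ctk $(ctk_symlink_args /run/nvidia/driver)
```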
But...
Unfortunately, for now, we also still need to pass the device nodes explicitly to the containers to prevent the issue from happening upon systemd reloads, which arguably reduces portability:
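With plain docker run, that means one --device flag per node. A minimal sketch follows. The helper name is mine, and the device list and the ubuntu image in the usage lines are assumptions: the nodes present depend on your driver setup and GPU count.

```shell
# Print one --device flag per device node given as argument.
nvidia_device_flags() {
  local node
  for node in "$@"; do
    printf -- '--device=%s:%s\n' "$node" "$node"
  done
}

# Usage (device list and image are assumptions; check ls /dev/nvidia* first):
#   docker run --rm --gpus all \
#     $(nvidia_device_flags /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm \
#         /dev/nvidia-uvm-tools /dev/nvidia-modeset) \
#     ubuntu nvidia-smi
```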