Concepts on DRA Driver for NVIDIA GPUs

Architecture

This repo ships two independent Kubernetes DRA drivers from one codebase:

gpu.nvidia.com — allocates GPUs, MIG slices, and VFIO passthrough.
compute-domain.nvidia.com — allocates IMEX daemons and channels for Multi-Node NVLink.

Both are delivered by the Helm chart in deployments/helm/dra-driver-nvidia-gpu/.

Components

| Binary | Runs as | What it does |
| --- | --- | --- |
| `gpu-kubelet-plugin` | DaemonSet, per node | Publishes GPU / MIG / VFIO `ResourceSlice`s and injects CDI on Prepare. |
| `compute-domain-kubelet-plugin` | Same DaemonSet, per node | Publishes IMEX daemon + channel devices and injects the IMEX mount on Prepare. |
| `compute-domain-controller` | Cluster Deployment | Watches `ComputeDomain` CRs; spawns a per-CD DaemonSet and the matching `ResourceClaimTemplate`s. |
| `compute-domain-daemon` | Per-CD DaemonSet, per node | Wraps and supervises `nvidia-imex`; reports peers. |
| `webhook` | Cluster Deployment | Validates opaque config on `ResourceClaim`s. |

graph TB
 subgraph cluster["Cluster-scoped (Deployment)"]
 controller["compute-domain-controller"]
 webhook["webhook"]
 end
 subgraph node["Each GPU Node (DaemonSet)"]
 gpu["gpu-kubelet-plugin"]
 cdp["compute-domain-kubelet-plugin"]
 end
 subgraph percd["Per ComputeDomain (DaemonSet)"]
 daemon["compute-domain-daemon"]
 end
 controller -->|spawns| percd
 controller -->|creates ResourceClaimTemplates| cdp

GPU request flow

Pod → ResourceClaim with a GpuConfig / MigDeviceConfig / VfioDeviceConfig → webhook validates → scheduler binds a device advertised by the GPU plugin → kubelet calls Prepare → plugin writes a CDI spec → runtime injects the GPU into the container.
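
As a rough illustration of the first step in this flow, the sketch below builds a ResourceClaim that carries an opaque GpuConfig and prints it as JSON. The `resource.k8s.io/v1beta1` API version, the `resource.nvidia.com/v1beta1` config API version, the `sharing.strategy` field, and every object name are assumptions that do not come from this page. Newer Kubernetes releases also nest `deviceClassName` differently, so check the result against your cluster and the driver's API types before relying on it.

```python
import json


def gpu_claim(name: str, request: str = "gpu") -> dict:
    """Build a ResourceClaim dict with an opaque GpuConfig (illustrative only)."""
    return {
        # Assumption: the cluster serves the v1beta1 DRA API.
        "apiVersion": "resource.k8s.io/v1beta1",
        "kind": "ResourceClaim",
        "metadata": {"name": name},
        "spec": {
            "devices": {
                # The scheduler picks a device advertised under this DeviceClass.
                "requests": [
                    {"name": request, "deviceClassName": "gpu.nvidia.com"}
                ],
                # Opaque config is what the webhook validates and the
                # kubelet plugin reads during Prepare.
                "config": [
                    {
                        "requests": [request],
                        "opaque": {
                            "driver": "gpu.nvidia.com",
                            "parameters": {
                                # Assumption: config API version and fields.
                                "apiVersion": "resource.nvidia.com/v1beta1",
                                "kind": "GpuConfig",
                                "sharing": {"strategy": "TimeSlicing"},
                            },
                        },
                    }
                ],
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(gpu_claim("shared-gpu"), indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed document can be piped to `kubectl apply -f -` once the assumed versions and fields have been verified.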

GPU allocation

The DRA Driver for NVIDIA GPUs exposes three types of GPU resources, each suited to different workload requirements. This page explains what they are, how they differ, and how Kubernetes schedules them.

| DeviceClass | Resource type | Use case |
| --- | --- | --- |
| `gpu.nvidia.com` | Full GPU | Exclusive or shared access to a single physical GPU |
| `mig.nvidia.com` | MIG slice | Hardware-isolated partition of a supported GPU |
| `vfio.gpu.nvidia.com` | VFIO passthrough | Raw GPU access for workloads that manage the driver themselves |

Resource types

Full GPUs

A full GPU gives a container exclusive access to a single physical GPU. This is the default allocation mode and requires no additional configuration.
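
A minimal sketch of what "no additional configuration" can look like in practice: a ResourceClaimTemplate that names only the `gpu.nvidia.com` DeviceClass, plus a Pod that references it. The `resource.k8s.io/v1beta1` API version, the object names, and the container image are placeholders chosen for illustration, not values taken from this page.

```python
import json


def full_gpu_bundle(template_name: str, pod_name: str, image: str) -> dict:
    """Return a v1 List holding a claim template and a Pod that uses it."""
    template = {
        # Assumption: the cluster serves the v1beta1 DRA API.
        "apiVersion": "resource.k8s.io/v1beta1",
        "kind": "ResourceClaimTemplate",
        "metadata": {"name": template_name},
        "spec": {
            "spec": {
                "devices": {
                    # No opaque config: a plain request for one full GPU.
                    "requests": [
                        {"name": "gpu", "deviceClassName": "gpu.nvidia.com"}
                    ]
                }
            }
        },
    }
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # The container opts in to the pod-level claim by name.
                    "resources": {"claims": [{"name": "gpu"}]},
                }
            ],
            # One ResourceClaim is generated from the template for this pod.
            "resourceClaims": [
                {"name": "gpu", "resourceClaimTemplateName": template_name}
            ],
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [template, pod]}


if __name__ == "__main__":
    print(json.dumps(full_gpu_bundle("single-gpu", "gpu-pod", "ubuntu:22.04"), indent=2))
```

The same shape should carry over to the other DeviceClasses by changing `deviceClassName`, although MIG and VFIO requests may need extra configuration that this sketch does not cover.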

ComputeDomains

A ComputeDomain is a custom resource that sets up a group of nodes to run a multi-node workload over the NVLink fabric. It enables GPU memory sharing across nodes on hardware that supports Multi-Node NVLink (MNNVL), such as GB200 NVL72 or H100 NVLink configurations.
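
For orientation, here is a hedged sketch of what a ComputeDomain object might look like, built as a Python dict and printed as JSON. The API version `resource.nvidia.com/v1beta1` and the `numNodes` and `channel.resourceClaimTemplate.name` fields are assumptions recalled from the driver's example manifests and may differ between releases. The object and template names are hypothetical.

```python
import json


def compute_domain(name: str, num_nodes: int, channel_template: str) -> dict:
    """Build a ComputeDomain dict (field names are assumptions, see above)."""
    return {
        "apiVersion": "resource.nvidia.com/v1beta1",  # assumption
        "kind": "ComputeDomain",
        "metadata": {"name": name},
        "spec": {
            # Assumption: number of nodes expected to join the domain.
            "numNodes": num_nodes,
            # Assumption: name of the ResourceClaimTemplate the controller
            # creates, which workload pods then reference.
            "channel": {"resourceClaimTemplate": {"name": channel_template}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(compute_domain("demo-domain", 2, "demo-channel"), indent=2))
```

A workload pod would then reference the template name (`demo-channel` here) in its `resourceClaims`, the same way it references any other ResourceClaimTemplate.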

How it works

Creating a ComputeDomain triggers the following sequence:

1. The compute-domain-controller watches for new ComputeDomain resources and creates a per-domain DaemonSet.
2. Each daemon pod in that DaemonSet runs nvidia-imex, which manages the NVLink fabric connection on its node.
3. Each daemon publishes its IP address, clique membership, and readiness via a ComputeDomainClique CR in the driver namespace.
4. The compute-domain-controller also creates a ResourceClaimTemplate per channel, making IMEX channels available for workload pods to claim.
5. When a workload pod claims a channel, the compute-domain-kubelet-plugin injects the IMEX channel device (/dev/nvidia-caps-imex-channels/chan*) and the IMEX socket mount (/imexd) into the container.
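
The last step of the sequence above can be checked from inside a workload container. The sketch below looks for the two paths named in that step. The function name and the `root` parameter are hypothetical, and `root` exists only so the check can be exercised against a scratch directory.

```python
import glob
import os


def imex_injection_status(root: str = "/") -> dict:
    """Report which injected IMEX paths are visible under `root`."""
    # Channel devices injected by the compute-domain-kubelet-plugin.
    pattern = os.path.join(root, "dev", "nvidia-caps-imex-channels", "chan*")
    channels = sorted(glob.glob(pattern))
    # IMEX socket mount; only its presence is checked, not its contents.
    imexd_present = os.path.exists(os.path.join(root, "imexd"))
    return {"channels": channels, "imexd_present": imexd_present}


if __name__ == "__main__":
    status = imex_injection_status()
    print(f"channel devices: {status['channels']}")
    print(f"/imexd present: {status['imexd_present']}")
```

An empty channel list or a missing `/imexd` suggests that the pod did not claim a channel or that Prepare did not complete. It says nothing about whether the IMEX daemons themselves are healthy.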

For the full sequence diagram, see Architecture › ComputeDomain flow.