Concepts
How the driver is put together and the model it exposes to users.
The DRA Driver for NVIDIA GPUs is a Kubernetes Dynamic Resource Allocation (DRA) driver that enables flexible GPU allocation and provisioning of multi-node NVLink fabrics for Kubernetes workloads.
It requires Kubernetes 1.32 or later. Starting in Kubernetes 1.34, DRA is enabled by default. On Kubernetes 1.32 and 1.33, the DynamicResourceAllocation feature gate must be enabled.
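The version rule above can be sketched as a small pre-install check. This is only an illustration of the stated rule: the helper name `needs_dra_feature_gate` and the version-string parsing are hypothetical and are not part of the driver or of Kubernetes tooling.

```python
import re


def needs_dra_feature_gate(kubernetes_version: str) -> bool:
    """Return True when the DynamicResourceAllocation feature gate must be
    enabled explicitly (Kubernetes 1.32 and 1.33), and False when DRA is
    enabled by default (1.34 and later).

    Hypothetical helper: raises ValueError for versions below the driver's
    stated minimum (1.32) or for strings that cannot be parsed.
    """
    match = re.match(r"^v?(\d+)\.(\d+)", kubernetes_version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {kubernetes_version!r}")
    version = (int(match.group(1)), int(match.group(2)))
    if version < (1, 32):
        raise ValueError("the driver requires Kubernetes 1.32 or later")
    return version < (1, 34)


if __name__ == "__main__":
    for v in ("v1.32.0", "1.33", "v1.34.1"):
        print(v, "-> feature gate required:", needs_dra_feature_gate(v))
```

The version string would typically come from the cluster itself, for example the server version reported by `kubectl version`.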
The driver manages two types of resources:
gpu.nvidia.com: allocates full GPUs, Multi-Instance GPU (MIG) slices, and Virtual Function I/O (VFIO) passthrough devices with fine-grained sharing and configuration control.
compute-domain.nvidia.com: provisions ephemeral multi-node NVLink fabrics using IMEX, enabling pods on different nodes to share GPU memory at full NVLink bandwidth.
The Kubernetes device plugin framework treats hardware resources as opaque countable integers. A workload either gets a whole unit or it doesn’t, and there is no way to express sharing, per-workload configuration, capability constraints, or topology requirements within that model.
DRA is a replacement for the device plugin architecture itself, not just a different driver for the same interface. It brings hardware resource management closer to the Persistent Volume model: a workload declares what it needs in a ResourceClaim, and the driver fulfills it. This separation of declaration from consumption enables capabilities that are fundamentally out of scope for the device plugin framework:
Request GPUs, MIG slices, or VFIO passthrough devices in a ResourceClaim, with an optional sharing strategy (time-slicing or Multi-Process Service (MPS)).
Refer to Architecture for how requests are fulfilled.
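As a rough illustration of what such a request might look like, the sketch below builds a ResourceClaim manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f -` accepts. Several details are assumptions rather than facts from this page: the `resource.k8s.io/v1beta1` API version (newer clusters serve other versions with a slightly different request schema), the use of `gpu.nvidia.com` as the device class name, and the helper name `gpu_resource_claim`. Sharing configuration (time-slicing or MPS) is deliberately left out.

```python
import json


def gpu_resource_claim(name: str, count: int = 1,
                       api_version: str = "resource.k8s.io/v1beta1",
                       device_class: str = "gpu.nvidia.com") -> dict:
    """Build a minimal ResourceClaim manifest requesting `count` GPUs.

    Assumptions (verify against your cluster): the v1beta1 request schema
    and a DeviceClass named `gpu.nvidia.com` installed by the driver.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return {
        "apiVersion": api_version,
        "kind": "ResourceClaim",
        "metadata": {"name": name},
        "spec": {
            "devices": {
                "requests": [
                    {
                        "name": "gpu",                   # request name, referenced by the pod
                        "deviceClassName": device_class,  # assumed device class
                        "allocationMode": "ExactCount",
                        "count": count,
                    }
                ]
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(gpu_resource_claim("single-gpu"), indent=2))
```

A pod would then reference the claim by name under `spec.resourceClaims` and list it in the container's `resources.claims`.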
Create a ComputeDomain resource to provision an ephemeral NVLink fabric across nodes.
Workload pods claim a channel from the domain and receive the IMEX device mounts needed for direct cross-node GPU memory access.
The fabric is torn down automatically when the workload finishes.
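The ComputeDomain flow described above might be sketched as follows. The manifest shape here is an assumption and should be checked against the CRD installed in your cluster: the `resource.nvidia.com/v1beta1` API version, the `numNodes` field, and the `channel.resourceClaimTemplate.name` field are not confirmed by this page, and `compute_domain` is a hypothetical helper name.

```python
import json


def compute_domain(name: str, num_nodes: int, claim_template: str,
                   api_version: str = "resource.nvidia.com/v1beta1") -> dict:
    """Build a minimal ComputeDomain manifest.

    Assumed schema (verify against the installed CRD): `spec.numNodes` is the
    number of nodes in the domain, and `spec.channel.resourceClaimTemplate.name`
    names the claim template that workload pods use to claim a channel.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return {
        "apiVersion": api_version,
        "kind": "ComputeDomain",
        "metadata": {"name": name},
        "spec": {
            "numNodes": num_nodes,
            "channel": {
                "resourceClaimTemplate": {"name": claim_template},
            },
        },
    }


if __name__ == "__main__":
    manifest = compute_domain("demo-domain", num_nodes=2,
                              claim_template="demo-domain-channel")
    print(json.dumps(manifest, indent=2))
```

Under these assumptions, each workload pod would reference the named claim template in `spec.resourceClaims` to receive its channel.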
ResourceClaim specs.
Cluster, node, and tooling requirements before installing the driver.
Helm install steps for the DRA driver.
Upgrading the driver between releases.
Diagnosing common failure modes.
Task-oriented walkthroughs for common workflows.
API, feature gates, and Helm values.
Cleanly removing the driver from a cluster.