<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Concepts on DRA Driver for NVIDIA GPUs</title><link>https://deploy-preview-1127--dra-driver-nvidia-gpu.netlify.app/docs/concepts/</link><description>Recent content in Concepts on DRA Driver for NVIDIA GPUs</description><generator>Hugo</generator><language>en</language><atom:link href="https://deploy-preview-1127--dra-driver-nvidia-gpu.netlify.app/docs/concepts/index.xml" rel="self" type="application/rss+xml"/><item><title>Architecture</title><link>https://deploy-preview-1127--dra-driver-nvidia-gpu.netlify.app/docs/concepts/architecture/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-1127--dra-driver-nvidia-gpu.netlify.app/docs/concepts/architecture/</guid><description>&lt;p&gt;This repo ships two independent Kubernetes DRA drivers from one codebase:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;gpu.nvidia.com&lt;/code&gt;&lt;/strong&gt; — allocates GPUs, MIG slices, and VFIO passthrough.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;compute-domain.nvidia.com&lt;/code&gt;&lt;/strong&gt; — allocates IMEX daemons and channels for Multi-Node NVLink.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both are delivered by the Helm chart in &lt;a href="https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/tree/main/deployments/helm/dra-driver-nvidia-gpu"&gt;&lt;code&gt;deployments/helm/dra-driver-nvidia-gpu/&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="components"&gt;Components&lt;/h2&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Binary&lt;/th&gt;
 &lt;th&gt;Runs as&lt;/th&gt;
 &lt;th&gt;What it does&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;gpu-kubelet-plugin&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;DaemonSet, per node&lt;/td&gt;
 &lt;td&gt;Publishes GPU / MIG / VFIO &lt;code&gt;ResourceSlice&lt;/code&gt;s and writes a CDI spec on Prepare so the runtime can inject the device.&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;compute-domain-kubelet-plugin&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Same DaemonSet, per node&lt;/td&gt;
 &lt;td&gt;Publishes IMEX daemon + channel devices and injects the IMEX mount on Prepare.&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;compute-domain-controller&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Cluster Deployment&lt;/td&gt;
 &lt;td&gt;Watches &lt;code&gt;ComputeDomain&lt;/code&gt; CRs; spawns a per-CD DaemonSet and the matching &lt;code&gt;ResourceClaimTemplate&lt;/code&gt;s.&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;compute-domain-daemon&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Per-CD DaemonSet, per node&lt;/td&gt;
 &lt;td&gt;Wraps and supervises &lt;code&gt;nvidia-imex&lt;/code&gt;; reports its IP address, clique membership, and readiness so peers can find it.&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;webhook&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Cluster Deployment&lt;/td&gt;
 &lt;td&gt;Validates opaque config on &lt;code&gt;ResourceClaim&lt;/code&gt;s.&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class="mermaid"&gt;graph TB
 subgraph cluster[&amp;#34;Cluster-scoped (Deployment)&amp;#34;]
 controller[&amp;#34;compute-domain-controller&amp;#34;]
 webhook[&amp;#34;webhook&amp;#34;]
 end
 subgraph node[&amp;#34;Each GPU Node (DaemonSet)&amp;#34;]
 gpu[&amp;#34;gpu-kubelet-plugin&amp;#34;]
 cdp[&amp;#34;compute-domain-kubelet-plugin&amp;#34;]
 end
 subgraph percd[&amp;#34;Per ComputeDomain (DaemonSet)&amp;#34;]
 daemon[&amp;#34;compute-domain-daemon&amp;#34;]
 end
 controller --&amp;gt;|spawns| percd
 controller --&amp;gt;|creates ResourceClaimTemplates| cdp&lt;/pre&gt;
&lt;h2 id="gpu-request-flow"&gt;GPU request flow&lt;/h2&gt;
&lt;p&gt;Pod → &lt;code&gt;ResourceClaim&lt;/code&gt; with a &lt;code&gt;GpuConfig&lt;/code&gt; / &lt;code&gt;MigDeviceConfig&lt;/code&gt; / &lt;code&gt;VfioDeviceConfig&lt;/code&gt; → webhook validates → scheduler binds a device advertised by the GPU plugin → kubelet calls Prepare → plugin writes a CDI spec → runtime injects the GPU into the container.&lt;/p&gt;</description></item><item><title>GPU allocation</title><link>https://deploy-preview-1127--dra-driver-nvidia-gpu.netlify.app/docs/concepts/gpu-allocation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-1127--dra-driver-nvidia-gpu.netlify.app/docs/concepts/gpu-allocation/</guid><description>&lt;p&gt;The DRA Driver for NVIDIA GPUs exposes three types of GPU resources, each suited to different workload requirements. This page explains what they are, how they differ, and how Kubernetes schedules them.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;DeviceClass&lt;/th&gt;
 &lt;th&gt;Resource type&lt;/th&gt;
 &lt;th&gt;Use case&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;gpu.nvidia.com&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Full GPU&lt;/td&gt;
 &lt;td&gt;Exclusive or shared access to a single physical GPU&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;mig.nvidia.com&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;MIG slice&lt;/td&gt;
 &lt;td&gt;Hardware-isolated partition of a supported GPU&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;code&gt;vfio.gpu.nvidia.com&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;VFIO passthrough&lt;/td&gt;
 &lt;td&gt;Raw GPU access for workloads that manage the driver themselves&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
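&lt;p&gt;As a rough illustration of how a workload picks one of these classes, the sketch below requests a single full GPU through a &lt;code&gt;ResourceClaimTemplate&lt;/code&gt;. It assumes a cluster that serves the &lt;code&gt;resource.k8s.io/v1&lt;/code&gt; API; the names &lt;code&gt;single-gpu&lt;/code&gt; and &lt;code&gt;gpu&lt;/code&gt; are hypothetical. Changing &lt;code&gt;deviceClassName&lt;/code&gt; to another class from the table should select a MIG slice or a VFIO device instead, though those classes may need extra selectors or configuration not shown here.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-yaml"&gt;# Sketch only: assumes the resource.k8s.io/v1 API is served by the cluster.
# The object name and request name are hypothetical.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          # One of the DeviceClasses listed in the table above.
          deviceClassName: gpu.nvidia.com
&lt;/code&gt;&lt;/pre&gt;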
&lt;hr&gt;
&lt;h2 id="resource-types"&gt;Resource types&lt;/h2&gt;
&lt;h3 id="full-gpus"&gt;Full GPUs&lt;/h3&gt;
&lt;p&gt;A full GPU gives a container exclusive access to a single physical GPU. This is the default allocation mode and requires no additional configuration.&lt;/p&gt;</description></item><item><title>ComputeDomains</title><link>https://deploy-preview-1127--dra-driver-nvidia-gpu.netlify.app/docs/concepts/compute-domains/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-1127--dra-driver-nvidia-gpu.netlify.app/docs/concepts/compute-domains/</guid><description>&lt;p&gt;A &lt;code&gt;ComputeDomain&lt;/code&gt; is a custom resource that sets up a group of nodes to run a multi-node workload over the NVLink fabric. It enables GPU memory sharing across nodes on hardware that supports Multi-Node NVLink (MNNVL), such as GB200 NVL72 or H100 NVLink configurations.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="how-it-works"&gt;How it works&lt;/h2&gt;
&lt;p&gt;Creating a &lt;code&gt;ComputeDomain&lt;/code&gt; triggers the following sequence:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;compute-domain-controller&lt;/code&gt; watches for new &lt;code&gt;ComputeDomain&lt;/code&gt; resources and creates a per-domain DaemonSet.&lt;/li&gt;
&lt;li&gt;Each daemon pod in that DaemonSet runs &lt;code&gt;nvidia-imex&lt;/code&gt;, which manages the NVLink fabric connection on its node.&lt;/li&gt;
&lt;li&gt;Each daemon publishes its IP address, clique membership, and readiness via a &lt;code&gt;ComputeDomainClique&lt;/code&gt; CR in the driver namespace.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;compute-domain-controller&lt;/code&gt; also creates a &lt;code&gt;ResourceClaimTemplate&lt;/code&gt; per channel, making IMEX channels available for workload pods to claim.&lt;/li&gt;
&lt;li&gt;When a workload pod claims a channel, the &lt;code&gt;compute-domain-kubelet-plugin&lt;/code&gt; injects the IMEX channel device (&lt;code&gt;/dev/nvidia-caps-imex-channels/chan*&lt;/code&gt;) and the IMEX socket mount (&lt;code&gt;/imexd&lt;/code&gt;) into the container.&lt;/li&gt;
&lt;/ol&gt;
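&lt;p&gt;From the user's side, the sequence above might look like the sketch below: a &lt;code&gt;ComputeDomain&lt;/code&gt; that names the channel &lt;code&gt;ResourceClaimTemplate&lt;/code&gt; the controller should create, followed by a pod that claims it. The API group and version (&lt;code&gt;resource.nvidia.com/v1beta1&lt;/code&gt;) and the fields &lt;code&gt;numNodes&lt;/code&gt; and &lt;code&gt;channel.resourceClaimTemplate.name&lt;/code&gt; are assumptions not stated on this page, so check them against the CRD installed in your cluster. All object names and the container image are hypothetical.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-yaml"&gt;# Sketch only: API version and spec field names are assumptions;
# verify them against the installed ComputeDomain CRD.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: example-domain
spec:
  numNodes: 2
  channel:
    resourceClaimTemplate:
      # Name of the template the controller creates (step 4).
      name: example-channel
---
apiVersion: v1
kind: Pod
metadata:
  name: example-workload
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: [sleep, infinity]
    resources:
      claims:
      # Claiming the channel triggers the injection in step 5.
      - name: imex-channel
  resourceClaims:
  - name: imex-channel
    resourceClaimTemplateName: example-channel
&lt;/code&gt;&lt;/pre&gt;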
&lt;p&gt;For the full sequence diagram, see &lt;a href="https://deploy-preview-1127--dra-driver-nvidia-gpu.netlify.app/docs/concepts/architecture/#computedomain-flow"&gt;Architecture › ComputeDomain flow&lt;/a&gt;.&lt;/p&gt;</description></item></channel></rss>