<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Enhancement proposals on DRA Driver for NVIDIA GPUs</title><link>https://deploy-preview-1127--dra-driver-nvidia-gpu.netlify.app/contribute/proposals/</link><description>Recent content in Enhancement proposals on DRA Driver for NVIDIA GPUs</description><generator>Hugo</generator><language>en</language><atom:link href="https://deploy-preview-1127--dra-driver-nvidia-gpu.netlify.app/contribute/proposals/index.xml" rel="self" type="application/rss+xml"/><item><title>NNNN — Template</title><link>https://deploy-preview-1127--dra-driver-nvidia-gpu.netlify.app/contribute/proposals/nnnn-template/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://deploy-preview-1127--dra-driver-nvidia-gpu.netlify.app/contribute/proposals/nnnn-template/</guid><description>&lt;!--
Before writing this, pause and check:

- Does this change really belong in this driver? A common outcome for
 proposals here is "wrong layer" — the change really belonged in upstream
 Kubernetes, in the device plugin, or in the GPU operator. If you're not
 sure, open a discussion issue first and link it here.
- If you can't yet write a one-paragraph release note for this change, you
 aren't ready to propose it.
- Small changes (a flag, a config knob, a local refactor) go straight to
 a PR. See docs/proposals/README.md for when a proposal is actually
 needed.
--&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Field&lt;/th&gt;
 &lt;th&gt;Value&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Status&lt;/td&gt;
 &lt;td&gt;provisional&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Authors&lt;/td&gt;
 &lt;td&gt;@your-handle&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Created&lt;/td&gt;
 &lt;td&gt;YYYY-MM-DD&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Related issues&lt;/td&gt;
 &lt;td&gt;#…&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;!--
One paragraph. Should read like the release note users will see on merge —
write this first. If you can't, you don't understand the change yet.
--&gt;
&lt;h2 id="motivation"&gt;Motivation&lt;/h2&gt;
&lt;h3 id="who-is-asking-for-this-and-why"&gt;Who is asking for this, and why?&lt;/h3&gt;
&lt;!--
Name a concrete workload, a named downstream consumer (for example
KubeVirt, LWS, JobSet, Slurm, DCGM-Exporter), or a specific user class.
Abstract motivation without a named use case is a common reason proposals
stall.
--&gt;
&lt;h3 id="goals"&gt;Goals&lt;/h3&gt;
&lt;!--
Each goal should be measurable enough to imply a test. "How will we know
this has succeeded?" is the question reviewers will ask.
--&gt;
&lt;h3 id="non-goals"&gt;Non-goals&lt;/h3&gt;
&lt;!--
Explicit fence. What are you NOT proposing, even if related? This is where
you prevent scope creep during review.
--&gt;
&lt;h2 id="why-this-belongs-in-the-nvidia-dra-driver"&gt;Why this belongs in the NVIDIA DRA driver&lt;/h2&gt;
&lt;!--
Required. State why this change lives here and not in:
- upstream Kubernetes / DRA core
- the device plugin
- the GPU operator
- a CLI or external tool
If any part of this requires an upstream Kubernetes change, link the
blocking issue or discussion.
--&gt;
&lt;h2 id="proposal"&gt;Proposal&lt;/h2&gt;
&lt;h3 id="user-facing-example"&gt;User-facing example&lt;/h3&gt;
&lt;!--
Required. Show the concrete surface users will touch: a ResourceClaim
manifest, a CRD sample, a CLI invocation, a Helm values snippet. If this
feels verbose or repetitive, that's a signal the UX needs rework before
the internals do.
--&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c"&gt;# example&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="affected-components"&gt;Affected components&lt;/h3&gt;
&lt;!-- Check all that apply. Matches the components dropdown on the feature request issue form. --&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; &lt;code&gt;api/&lt;/code&gt; — CRDs, CRD fields, or ResourceClaim shape&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; &lt;code&gt;gpu-kubelet-plugin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; &lt;code&gt;compute-domain-kubelet-plugin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; &lt;code&gt;compute-domain-controller&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; &lt;code&gt;compute-domain-daemon&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; admission webhook&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Helm chart (&lt;code&gt;deployments/helm&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; CDI spec generation&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Metrics&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Kubelet-plugin checkpoint schema&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Documentation&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; CI / testing&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="authoritative-state-owner"&gt;Authoritative state owner&lt;/h3&gt;
&lt;!--
For any new persistent or runtime state: which single component is
authoritative for writes? How do the others read it? Reviewers will push
back on state split across plugin + controller + daemon without a clear
owner.

If you haven't worked in this codebase before, the top-level README and
the directories under `cmd/` describe what each component does.
--&gt;
&lt;h3 id="smallest-valuable-slice"&gt;Smallest valuable slice&lt;/h3&gt;
&lt;!--
What is the smallest piece that could ship first and still be useful?
Reviewers routinely ask authors to split large PRs, so plan the decomposition now.
--&gt;
&lt;h2 id="design"&gt;Design&lt;/h2&gt;
&lt;h3 id="api-changes"&gt;API changes&lt;/h3&gt;
&lt;!--
List new or changed CRD fields, flags, annotations, Helm values. For each
new field, explicitly label it "user-facing" or "implementation detail."
Default to NOT exposing speculative fields — once users depend on them,
removal is painful.
--&gt;
&lt;h3 id="feature-gate--graduation"&gt;Feature gate &amp;amp; graduation&lt;/h3&gt;
&lt;!--
Does this introduce a feature gate? If yes:
- Target stability at ship time (Alpha / Beta / Stable).
- If not Stable, what does graduation to the next level require? Use the
 project's three criteria (see docs/proposals/README.md): feature
 completeness, interoperability (name the adjacent features), stability
 and soak time.
- Which existing feature gates is this mutually exclusive with, and which
 does it compose with?
If no gate, state what the change is guarded by (flag, config, chart
value) or "none — ships on by default."
--&gt;
&lt;h3 id="upgrade--downgrade"&gt;Upgrade &amp;amp; downgrade&lt;/h3&gt;
&lt;!--
What happens if a cluster upgrades (or rolls back) the driver while claims
are in flight?
- If the kubelet-plugin checkpoint schema changes: use `omitempty` on new
 fields and describe the upgrade test.
- Describe any Helm value migrations, CRD conversions, or RBAC changes.
- Rolling restart of controller replicas — does anything get lost?
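
A minimal Go sketch of the `omitempty` point above, assuming the checkpoint
is serialized with `encoding/json`. The `CheckpointV2` type, its
`MigProfile` field, and the helper names are hypothetical and chosen only
for illustration; they are not the driver's actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CheckpointV2 is a hypothetical checkpoint shape. ClaimUID stands in for
// the fields an older driver already writes; MigProfile is a new optional
// field tagged with omitempty.
type CheckpointV2 struct {
	ClaimUID   string `json:"claimUID"`
	MigProfile string `json:"migProfile,omitempty"`
}

// decodeV2 reads a checkpoint written by either the old or the new driver.
// Keys missing from the input leave the matching field at its zero value.
func decodeV2(data []byte) (*CheckpointV2, error) {
	cp := new(CheckpointV2)
	if err := json.Unmarshal(data, cp); err != nil {
		return nil, err
	}
	return cp, nil
}

// encodeV2 writes a checkpoint. With omitempty, an empty MigProfile is left
// out of the output entirely.
func encodeV2(cp CheckpointV2) (string, error) {
	out, err := json.Marshal(cp)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// Upgrade path: a file in the old shape decodes into the new type.
	cp, err := decodeV2([]byte(`{"claimUID":"abc"}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q %q\n", cp.ClaimUID, cp.MigProfile)

	// Rollback path: with the new field unset, the output keeps the old shape.
	out, err := encodeV2(CheckpointV2{ClaimUID: "abc"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

An upgrade test in this spirit would write a checkpoint in the old shape,
decode it with the new type, and check that re-encoding with the new field
unset reproduces the old shape. Whether an older driver tolerates a file
where the new field is set depends on how it decodes unknown keys, so state
that case separately.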
--&gt;
&lt;h3 id="environment-floor"&gt;Environment floor&lt;/h3&gt;
&lt;!--
- Minimum NVIDIA driver version.
- GPU generations supported (A100 / H100 / B200 / GB200 / L40 / …).
- MIG / NVLink / IMEX / fabric requirements.
- Minimum Kubernetes version.
- Required or assumed kubelet/controller feature gates.
--&gt;
&lt;h3 id="test-plan"&gt;Test plan&lt;/h3&gt;
&lt;!--
Reviewers will not approve without a concrete plan that covers more than
unit tests. Name the specific files/jobs where you can.
- Unit: …
- Integration / controller tests: …
- BATS (`tests/bats`): …
- Mock NVML CI (`hack/ci/mock-nvml`): can this be exercised on CPU-only
 runners? If not, why not?
- Lambda e2e on real GPUs: which `pull-dra-driver-nvidia-gpu-*` job, and
 which scenario (MIG / time-slicing / ComputeDomain / …)?
--&gt;
&lt;h2 id="risks"&gt;Risks&lt;/h2&gt;
&lt;!--
Think broadly. Concerns reviewers consistently raise here:

- Races between kubelet prepare/unprepare and controller/daemon state.
- Force-deleted pods, node reboots, or daemon restarts mid-claim.
- Multi-node coordination for ComputeDomains (SSA mutation cache, DNS
 index collisions, leader election under partition).
- Privilege surface: privileged containers, /dev mounts, NVML access,
 nvidia-container-runtime interactions.
- Fail-fast preservation: does the process still exit non-zero on
 well-defined fatal conditions after this change?
--&gt;
&lt;h2 id="alternatives"&gt;Alternatives&lt;/h2&gt;
&lt;!--
What did you rule out, and why? Include at minimum:

- What you tried with existing primitives (CEL selectors, `matchAttribute`,
 current CRD knobs, Helm values) and why they were insufficient.
- Adjacent prior art in NVIDIA repos — go-nvlib, dra-example-driver,
 sandbox-device-plugin, nvml-mock — and whether any of it can be reused.
- Relevant upstream Kubernetes work, even if only to state "considered,
 not applicable because …".
--&gt;
&lt;h2 id="drawbacks"&gt;Drawbacks&lt;/h2&gt;
&lt;!--
Steelman the case against this change. If you can't find one, the proposal
isn't ready.
--&gt;
&lt;h2 id="open-questions"&gt;Open questions&lt;/h2&gt;
&lt;!--
Anything you want explicit maintainer input on before implementation.
These are the conversations worth having here rather than in code review.
--&gt;</description></item></channel></rss>