ComputeDomains

How compute-domain.nvidia.com provisions ephemeral Multi-Node NVLink fabrics via IMEX.

A ComputeDomain is a custom resource that sets up a group of nodes to run a multi-node workload using NVLink fabric. It is used to enable GPU memory sharing across nodes in hardware that supports Multi-Node NVLink (MNNVL), such as GB200 NVL72 or H100 NVLink configurations.


How it works

Creating a ComputeDomain triggers the following sequence:

  1. The compute-domain-controller watches for new ComputeDomain resources and creates a per-domain DaemonSet.
  2. Each daemon pod in that DaemonSet runs nvidia-imex, which manages the NVLink fabric connection on its node.
  3. Each daemon publishes its IP address, clique membership, and readiness via a ComputeDomainClique CR in the driver namespace.
  4. The compute-domain-controller also creates a ResourceClaimTemplate per channel, making IMEX channels available for workload pods to claim.
  5. When a workload pod claims a channel, the compute-domain-kubelet-plugin injects the IMEX channel device (/dev/nvidia-caps-imex-channels/chan*) and the IMEX socket mount (/imexd) into the container.

For the full sequence diagram, see Architecture › ComputeDomain flow.


Prerequisites

See Prerequisites for hardware and software requirements, including the ComputeDomain-specific requirements for Multi-Node NVLink hardware, GPU Feature Discovery, and nvidia-imex service configuration.


Get started

To create a ComputeDomain and run a workload that claims an IMEX channel, see the ComputeDomain workloads guide.

Last modified May 13, 2026: Add hugo and docsy site (a07c5672)