ComputeDomains
How compute-domain.nvidia.com provisions ephemeral Multi-Node NVLink fabrics via IMEX.
A ComputeDomain is a custom resource that sets up a group of nodes to run a multi-node workload using NVLink fabric. It is used to enable GPU memory sharing across nodes in hardware that supports Multi-Node NVLink (MNNVL), such as GB200 NVL72 or H100 NVLink configurations.
How it works
Creating a ComputeDomain triggers the following sequence:
- The
compute-domain-controllerwatches for newComputeDomainresources and creates a per-domain DaemonSet. - Each daemon pod in that DaemonSet runs
nvidia-imex, which manages the NVLink fabric connection on its node. - Each daemon publishes its IP address, clique membership, and readiness via a
ComputeDomainCliqueCR in the driver namespace. - The
compute-domain-controlleralso creates aResourceClaimTemplateper channel, making IMEX channels available for workload pods to claim. - When a workload pod claims a channel, the
compute-domain-kubelet-plugininjects the IMEX channel device (/dev/nvidia-caps-imex-channels/chan*) and the IMEX socket mount (/imexd) into the container.
For the full sequence diagram, see Architecture › ComputeDomain flow.
Prerequisites
See Prerequisites for hardware and software requirements, including the ComputeDomain-specific requirements for Multi-Node NVLink hardware, GPU Feature Discovery, and nvidia-imex service configuration.
Get started
To create a ComputeDomain and run a workload that claims an IMEX channel, see the ComputeDomain workloads guide.