NCCL Intra/Inter-Node Communication
GPUs, SMs, and CUDA Blocks/Threads
- GPU (Graphics Processing Unit): The entire computing device. In a multi-GPU setup, each GPU is typically a “rank” in the distributed system.
- SM (Streaming Multiprocessor): A core processing unit within a GPU. A modern GPU has many SMs. An SM can execute multiple CUDA warps (groups of 32 threads) concurrently.
- CUDA Block: A group of CUDA threads that execute cooperatively on an SM. All threads within a block can synchronize using `__syncthreads()` (or `bar.sync`, as seen in NCCL) and share data via shared memory.
- CUDA Thread: The smallest unit of execution on the GPU.
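To make this hierarchy concrete, here is a minimal, self-contained CUDA example (illustrative only, not NCCL source): the kernel launch creates a grid of blocks, each block is scheduled onto an SM, and the threads of one block cooperate through shared memory and `__syncthreads()`.

```cuda
// Minimal CUDA sketch of the grid/block/thread hierarchy (not NCCL code).
#include <cstdio>

__global__ void blockSum(const float* in, float* out, int n) {
  __shared__ float partial[256];              // shared memory, visible to one block only
  int tid = threadIdx.x;                      // thread index within the block
  int gid = blockIdx.x * blockDim.x + tid;    // global index across all blocks

  partial[tid] = (gid < n) ? in[gid] : 0.0f;
  __syncthreads();                            // barrier across all threads of this block

  // Tree reduction within the block: threads cooperate via shared memory.
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (tid < stride) partial[tid] += partial[tid + stride];
    __syncthreads();
  }
  if (tid == 0) out[blockIdx.x] = partial[0]; // one partial result per block
}

int main() {
  const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
  float *in, *out;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, blocks * sizeof(float));
  for (int i = 0; i < n; ++i) in[i] = 1.0f;

  blockSum<<<blocks, threads>>>(in, out, n);  // grid of blocks, threads per block
  cudaDeviceSynchronize();

  float total = 0.0f;
  for (int i = 0; i < blocks; ++i) total += out[i];
  printf("sum = %.0f (expected %d)\n", total, n);
  cudaFree(in); cudaFree(out);
  return 0;
}
```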
Is `nworkers` equivalent to SMs receiving individual chunks?
Not directly. A single NCCL kernel launch (which corresponds to one `runRing` execution on a GPU) typically maps to one or more CUDA blocks, and each CUDA block runs on an SM.

NCCL's design often leverages multiple CUDA blocks to achieve high bandwidth for large messages. Each CUDA block can run on a different SM. The `runRing` function (and the `Primitives` class) handles the logic for a single communication channel or slice of the overall operation.
- For a large collective operation (e.g., an All-Reduce on a very large tensor), NCCL will launch multiple CUDA blocks on a single GPU. Each of these blocks might handle a different channel or segment of the data.
- Each of these CUDA blocks will have its own set of `nthreads` and `nworkers` threads.
- So, while `nworkers` refers to threads within a single block, the overall parallelism on a GPU comes from launching multiple blocks (which then run on multiple SMs) to handle different parts of the data simultaneously.
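A rough sketch of that block-per-channel layout, under the assumption that each block simply owns a contiguous slice of the buffer; the kernel name, `NCHANNELS`, and the doubling work are hypothetical stand-ins, not NCCL's actual code.

```cuda
// Sketch: one CUDA block per "channel", each block owning a disjoint slice.
#include <cstdio>

constexpr int NCHANNELS = 4;    // hypothetical: one block per channel
constexpr int NTHREADS  = 256;  // threads per block

__global__ void channelKernel(float* buf, size_t count) {
  // Each block ("channel") owns a contiguous, disjoint slice of the buffer.
  size_t perChannel = (count + gridDim.x - 1) / gridDim.x;
  size_t begin = (size_t)blockIdx.x * perChannel;
  size_t end   = begin + perChannel;
  if (end > count) end = count;

  // Threads of this block stride over this block's slice only.
  for (size_t i = begin + threadIdx.x; i < end; i += blockDim.x) {
    buf[i] *= 2.0f;             // stand-in for the real reduce/copy work
  }
}

int main() {
  const size_t count = 1 << 22;
  float* buf;
  cudaMallocManaged(&buf, count * sizeof(float));
  for (size_t i = 0; i < count; ++i) buf[i] = 1.0f;

  // One block per channel; the blocks land on different SMs and run in parallel.
  channelKernel<<<NCHANNELS, NTHREADS>>>(buf, count);
  cudaDeviceSynchronize();

  printf("buf[0] = %.1f, buf[count-1] = %.1f\n", buf[0], buf[count - 1]);
  cudaFree(buf);
  return 0;
}
```

Because the slices are disjoint, the blocks never need to coordinate with each other inside the kernel, which is what lets each channel proceed independently.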
The idea is that NCCL subdivides the collective into “channels,” and each channel is typically handled by a separate CUDA block (running on its own SM). Within each of those blocks, there’s a distinction between the `nworkers` threads (doing the heavy data lifting) and the remaining threads (handling communication overhead).
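The worker/non-worker split inside a single block could be sketched as follows; the constants and the `postFlag` signal are hypothetical stand-ins rather than NCCL's actual `Primitives` implementation.

```cuda
// Sketch of the worker/helper split inside one CUDA block (hypothetical).
#include <cstdio>

#define NTHREADS 320   // total threads per block (hypothetical value)
#define NWORKERS 256   // threads that move/reduce data (hypothetical value)

__global__ void splitKernel(const float* src, float* dst, size_t count,
                            int* postFlag) {
  int tid = threadIdx.x;
  if (tid < NWORKERS) {
    // Worker threads: heavy data movement for this block's slice.
    for (size_t i = tid; i < count; i += NWORKERS) dst[i] = src[i];
  } else if (tid == NWORKERS) {
    // A non-worker thread: stand-in for bookkeeping/signalling overhead,
    // e.g. telling a peer that a chunk is ready.
    *postFlag = 1;
  }
  __syncthreads();     // every thread in the block rejoins here
}

int main() {
  const size_t count = 1 << 16;
  float *src, *dst; int* flag;
  cudaMallocManaged(&src, count * sizeof(float));
  cudaMallocManaged(&dst, count * sizeof(float));
  cudaMallocManaged(&flag, sizeof(int));
  for (size_t i = 0; i < count; ++i) src[i] = float(i);
  *flag = 0;

  splitKernel<<<1, NTHREADS>>>(src, dst, count, flag);
  cudaDeviceSynchronize();

  printf("dst[42] = %.0f, flag = %d\n", dst[42], *flag);
  cudaFree(src); cudaFree(dst); cudaFree(flag);
  return 0;
}
```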