NCCL Intra/Inter-Node Communication
GPUs, SMs, and CUDA Blocks/Threads
- GPU (Graphics Processing Unit): The entire computing device. In a multi-GPU setup, each GPU is typically a “rank” in the distributed system.
- SM (Streaming Multiprocessor): A core processing unit within a GPU. A modern GPU has many SMs. An SM can execute multiple CUDA warps (groups of 32 threads) concurrently.
- CUDA Block: A group of CUDA threads that execute cooperatively on an SM. All threads within a block can synchronize using `__syncthreads()` (or `bar.sync`, as seen in NCCL) and share data via shared memory.
- CUDA Thread: The smallest unit of execution on the GPU.
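To make this hierarchy concrete, here is a minimal, self-contained CUDA example (illustrative only, not NCCL source): the kernel launch creates a grid of blocks, each block is scheduled onto an SM, and the threads of one block cooperate through shared memory and `__syncthreads()`.

```cuda
// Minimal CUDA sketch of the grid/block/thread hierarchy (not NCCL code).
#include <cstdio>

__global__ void blockSum(const float* in, float* out, int n) {
  __shared__ float partial[256];              // shared memory, visible to one block only
  int tid = threadIdx.x;                      // thread index within the block
  int gid = blockIdx.x * blockDim.x + tid;    // global index across all blocks

  partial[tid] = (gid < n) ? in[gid] : 0.0f;
  __syncthreads();                            // barrier across all threads of this block

  // Tree reduction within the block: threads cooperate via shared memory.
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (tid < stride) partial[tid] += partial[tid + stride];
    __syncthreads();
  }
  if (tid == 0) out[blockIdx.x] = partial[0]; // one partial result per block
}

int main() {
  const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
  float *in, *out;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, blocks * sizeof(float));
  for (int i = 0; i < n; ++i) in[i] = 1.0f;

  blockSum<<<blocks, threads>>>(in, out, n);  // grid of blocks, threads per block
  cudaDeviceSynchronize();

  float total = 0.0f;
  for (int i = 0; i < blocks; ++i) total += out[i];
  printf("sum = %.0f (expected %d)\n", total, n);
  cudaFree(in); cudaFree(out);
  return 0;
}
```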
Is `nworkers` equivalent to SMs receiving individual chunks?
Not directly. A single NCCL kernel launch (which corresponds to one `runRing` execution on a GPU) typically maps to one or more CUDA blocks, and each CUDA block runs on an SM.

NCCL's design often leverages multiple CUDA blocks to achieve high bandwidth for large messages. Each CUDA block can run on a different SM. The `runRing` function (and the `Primitives` class) handles the logic for a single communication channel or slice of the overall operation.
- For a large collective operation (e.g., an All-Reduce on a very large tensor), NCCL will launch multiple CUDA blocks on a single GPU. Each of these blocks might handle a different channel or segment of the data.
- Each of these CUDA blocks will have its own set of `nthreads` and `nworkers` threads.
- So, while `nworkers` refers to threads within a single block, the overall parallelism on a GPU comes from launching multiple blocks (which then run on multiple SMs) to handle different parts of the data simultaneously.
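A rough sketch of that block-per-channel layout, under the assumption that each block simply owns a contiguous slice of the buffer; the kernel name, `NCHANNELS`, and the doubling work are hypothetical stand-ins, not NCCL's actual code.

```cuda
// Sketch: one CUDA block per "channel", each block owning a disjoint slice.
#include <cstdio>

constexpr int NCHANNELS = 4;    // hypothetical: one block per channel
constexpr int NTHREADS  = 256;  // threads per block

__global__ void channelKernel(float* buf, size_t count) {
  // Each block ("channel") owns a contiguous, disjoint slice of the buffer.
  size_t perChannel = (count + gridDim.x - 1) / gridDim.x;
  size_t begin = (size_t)blockIdx.x * perChannel;
  size_t end   = begin + perChannel;
  if (end > count) end = count;

  // Threads of this block stride over this block's slice only.
  for (size_t i = begin + threadIdx.x; i < end; i += blockDim.x) {
    buf[i] *= 2.0f;             // stand-in for the real reduce/copy work
  }
}

int main() {
  const size_t count = 1 << 22;
  float* buf;
  cudaMallocManaged(&buf, count * sizeof(float));
  for (size_t i = 0; i < count; ++i) buf[i] = 1.0f;

  // One block per channel; the blocks land on different SMs and run in parallel.
  channelKernel<<<NCHANNELS, NTHREADS>>>(buf, count);
  cudaDeviceSynchronize();

  printf("buf[0] = %.1f, buf[count-1] = %.1f\n", buf[0], buf[count - 1]);
  cudaFree(buf);
  return 0;
}
```

Because the slices are disjoint, the blocks never need to coordinate with each other inside the kernel, which is what lets each channel proceed independently.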
The idea is that NCCL subdivides the collective into “channels,” and each channel is typically handled by a separate CUDA block (running on its own SM). Within each of those blocks, there’s a distinction between the `nworkers` threads (doing the heavy data lifting) and the remaining threads (handling communication overhead).
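The worker/non-worker split inside a single block could be sketched as follows; the constants and the `postFlag` signal are hypothetical stand-ins rather than NCCL's actual `Primitives` implementation.

```cuda
// Sketch of the worker/helper split inside one CUDA block (hypothetical).
#include <cstdio>

#define NTHREADS 320   // total threads per block (hypothetical value)
#define NWORKERS 256   // threads that move/reduce data (hypothetical value)

__global__ void splitKernel(const float* src, float* dst, size_t count,
                            int* postFlag) {
  int tid = threadIdx.x;
  if (tid < NWORKERS) {
    // Worker threads: heavy data movement for this block's slice.
    for (size_t i = tid; i < count; i += NWORKERS) dst[i] = src[i];
  } else if (tid == NWORKERS) {
    // A non-worker thread: stand-in for bookkeeping/signalling overhead,
    // e.g. telling a peer that a chunk is ready.
    *postFlag = 1;
  }
  __syncthreads();     // every thread in the block rejoins here
}

int main() {
  const size_t count = 1 << 16;
  float *src, *dst; int* flag;
  cudaMallocManaged(&src, count * sizeof(float));
  cudaMallocManaged(&dst, count * sizeof(float));
  cudaMallocManaged(&flag, sizeof(int));
  for (size_t i = 0; i < count; ++i) src[i] = float(i);
  *flag = 0;

  splitKernel<<<1, NTHREADS>>>(src, dst, count, flag);
  cudaDeviceSynchronize();

  printf("dst[42] = %.0f, flag = %d\n", dst[42], *flag);
  cudaFree(src); cudaFree(dst); cudaFree(flag);
  return 0;
}
```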