@jino_rohit: https://x.com/jino_rohit/status/2067620031517860243

X AI KOLs Timeline News

Summary

Explains the communication model for multi-GPU systems, covering the trade-off between latency and bandwidth, and compares MST and Ring algorithms for collective operations like broadcast.

https://t.co/KJST9IHg93
Original Article
View Cached Full Text

Cached at: 06/18/26, 06:20 PM

Collective Communication for Multiple GPUs

As you scale your training or inference across multiple GPUs, you need to find a way to communicate across these GPUs. Libraries like NCCL specialized in providing optimized primitives to communicate across GPUs and these are called collective communication.

Now the type and volume of message shared across the GPUs differ a lot in training vs inference but the primitives remain the same. In this post, we will cover all the important collective primitives you need to start scaling to multiple GPUs!

Communication Model

The total time to send a message across GPUs can be modeled as:

total time=α+nβ,β=1B\text{total time} = \alpha + n\beta, \quad \beta = \frac{1}{B}

where:

  • α (alpha) is the fixed overhead per communication like the time to set up the connection, handshake etc. It does not depend on how much data you send.

  • β (beta) is the time per byte. it is the inverse of bandwidth (β=1/B).

  • n is the message size in bytes.

  • B is the bandwidth of the link (e.g. 900 GB/s for NVLink).

Remember this line: α is a fixed cost you pay once per message, and nβ grows with the message size.

Here’s an example that should help you understand this more clearly. On an NVIDIA H100 with NVLink:

  • α ≈ 10 μs (the setup cost)

  • B ≈ 900 GB/s, so β ≈ 1.1 picoseconds per byte

Now compare two scenarios:

  • Small message (1 KB): α = 10 μs, nβ = 1024 × 1.1 ps ≈ 1.1 ns. The α term is 10,000x larger. The fixed overhead dominates compared to the actual data transfer.

  • Large message (1 GB): α = 10 μs, nβ = 1e9 × 1.1 ps ≈ 1.1 ms. The nβ term is 100x larger. The bandwidth dominates, and the setup cost is negligible.

This tradeoff drives which algorithm you pick. When messages are small, you want to minimize the number of communication rounds (each round costs an α). When messages are large, you want to keep the bus fully utilized by pipelining data - the fixed overhead per round barely matters.

Communication Algorithms

There are two major algorithms to implement the communication algorithms - MST and Ring algorithm.

Minimum Spanning Tree (MST)

MST algorithm is designed to be very efficient for smaller messages, ie: for models that prioritize low latency. It uses a spanning tree structure to enforce minimal rounds of data transfer but does not fully utilize the available bandwidth. It assumes a network where each node can communicate with only one other node at a time.

Don’t worry if this doesn’t make sense yet, we will look at an example comparing both MST and ring and try to understand the tradeoffs shortly.

Ring

In ring algorithm, each GPU is organized in a ring like fashion and communicates only with its two neighbors at once.

  • Message Split - A large message is split into equal N chunks. Each of the P GPUs in the ring gets N/P data.

  • Data Transfer - Each GPU moves its chunk to its next immediate neighbor. Each GPU receives a chunk from the left neighbor and sends a chunk to the right neighbor.

  • Efficiency - Every GPU performs computation simultaneously and bandwidth is always occupied.

Now as promised, lets look at an example where we want to send a message from one GPU to all the other GPUs in the node.

Broadcast in MST

Here we have a setup of 6 GPUs trying to send its data to all the others GPUs in the group.

  • Round 1 - GPU 0 sends to GPU 1 and 2.

  • Round 2 - GPU 1 sends to GPU 3.

  • Round 3 - GPU 2 sends to GPU 4 and 5.

  • Total Rounds of Communication - log₂(6) ≈ 2.585 rounds

Broadcast in Ring

Round 1 - GPU 0 sends to GPU 1. Round 2 - GPU 1 sends to GPU 2. Round 3 - GPU 2 sends to GPU 3. Total Rounds of Communication = N - 1 = 3 rounds

You can see how MST immediately optimizes for latency where the rounds of communication are lesser compared to ring algorithms. Ring algorithms on the other hand focus on improving the bandwidth utilization by chunking its data and always having every single node performing something.

When it comes to ML systems both for training and inference setup, most of the messages are bandwidth bound, because we have to move around massive weights, activations, tensor shards etc.

Cool, now that we have the intuition locked down, it’s time to understand the collective communications.

Broadcast Broadcast sends data from one GPU (the root) to every other GPU in the group. Every GPU ends up with an identical copy of the data. We already saw how this works with MST and ring above.

In training, broadcast is used to distribute model parameters at the start so all replicas start from the same initial state.

Reduce Reduce takes a tensor from every GPU and applies an element-wise operation (sum, mean, min, max) to produce a single result tensor on one root GPU.

This is used to aggregate loss values across GPUs, or to sum gradients before applying the optimizer. The result lives on a single GPU.

Scatter Scatter splits a tensor held by one GPU into equal chunks and sends each chunk to a different GPU. Each GPU ends up with a unique piece of the original data - no two GPUs hold the same chunk.

In data parallelism, scatter is the natural way to distribute different micro-batches of input data across GPUs.

Gather Gather is the inverse of scatter. Every GPU sends its tensor to a single root GPU, which concatenates them into one larger tensor.

You typically use gather to collect model outputs or activations on rank 0 for checkpointing, logging, or evaluation.

Allgather Allgather is like gather followed by broadcast. Every GPU sends its data to all other GPUs, and every GPU ends up with the complete concatenated set of data from all GPUs.

This is critical in tensor parallelism where each GPU holds a shard of a tensor but the full tensor is needed for certain operations like attention. Allgather lets every GPU reconstruct the full tensor without a central bottleneck.

Reduce Scatter Reduce Scatter combines reduce and scatter into a single operation. The tensors from all GPUs are reduced element-wise, and the result is split into chunks with each GPU receiving one chunk.

Allreduce Allreduce is one of the most important collective communication in distributed training. It reduces tensors from all GPUs (sum or mean) and broadcasts the result so every GPU ends up with the exact same reduced tensor.

In DDP, allreduce is the backbone of gradient synchronisation. Every GPU computes local gradients, and allreduce ensures every replica applies the same weight update.

Wrapping Up

With all of this you should be able to understand the different collective primitives and the tradeoffs of using each operation and scale your experiments beyond a single GPU.

The key takeaway is that most ML workloads are bandwidth-bound, which is why ring-based algorithms dominate production systems. MST-based algorithms are useful when latency is the bottleneck and message sizes are small.

In the next couple of blogs, I will focus more on distributed techniques and more production ready libraries like torchtitan.

Similar Articles

https://www.youtube.com/watch?v=aE0onltJlOo

YouTube AI Channels

This lecture introduces the flexible evolution of GPU architecture as a SIMD (vector/array) processor, discusses data parallelism, memory bank grouping, bank conflicts, serial bottlenecks, and the history of SIMD instructions (such as MMX), emphasizing how GPUs leverage data parallelism and deal with serial bottlenecks.

Joing all GPUs to train a community model

Reddit r/LocalLLaMA

A discussion about pooling GPUs from a community to train a massive AI model, questioning the feasibility and existing projects despite known bottlenecks like latency and weight poisoning.

Rewriting model inference with CUDA kernels: the bottleneck was not just GEMM [P]

Reddit r/MachineLearning

Author describes building FlashRT, a CUDA-first inference runtime that rewrites model inference paths with C++/CUDA kernels to address bottlenecks beyond GEMM for small-batch/realtime workloads, achieving significant latency improvements on Jetson Thor and RTX 5090. The article discusses lessons on precision (FP8 helpful, FP4 mixed) and the need to bypass generic runtimes for realtime inference.