@PyTorch: At #PyTorchCon Europe 2026, @ezyang (@Meta) explains why many developers find tensor parallelism difficult to work with…

X AI KOLs Following 05/14/26, 08:16 PM News

pytorch tensor-parallelism spmd pre-compilation distributed-training type-checking pytorchcon

Summary

At PyTorchCon Europe 2026, Edward Yang explains PyTorch's new pre-compilation support for distributed training and an SPMD type system that helps developers write correct tensor parallelism code, addressing common pitfalls in gradient correctness.

At #PyTorchCon Europe 2026, @ezyang (@Meta) explains why many developers find tensor parallelism difficult to work with and how PyTorch is exploring SPMD types to catch mistakes through type checking. Watch the full keynote: https://youtu.be/xvNh5F9t-d4?si=zf4P0RFhNB6tx6zh…

Cached at: 05/15/26, 07:02 AM

TL;DR: At PyTorchCon Europe 2026, Edward Yang, a core PyTorch maintainer, presented two topics: pre-compilation support for torch.compile (addressing slow and inconsistent JIT compilation in distributed training) and the SPMD type system (helping developers write correct tensor parallelism code with correct gradients).

Current State of PyTorch Ecosystem

As of early March 2026, PyTorch’s GitHub stars and contributions continue to grow. Over the past 12 months, the core library received 15,000 contributions and gained 1,250 new contributors. PyTorch is used by 90% of research projects and 90% of open-source projects at top machine learning conferences. Notably, usage of DTensor and DeviceMesh is steadily increasing: over a thousand open-source projects have adopted these features, indicating strong community demand for distributed training optimizations.

Pre-compilation Support

Why Pre-compilation?

Currently, torch.compile compiles just in time (JIT): at the start of every run, each node independently goes through the compilation process. This introduces two problems in distributed training:

Slowness: Every job restart requires recompilation. Cache warm-up takes a long time, and even a cache lookup requires a full trace through Dynamo.
Inconsistent results: Different nodes may take different paths. For example, dynamic shapes are common in recommendation systems, so one node may recompile because it sees a different shape distribution while the other nodes time out waiting for it. Even if the compiler is deterministic, different graphs may lead to different optimization choices, and if communication operations end up reordered inconsistently across nodes, the job can deadlock.
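The timeout scenario can be illustrated with a toy simulation. This is plain Python, not torch.compile: the shape-keyed cache and the run_rank helper are hypothetical stand-ins for the real guard and cache machinery, and the model assumes exactly one compilation per previously unseen input shape.

```python
def run_rank(shapes_seen):
    """Count compilations on one rank (toy model: one per unseen shape)."""
    cache = set()
    compiles = 0
    for shape in shapes_seen:
        if shape not in cache:
            cache.add(shape)   # pretend we compiled a graph for this shape
            compiles += 1
    return compiles

# Two ranks in the same job see different shape distributions.
rank0_shapes = [(8, 128), (8, 128), (8, 128)]
rank1_shapes = [(8, 128), (8, 256), (8, 64)]

compiles = {0: run_rank(rank0_shapes), 1: run_rank(rank1_shapes)}
print(compiles)  # {0: 1, 1: 3}
```

In this sketch rank 1 compiles two more times than rank 0. In a real job, rank 0 would sit at the next collective while rank 1 recompiles, which is how a timeout can arise.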

Pre-compilation Approach

Core idea: Compile once upfront, then distribute the compilation artifacts to all nodes. At runtime, no compilation occurs, ensuring all nodes execute the same code.

Two issues need to be addressed:

Full model capture: To compile ahead of time, we must know everything that needs to be compiled. Providing a full computation graph makes this much simpler.
Compile once, run anywhere: Code must not depend on the specific process rank; it must be written in a generic way.
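The second requirement can be sketched in plain Python, with closures standing in for compilation artifacts. The builder functions below are hypothetical and are not part of any PyTorch API; they only contrast baking the rank in as a constant with taking it as a runtime input.

```python
def build_rank_specialized(rank, shard_size):
    """'Compile' with the rank frozen in as a constant (not portable)."""
    start = rank * shard_size  # fixed at build time
    def step(data):
        return data[start:start + shard_size]
    return step

def build_rank_generic(shard_size):
    """'Compile' once; the rank is supplied at run time (portable)."""
    def step(data, rank):
        start = rank * shard_size
        return data[start:start + shard_size]
    return step

data = list(range(8))
specialized = build_rank_specialized(rank=0, shard_size=4)
generic = build_rank_generic(shard_size=4)

# Shipping rank 0's specialized artifact to rank 1 reads the wrong shard.
wrong_on_rank1 = specialized(data)
right_on_rank1 = generic(data, rank=1)
print(wrong_on_rank1)  # [0, 1, 2, 3]
print(right_on_rank1)  # [4, 5, 6, 7]
```

Only the generic version can be built on one machine and reused unchanged on every node.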

Current Progress

TorchTitan now includes a Graph Trainer, designed specifically for the full model capture workflow. On Llama 3, it has been shown to produce bitwise-equivalent results with matching performance. The new API pre_compile_main accepts the training parameters and dumps the compilation artifacts to a directory. For the actual training run, you only specify the artifact location, and the artifacts are used directly without retracing the model.

Next steps: Make it work for all models in TorchTitan, and support cases where different nodes need different behavior (e.g., FSDP with uneven parameter sharding, or logging on only one node). The relevant PR was merged into TorchTitan just last week.

SPMD Types (Single Program Multiple Data Type System)

What Problem Does It Solve?

In traditional PyTorch, communication operators (e.g., allreduce) are not differentiable: they have no autograd functions. Frameworks like Megatron define custom autograd functions, but these are cryptic to write and use. One example is the f function, which does nothing in the forward pass and performs an allreduce in the backward pass. If you forget to call the correct communication function, you get silently incorrect gradients.
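To see why a missing backward allreduce goes unnoticed, here is a minimal plain-Python simulation. It does not use PyTorch: the two "ranks" are list entries, all_reduce_sum is a hypothetical helper, and the setup (a replicated scalar input x multiplied by a different weight on each rank) is an assumed toy case rather than anything from the talk.

```python
def all_reduce_sum(per_rank):
    """Simulated allreduce: every rank ends up with the global sum."""
    total = sum(per_rank)
    return [total for _ in per_rank]

x = 3.0               # replicated input, identical on both ranks
weights = [2.0, 5.0]  # each rank holds a different weight

# Forward: local products are partial results; allreduce yields y.
partial_y = [w * x for w in weights]
y = all_reduce_sum(partial_y)
print(y)  # [21.0, 21.0]

# Backward with loss = y, so dL/dy = 1 on every rank.
grad_y = [1.0 for _ in weights]
local_grad_x = [w * g for w, g in zip(weights, grad_y)]

# This allreduce is the step an f-style function performs in backward.
correct_grad_x = all_reduce_sum(local_grad_x)
print(local_grad_x)    # [2.0, 5.0]
print(correct_grad_x)  # [7.0, 7.0]
```

Without the final allreduce, each rank holds a different, plausible-looking gradient for x, and nothing raises an error. The true gradient is the sum of the weights, 7.0.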

While DTensor guarantees correct gradients, its global semantics frustrate some advanced users: “I want to write programs with local semantics, using ordinary tensors and communication operators, without pretending parallelism doesn’t exist.” Additionally, DTensor can have higher overhead in some patterns.

Design of SPMD Types

SPMD types describe how a tensor's data is distributed across nodes without requiring knowledge of how to stitch it back together (this is called local SPMD). The main types are:

varying: Tensors differ across nodes. The type says nothing about how to combine them, only that they differ.
partial: Tensors not only differ, but must also be reduced (summed) across nodes to obtain the true value. Linear operations are supported because they distribute over sums; non-linear operations are not.
invariant: Values are exactly the same on all nodes in the forward pass, and also the same in the backward pass.
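The restriction on partial tensors can be checked with a small numeric example. This is a plain-Python sketch in which two scalar shards stand in for the per-node pieces of a partial tensor; the values and the relu helper are chosen only for illustration.

```python
def relu(v):
    return max(v, 0.0)

shards = [3.0, -5.0]      # partial: the true value is the sum of the shards
true_value = sum(shards)  # -2.0

# Linear op (scaling by 2): applying it per shard and then reducing
# gives the same result as reducing first.
scaled_then_reduced = sum(2.0 * s for s in shards)
reduced_then_scaled = 2.0 * true_value
print(scaled_then_reduced, reduced_then_scaled)  # -4.0 -4.0

# Non-linear op (ReLU): applying it per shard gives a different answer.
relu_then_reduced = sum(relu(s) for s in shards)
reduced_then_relu = relu(true_value)
print(relu_then_reduced, reduced_then_relu)  # 3.0 0.0
```

Because the per-shard ReLU result (3.0) disagrees with the ReLU of the true value (0.0), a non-linear operation on a partial tensor has to be preceded by a reduction.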

Value Proposition

For Megatron-style code that uses communication operators explicitly, you only need to add type annotations to inputs and weights and use special communication operators that propagate types. A program that passes type checking is guaranteed to produce correct gradients. If you forget one of the required communication calls, the type checker warns you and indicates where a communication operator is missing. Type checking is entirely optional, has no runtime overhead, and can be run in unit tests. All existing communication patterns work as-is.
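The flavor of such a check can be sketched with a toy checker. Everything here is hypothetical: the SPMD enum, the check_* functions, and the typing rules (linear ops preserve the type, non-linear ops reject partial, all_reduce turns partial into invariant) are simplified assumptions for illustration and are not PyTorch's actual API or rules. The toy also checks while the program runs, whereas the real proposal is described as having no runtime overhead.

```python
from enum import Enum

class SPMD(Enum):
    VARYING = "varying"
    PARTIAL = "partial"
    INVARIANT = "invariant"

class SPMDTypeError(TypeError):
    pass

def check_linear(t):
    """Assumed rule: linear ops keep the input type, including partial."""
    return t

def check_nonlinear(t, op_name):
    """Assumed rule: non-linear ops are rejected on partial tensors."""
    if t is SPMD.PARTIAL:
        raise SPMDTypeError(f"{op_name} on a partial tensor: missing all_reduce?")
    return t

def check_all_reduce(t):
    """Assumed rule: all_reduce consumes partial and yields invariant."""
    if t is not SPMD.PARTIAL:
        raise SPMDTypeError(f"all_reduce expects partial, got {t.value}")
    return SPMD.INVARIANT

def buggy_program(h):
    # Forgets the reduction before the non-linearity.
    return check_nonlinear(check_linear(h), "relu")

def fixed_program(h):
    return check_nonlinear(check_all_reduce(check_linear(h)), "relu")

try:
    buggy_program(SPMD.PARTIAL)
    buggy_error = None
except SPMDTypeError as exc:
    buggy_error = str(exc)

fixed_result = fixed_program(SPMD.PARTIAL)
print(buggy_error)
print(fixed_result)
```

The buggy program is rejected with a message pointing at the missing reduction, while the fixed program type-checks. A check of this kind can live in a unit test, as the talk suggests.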

This idea is inspired by JAX’s sharding in types (especially the undocumented reduced/unreduced types) and has been adapted to the PyTorch ecosystem.

Source: @PyTorch: At #PyTorchCon Europe 2026, @ezyang (@Meta) explains why many developers find tensor parallelism difficult to work with… (https://www.youtube.com/watch?si=zf4P0RFhNB6tx6zh&v=xvNh5F9t-d4&feature=youtu.be)
