@PyTorch: At #PyTorchCon Europe 2026, @ezyang (@Meta) explains why many developers find tensor parallelism difficult to work with…
Summary
At PyTorchCon Europe 2026, Edward Yang explains PyTorch's new pre-compilation support for distributed training and SPMD type system to help developers write correct tensor parallelism code, addressing common pitfalls in gradient correctness.
At #PyTorchCon Europe 2026, @ezyang (@Meta) explains why many developers find tensor parallelism difficult to work with and how PyTorch is exploring SPMD types to catch mistakes through type checking. Watch the full keynote: https://youtu.be/xvNh5F9t-d4?si=zf4P0RFhNB6tx6zh…
TL;DR: Edward Yang, a core PyTorch maintainer, presented two topics at PyTorchCon Europe 2026: torch.compile pre-compilation support, which addresses slow and inconsistent JIT compilation in distributed training, and the SPMD type system, which helps developers write tensor parallelism code that produces correct gradients.
Current State of PyTorch Ecosystem
As of early March 2026, PyTorch’s GitHub stars and contributions continue to grow. Over the past 12 months, it received 15,000 contributions to the core library and added 1,250 new contributors. PyTorch is used by 90% of research projects and 90% of open-source projects at top machine learning conferences. Notably, usage of dtensor and device mesh is steadily increasing, with over a thousand open-source projects adopting these new features, indicating strong community demand for distributed training optimizations.
Pre-compilation Support
Why Pre-compilation?
Currently, torch.compile uses just-in-time (JIT) compilation: at the start of every run, each node independently runs the compilation process. This introduces two problems in distributed training:
- Slowness: Every time a job is restarted, recompilation is required; cache warm-up takes a long time, and cache lookups must fully trace through Dynamo.
- Inconsistent results: Different nodes may execute different paths. For example, dynamic shapes are common in recommendation systems: one node may recompile because it sees a different shape distribution, and the other nodes time out while waiting for it. Even if the compiler is deterministic, different graphs may lead to different optimization choices, and communication operations that end up reordered differently across nodes can cause deadlocks.
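The divergence described above can be pictured with a toy model. The sketch below is not torch.compile's real caching or guard logic; `count_compiles` and the shape streams are hypothetical, chosen only to show how two ranks that see different shape distributions end up compiling a different number of times.

```python
# Toy model of per-rank JIT caches. This is NOT torch.compile's real guard
# logic; it only illustrates why ranks can diverge under dynamic shapes.

def count_compiles(shape_stream):
    """Return how many times a rank compiles, given the input shapes it sees."""
    cache = set()      # shapes this rank has already compiled for
    compiles = 0
    for shape in shape_stream:
        if shape not in cache:
            cache.add(shape)   # cache miss: the rank stalls here to compile
            compiles += 1
    return compiles

# Hypothetical batch shapes seen by two ranks in a recommendation workload.
rank0_shapes = [(32, 128), (32, 128), (32, 128)]
rank1_shapes = [(32, 128), (48, 128), (64, 128)]

compiles_per_rank = [count_compiles(rank0_shapes), count_compiles(rank1_shapes)]
print(compiles_per_rank)  # [1, 3]
```

In this model rank 1 pauses to compile twice more than rank 0. In a real job, rank 0 would sit in a collective during those pauses, which is the timeout scenario from the talk.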
Pre-compilation Approach
Core idea: Compile once upfront, then distribute the compilation artifacts to all nodes. At runtime, no compilation occurs, ensuring all nodes execute the same code.
Two issues need to be addressed:
- Full model capture: To compile ahead of time, we must know everything that needs to be compiled. Providing a full computation graph makes this much simpler.
- Compile once, run anywhere: Code must not depend on the specific process rank; it must be written in a generic way.
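As a rough illustration of the two requirements above, the following sketch writes a single rank-agnostic artifact to a directory and has every simulated rank load and execute it. Everything here is hypothetical (the JSON "plan" format, `precompile`, `run_from_artifact`); it does not reflect how torch.compile artifacts are stored. The point is only that the rank enters as a runtime input and is never baked into the artifact.

```python
import json
import os
import tempfile

def precompile(artifact_dir):
    """'Compile' once and write a rank-agnostic artifact to disk."""
    plan = {"ops": ["scale", "add_rank"], "scale": 2.0}
    path = os.path.join(artifact_dir, "plan.json")
    with open(path, "w") as f:
        json.dump(plan, f)
    return path

def run_from_artifact(path, rank, x):
    """Load the shared artifact and execute it; rank is a runtime input."""
    with open(path) as f:
        plan = json.load(f)
    for op in plan["ops"]:
        if op == "scale":
            x = x * plan["scale"]
        elif op == "add_rank":
            x = x + rank   # rank-dependent behavior without a rank-specific artifact
    return x

with tempfile.TemporaryDirectory() as artifact_dir:
    artifact = precompile(artifact_dir)            # done once, upfront
    outputs = [run_from_artifact(artifact, rank, 1.0) for rank in range(4)]

print(outputs)  # [2.0, 3.0, 4.0, 5.0]
```

Because all four simulated ranks read the same file, they are guaranteed to execute the same sequence of operations, and no compilation happens at run time.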
Current Progress
Torch Titan now includes a Graph Trainer, specifically designed for the full model capture workflow. On Llama 3 it has been shown to be bitwise equivalent, with matching performance. The new API pre_compile_main accepts training parameters and dumps compilation artifacts to a directory. During actual training, you only need to specify the artifact location; the artifacts are then used directly, without retracing the model.
Next steps: Ensure it works for all models in Torch Titan, while also supporting cases where different nodes need different behavior (e.g., FSDP with uneven parameter sharding, or logging only on one node). The relevant PR was just merged into Torch Titan last week.
SPMD Types (Single Program Multiple Data Type System)
What Problem Does It Solve?
In traditional PyTorch, communication operators (e.g., allreduce) do not come with differentiable autograd functions. Frameworks like Megatron define their own custom autograd functions, but these are cryptic to write and use: for example, an f function that does nothing in the forward pass and performs an allreduce in the backward pass. If you forget to call the correct communication function, you get silently incorrect gradients.
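To see why forgetting such a function gives silently wrong gradients, here is a small simulation with two ranks and plain Python floats in place of tensors. It does not use torch.distributed; `all_reduce_sum` and the scalar weights are stand-ins assumed for illustration. The model is a two-layer tensor-parallel product in which rank r holds the shards w1[r] and w2[r], so the full output is x * (w1[0]*w2[0] + w1[1]*w2[1]).

```python
# Two simulated ranks; plain floats stand in for tensors.
x = 3.0              # input, replicated on every rank
w1 = [2.0, 5.0]      # rank r holds w1[r] (first-layer shard)
w2 = [7.0, 11.0]     # rank r holds w2[r] (second-layer shard)

def all_reduce_sum(per_rank_values):
    """Simulated allreduce: every rank ends up with the sum over ranks."""
    total = sum(per_rank_values)
    return [total for _ in per_rank_values]

# Forward: f is the identity on x, each rank computes a partial output,
# and an allreduce combines the partial outputs.
partial_out = [x * w1[r] * w2[r] for r in range(2)]
out = all_reduce_sum(partial_out)

# Backward w.r.t. x with upstream gradient 1.0: each rank only sees the
# contribution that flows through its own shards.
local_grad_x = [w1[r] * w2[r] for r in range(2)]

grad_x_without_f = local_grad_x               # forgot f: no allreduce in backward
grad_x_with_f = all_reduce_sum(local_grad_x)  # f's backward performs the allreduce

print(out)               # [207.0, 207.0]
print(grad_x_without_f)  # [14.0, 55.0]
print(grad_x_with_f)     # [69.0, 69.0]
```

Without the backward allreduce, each rank holds a different and incomplete gradient for x, yet nothing crashes. That is the failure mode the talk calls silent.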
While dtensor guarantees correct gradients, its global semantics frustrate some advanced users: “I want to write programs with local semantics, using ordinary tensors and communication operators, without pretending parallelism doesn’t exist.” Additionally, dtensor can have higher overhead in some patterns.
Design of SPMD Types
SPMD types describe how a tensor (data) is distributed across nodes, but without requiring knowledge of how to stitch it back together (called local SPMD). The main types are:
- varying: Tensors differ across nodes. No explanation of how to combine; just that they differ.
- partial: Tensors not only differ, but must be reduced to obtain the true value. Supports linear operations (because they distribute over sums), but not non-linear ones.
- invariant: Values are exactly the same on all nodes in the forward pass, and also the same in the backward pass.
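The restriction on partial values follows from simple arithmetic, which the sketch below checks with two simulated ranks. The values and the helper names `scale` and `relu` are chosen for illustration. A partial value is a list of per-rank numbers whose sum is the true value.

```python
partial = [3.0, -5.0]        # per-rank values; the true value is their sum
true_value = sum(partial)    # -2.0

def scale(v):
    """A linear operation."""
    return 2.0 * v

def relu(v):
    """A non-linear operation."""
    return max(v, 0.0)

# Linear ops distribute over the sum, so applying them per rank is safe.
linear_ok = sum(scale(p) for p in partial) == scale(true_value)

# Non-linear ops do not: applying relu per rank changes the reduced result.
nonlinear_ok = sum(relu(p) for p in partial) == relu(true_value)

print(linear_ok)     # True
print(nonlinear_ok)  # False
```

Here the per-rank relu results sum to 3.0, while relu of the true value is 0.0, so a non-linear operation must wait until the partial value has been reduced.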
Value Proposition
On Megatron-style code that explicitly uses communication operators, you only need to add type annotations to inputs and weights, and use special communication operators that propagate types. A program that passes type checking is guaranteed to produce correct gradients. If you forget to call some strange function, the type checker will warn you, indicating where a communication operator is missing. Type checking is entirely optional, has no runtime overhead, and can be run in unit tests. All existing communication patterns work as-is.
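A toy checker makes the workflow concrete. This is a minimal sketch and not the PyTorch SPMD types API: the names `SPMD`, `SPMDTypeError`, `linear_op`, `nonlinear_op`, and `all_reduce` are hypothetical, and the rule that all_reduce turns partial into invariant is a simplifying assumption of this sketch. Only types flow through the program; no tensor data is involved.

```python
from enum import Enum

class SPMD(Enum):
    VARYING = "varying"
    PARTIAL = "partial"
    INVARIANT = "invariant"

class SPMDTypeError(TypeError):
    pass

def linear_op(t):
    """Linear ops are allowed on every type and preserve it."""
    return t

def nonlinear_op(t):
    """Non-linear ops are rejected on partial values."""
    if t is SPMD.PARTIAL:
        raise SPMDTypeError("non-linear op on a partial value: missing all_reduce?")
    return t

def all_reduce(t):
    """Assumption of this sketch: all_reduce accepts partial, returns invariant."""
    if t is not SPMD.PARTIAL:
        raise SPMDTypeError(f"all_reduce expects a partial value, got {t.value}")
    return SPMD.INVARIANT

def check_correct_program():
    h = linear_op(SPMD.PARTIAL)
    h = all_reduce(h)           # reduce before the non-linearity
    return nonlinear_op(h)

def check_buggy_program():
    h = linear_op(SPMD.PARTIAL)
    return nonlinear_op(h)      # forgot the all_reduce

print(check_correct_program().value)   # invariant
try:
    check_buggy_program()
except SPMDTypeError as err:
    print("type error:", err)
```

A check of this kind can run in a unit test, on types alone, which is consistent with the claim that it adds no runtime overhead to training.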
This idea is inspired by JAX’s sharding in types (especially the undocumented reduced/unreduced types) and has been adapted to the PyTorch ecosystem.
Source: @PyTorch: At #PyTorchCon Europe 2026, @ezyang (@Meta) explains why many developers find tensor parallelism difficult to work with… (https://www.youtube.com/watch?si=zf4P0RFhNB6tx6zh&v=xvNh5F9t-d4&feature=youtu.be)