Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

Hugging Face Blog 2026/06/11 00:00 Tools

pytorch profiling linear mlp fusion triton gpu-optimization

Summary

This post continues the Profiling in PyTorch series. It covers nn.Linear, the MLP block, and fusion with Triton kernels as a way to improve performance.


Source: https://huggingface.co/blog/torch-mlp-fusion

From matmul-add to Linear (https://huggingface.co/blog/torch-mlp-fusion#from-matmul-add-to-linear)
- What is the transpose doing? (https://huggingface.co/blog/torch-mlp-fusion#what-is-the-transpose-doing)
- Why are there no separate mul and add kernels? (https://huggingface.co/blog/torch-mlp-fusion#why-are-there-no-separate-mul-and-add-kernels)
- Can --compile help a single Linear? (https://huggingface.co/blog/torch-mlp-fusion#can—compile-help-a-single-linear)
- Where did the transpose go? Kernel layouts and pre-ops (https://huggingface.co/blog/torch-mlp-fusion#where-did-the-transpose-go-kernel-layouts-and-pre-ops)
Stacking three Linears: the MLP (https://huggingface.co/blog/torch-mlp-fusion#stacking-three-linears-the-mlp)
- Why are there two types of GEMM kernels? (https://huggingface.co/blog/torch-mlp-fusion#why-are-there-two-types-of-gemm-kernels)
- What does torch.compile do? (https://huggingface.co/blog/torch-mlp-fusion#what-does-torchcompile-do)
- The fused Triton kernel (https://huggingface.co/blog/torch-mlp-fusion#the-fused-triton-kernel)
Let's use hand-tuned kernels (https://huggingface.co/blog/torch-mlp-fusion#lets-use-hand-tuned-kernels)
- Why use the kernels library (https://huggingface.co/blog/torch-mlp-fusion#why-use-the-kernels-library)
- Why tuned kernels are better (https://huggingface.co/blog/torch-mlp-fusion#why-tuned-kernels-are-better)
Conclusion (https://huggingface.co/blog/torch-mlp-fusion#conclusion)

Blog post thumbnail (https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/torch-mlp-fusion/thumbnail.png)

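In Part 1 of this series, "Profiling in PyTorch" (https://huggingface.co/blog/torch-profiler), we used torch.add(torch.matmul(x, w), b) to learn how to read a PyTorch profiler trace. We also covered several related topics: the CPU dispatch chain, launch overhead, the difference between overhead-bound and compute-bound regimes, and some internals of torch.compile.

In Part 2 (this post) we move one step up. We replace the hand-written matmul-add pair with nn.Linear (with bias=True), a basic building block found in almost every deep learning model. We then stack three such layers (for our example), with an activation function in between, to form a Multi-Layer Perceptron (MLP) block.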

The scripts for this post are here: 02_linear.py (https://huggingface.co/datasets/ariG23498/profiling-pytorch/blob/main/02_linear.py), 03_simple_mlp.py (https://huggingface.co/datasets/ariG23498/profiling-pytorch/blob/main/03_simple_mlp.py), and 03_kernels_mlp.py (https://huggingface.co/datasets/ariG23498/profiling-pytorch/blob/main/03_kernels_mlp.py). As before, it is best to open them in a new tab and follow the code as you read. We ran the scripts on an NVIDIA A100-SXM4-80GB GPU. It is easy to set up a GPU on Hugging Face infrastructure and experiment with these scripts using Dev Mode in Spaces (https://huggingface.co/docs/hub/spaces-dev-mode). You can also run them with Hugging Face Jobs (https://huggingface.co/docs/huggingface_hub/en/guides/jobs).

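Before we start, a quick recap of two concepts we will use repeatedly:

- A GPU kernel is a program that runs in parallel across many threads on the GPU.
- The CPU dispatches and launches these kernels. Most of the PyTorch overhead you see in a profiler trace is this dispatch work.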

From matmul-add to Linear

nn.Linear is a module wrapper around the same matrix-multiply and add operations we already profiled in Part 1 (https://huggingface.co/blog/torch-profiler). The only difference is that it owns its weight and bias as parameters and exposes the forward method PyTorch users already know.

```python
import torch.nn as nn

# bias=True mirrors the multiply-and-add we profiled in Part 1 of the series
linear_layer = nn.Linear(in_dim, out_dim, bias=True)
y = linear_layer(x)
```

The operation at this point can be written as:

y = x @ w.T + b

where x is the input, w is the weight, and b is the bias. Let us run 02_linear.py (https://huggingface.co/datasets/ariG23498/profiling-pytorch/blob/main/02_linear.py) and inspect the profile.

```bash
uv run 02_linear.py --batch 1024 --in_dim 32 --out_dim 64
uvx trace-util traces -b traces
```

trace-util (https://x.com/ariG23498/status/2054811716727517374) is a tool that syncs your traces to a Hugging Face bucket (https://huggingface.co/storage) and then prints a Perfetto URL (https://perfetto.dev/) in your terminal.

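PyTorch profiler trace: the CPU track of the nn.Linear forward pass shows three short Profile Steps with linear_fwd annotations, the GPU track shows one tiny kernel, and a long cudaDeviceSynchronize bar closes the trace (https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/torch-mlp-fusion/linear-profile-trace.png)
Figure 1: Profiler trace of nn.Linear

Figure 1 shows the profiler trace of the linear layer's forward call. We trace the forward call with a schedule similar to the one used for the earlier traces, with wait=1, warmup=1, and active=3. That is why we see three Profile Steps on the CPU and GPU tracks.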
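To see why wait=1, warmup=1, active=3 yields exactly three recorded steps, here is a minimal sketch that asks torch.profiler.schedule what it would do at each step. It only evaluates the schedule object and does not profile anything. That the script passes no other schedule arguments (such as repeat or skip_first) is an assumption.

```python
from torch.profiler import ProfilerAction, schedule

# Same settings as in the text: idle for 1 step, warm up for 1, record 3.
sched = schedule(wait=1, warmup=1, active=3)

# A schedule is a plain callable: step index -> action the profiler takes.
actions = [sched(step) for step in range(5)]
for step, action in enumerate(actions):
    print(step, action.name)

# Only RECORD and RECORD_AND_SAVE steps end up in the trace.
recorded = [a for a in actions
            if a in (ProfilerAction.RECORD, ProfilerAction.RECORD_AND_SAVE)]
print(len(recorded))
```

The first two steps are skipped or used for warmup, so only the last three appear as Profile Steps in the trace.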

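What is the transpose doing?

Zoomed-in CPU dispatch chain showing an aten::t transpose op nested before aten::addmm inside aten::linear, with no matching activity on the GPU track (https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/torch-mlp-fusion/transpose-cpu-dispatch.png)
Figure 2: The transpose row on the CPU

If we zoom into the profiler trace, as in Figure 2, we notice an aten::t (transpose) op before the aten::addmm (multiply and add) op. We can already infer that nn.Linear transposes the weight parameter and then multiplies it with the input, which is why the aten::t op shows up.

An important point: aten::t does not actually copy or reorganize data. It only rewrites the tensor metadata (shape and strides) on the CPU so that it represents the transposed matrix. It does not launch a kernel on the GPU. You can verify this in two ways: look at the GPU track in the trace, or check the aten::t row in the profiler table and the time it spends on CUDA.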

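Why are there no separate `mul` and `add` kernels?

Profiler trace of the linear layer with the dispatch chain highlighted, showing aten::linear, aten::t, and aten::addmm but no separate aten::add op (https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/torch-mlp-fusion/no-aten-add.png)
Figure 3: No aten::add in the linear layer's profile

As Figure 3 shows, there is no aten::add (bias add) in the linear layer's dispatch chain. The bias add has been "folded" into the matrix-multiply kernel using what is called an epilogue.

An epilogue is a small computation that a GEMM (GEneral Matrix Multiply) kernel performs at the very end, just before it writes its result back to HBM (High Bandwidth Memory, the GPU's main memory). Adding a bias, applying an activation function, or multiplying by a constant are classic epilogues. The point of an epilogue is to avoid a second round trip to HBM, because memory traffic is what makes the operation expensive.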
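The saving can be illustrated with a toy model in pure Python. This is only a sketch of the idea, not how cuBLAS or CUTLASS implement it: the function names and the traffic counters are hypothetical, and only accesses to the output matrix are counted (reads of x, the weight, and the bias are ignored).

```python
def gemm_then_add(x, w_t, b):
    """Two passes: a GEMM writes its result, then a second pass reads it back to add b."""
    rows, inner, cols = len(x), len(w_t), len(w_t[0])
    out = [[0.0] * cols for _ in range(rows)]
    traffic = {"out_reads": 0, "out_writes": 0}
    for i in range(rows):                      # pass 1: plain matmul
        for j in range(cols):
            out[i][j] = sum(x[i][k] * w_t[k][j] for k in range(inner))
            traffic["out_writes"] += 1
    for i in range(rows):                      # pass 2: separate bias add
        for j in range(cols):
            out[i][j] = out[i][j] + b[j]
            traffic["out_reads"] += 1
            traffic["out_writes"] += 1
    return out, traffic


def gemm_with_epilogue(x, w_t, b):
    """One pass: the bias is added to the accumulator before the single write."""
    rows, inner, cols = len(x), len(w_t), len(w_t[0])
    out = [[0.0] * cols for _ in range(rows)]
    traffic = {"out_reads": 0, "out_writes": 0}
    for i in range(rows):
        for j in range(cols):
            acc = sum(x[i][k] * w_t[k][j] for k in range(inner))
            out[i][j] = acc + b[j]             # epilogue: applied before write-back
            traffic["out_writes"] += 1
    return out, traffic


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]         # 2 x 3 input
w_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # 3 x 2, weight already transposed
b = [10.0, 20.0]

out_two, traffic_two = gemm_then_add(x, w_t, b)
out_fused, traffic_fused = gemm_with_epilogue(x, w_t, b)
print(out_two == out_fused)
print(traffic_two, traffic_fused)
```

Both versions produce the same numbers, but the two-pass version touches every output element three times (write, read, write) while the epilogue version writes it once.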

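nn.Linear calls torch.nn.functional.linear, which in turn calls aten::linear. aten::linear looks at its inputs, notices that a bias was passed, and therefore dispatches aten::addmm(bias, x, weight.t()), where the last argument is the transposed view produced by aten::t, instead of running matmul and add separately. In terms of the layer's weight, addmm computes: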

out = x @ weight.T + bias

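The cuBLAS GEMM kernel that runs on the GPU has a variant with the bias add built in, and that is the kernel aten::addmm selects. The add never shows up as a separate kernel because it is part of the matmul kernel's write-back, which is exactly what an epilogue is.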
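The equivalence between the three ways of writing this computation can be checked numerically. The sketch below runs on the CPU in float32, so it says nothing about which GPU kernel gets picked. The tensor sizes mirror the command-line arguments used above, and the tolerance is an illustrative choice.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1024, 32)       # (batch, in_dim)
weight = torch.randn(64, 32)    # (out_dim, in_dim), the layout nn.Linear stores
bias = torch.randn(64)

via_linear = F.linear(x, weight, bias)
via_addmm = torch.addmm(bias, x, weight.t())   # bias + x @ weight.T in one op
via_formula = x @ weight.T + bias              # separate matmul, then add

print(torch.allclose(via_linear, via_addmm, atol=1e-4))
print(torch.allclose(via_linear, via_formula, atol=1e-4))
```

All three agree up to floating-point rounding. What differs between them is how many ops are dispatched, which is what the trace shows.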

There is a subtlety to note here. The kernel (addmm) you saw under --compile in Part 1 (https://huggingface.co/blog/torch-profiler#did-we-fuse-the-matmul-and-add-kernels-into-one) is the very kernel nn.Linear already uses in eager mode. There is nothing left for torch.compile to fuse here, and we verify that next.

Can `--compile` help a single Linear?

Let us compile the forward call and look at the profiler trace. (The trace is visualized in the next section (https://huggingface.co/blog/torch-mlp-fusion#where-did-the-transpose-go-kernel-layouts-and-pre-ops).)

```bash
uv run 02_linear.py --batch 1024 --in_dim 32 --out_dim 64 --compile
uvx trace-util traces -b traces
```

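If you compare the eager and compiled traces for the forward of a single nn.Linear, you will find:

- The same single cuBLAS GEMM kernel on the GPU.
- The same aten::addmm op on the CPU.
- A few extra compile-specific rows on the CPU track.

This is worth internalizing. A common reflex is to reach for torch.compile whenever a model feels slow. For a single GEMM with a bias, compile has almost nothing to do. That is not a bug: compile needs multiple ops before fusion becomes possible. We will show this by looking at the MLP (https://huggingface.co/blog/torch-mlp-fusion#stacking-two-linears-the-mlp).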

Where did the transpose go? Kernel layouts and pre-ops

Readers who compare the two traces (eager vs. compile) closely will notice that the eager CPU dispatch chain contains more entries than the compiled one.

Eager CPU dispatch chain, with the aten::t transpose and aten::addmm boxed separately under aten::linear (https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/torch-mlp-fusion/eager.png)
Figure 4: Eager dispatch chain, where aten::linear goes through aten::t (transpose) and then aten::addmm

Compiled CPU dispatch chain, showing a Torch-Compiled Region and a single aten::addmm call, with no transpose op (https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/torch-mlp-fusion/compile.png)
Figure 5: Compiled dispatch chain, calling aten::addmm directly with no transpose

The eager CPU dispatch chain inside aten::linear is aten::t followed by aten::addmm (Figure 4). To understand what aten::t actually does, we need a quick look at strides and views.

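A tensor stores its data as one flat, contiguous sequence of numbers in memory. shape and stride are metadata that sit on top of that sequence and tell PyTorch how to walk it: a stride of (s0, s1) means "moving one row takes s0 elements, moving one column takes s1 elements". Change the metadata and you get a different view of the same raw data, with no copy: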

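```python
>>> import torch
>>> M = torch.tensor([[0, 1],
...                   [2, 3],
...                   [4, 5]])
>>> M.shape, M.stride()
(torch.Size([3, 2]), (2, 1))  # two steps per row, one step per column

>>> T = M.t()  # transpose
>>> T.shape, T.stride()
(torch.Size([2, 3]), (1, 2))  # shape and stride swap, data unchanged
>>> T
tensor([[0, 2, 4],
        [1, 3, 5]])
>>> T.flatten()  # forces materialization, so the data is reordered
tensor([0, 2, 4, 1, 3, 5])
```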

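M.t() did not move any numbers. It returned a new view with swapped strides, so reading it row by row now walks the original buffer 0, 1, 2, 3, 4, 5 in transposed order. The underlying data is exactly the same; only the metadata differs.

This is exactly what aten::t does inside the linear layer: it does not allocate a new tensor or copy any data, it produces a view of the weight with rewritten strides.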
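The stride rule itself fits in a few lines of plain Python. The sketch below is a hypothetical helper, not PyTorch's implementation: it reads element (i, j) of a 2-D view at offset i * s0 + j * s1 in a flat buffer, and shows that swapping shape and stride is enough to "transpose" without touching the buffer.

```python
def read_rows(buffer, shape, stride):
    """Read a 2-D view row by row from a flat buffer using only shape and stride."""
    rows, cols = shape
    s0, s1 = stride
    return [[buffer[i * s0 + j * s1] for j in range(cols)] for i in range(rows)]


buffer = [0, 1, 2, 3, 4, 5]          # the flat storage; never modified below

m = read_rows(buffer, shape=(3, 2), stride=(2, 1))   # original 3 x 2 matrix
t = read_rows(buffer, shape=(2, 3), stride=(1, 2))   # transpose: metadata swapped

print(m)   # [[0, 1], [2, 3], [4, 5]]
print(t)   # [[0, 2, 4], [1, 3, 5]]
```

The buffer is shared by both reads. The only thing that changed between m and t is the pair of small tuples passed in, which is why this costs CPU time but no GPU kernel.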

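As Figure 5 shows, compile did not remove a GPU kernel: it removed the CPU overhead of dispatching that view. Inductor traced the whole view chain at compile time, computed the final strides once, and emitted a single aten::addmm call that carries those hard-coded strides. A few microseconds of CPU work disappear, while the GPU performs exactly the same math.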

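As expected, the compiled code is only valid for those strides: when the input data violates the strides the compiler precomputed, it raises an error (or, under torch.compile's default guards, triggers a recompile).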

If you look at the GPU track in both traces, there is exactly one kernel per forward pass, and it is the same kernel both times:

cutlass_80_wmma_tensorop_bf16_s161616gemm_bf16_32x32_32x1_tn_align8

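If no transpose kernel runs, who told the GEMM to read the weight matrix in transposed order? The answer is in the kernel name. Look at the suffix: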

```
cutlass_80_wmma_tensorop_bf16_s161616gemm_bf16_32x32_32x1_tn_align8
                                                          ^^
```

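This tn is the layout descriptor. cuBLAS and CUTLASS ship separate precompiled kernel binaries for each combination of input layouts.

n (non-transposed) and t (transposed) describe how the kernel walks its inputs in its inner loop. The dispatcher's job is to look at the input strides, decide which suffix combination matches, and pick the right precompiled kernel.

The kernel name in a profiler trace is effectively an encoded dump of that kernel's identity. If two runs show the same kernel name, the GPU is doing the same work. If the names differ (for example _tn_ vs _nn_, bf16 vs fp16, or s16816gemm vs s161616gemm), the GPU is doing different work and the dispatcher took a different branch. Learning to read this name is one of the most useful habits when comparing traces.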
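Comparing names by eye gets tedious across many traces, so here is a small sketch that splits a kernel name on underscores and pulls out the fields discussed above. The field labels (layout, dtypes, gemm_tag, alignment) are my own and the parsing is a heuristic for names shaped like the one in this post, not an official naming scheme. The second name is hypothetical, built by swapping the layout tag for illustration.

```python
def describe_kernel(name):
    """Pull a few readable fields out of a CUTLASS-style kernel name (heuristic)."""
    info = {"layout": None, "dtypes": [], "gemm_tag": None, "alignment": None}
    for part in name.split("_"):
        if part in ("nn", "nt", "tn", "tt"):
            info["layout"] = part
        elif part in ("bf16", "fp16", "f16", "tf32", "f32"):
            info["dtypes"].append(part)
        elif part.startswith("s") and part.endswith("gemm"):
            info["gemm_tag"] = part[1:-4]          # e.g. "161616" or "16816"
        elif part.startswith("align"):
            info["alignment"] = int(part[len("align"):])
    return info


def differing_fields(name_a, name_b):
    """Return the parsed fields on which two kernel names disagree."""
    a, b = describe_kernel(name_a), describe_kernel(name_b)
    return [key for key in a if a[key] != b[key]]


seen = "cutlass_80_wmma_tensorop_bf16_s161616gemm_bf16_32x32_32x1_tn_align8"
# Hypothetical name for illustration only: same kernel family, different layout.
other = seen.replace("_tn_", "_nn_")

print(describe_kernel(seen))
print(differing_fields(seen, other))
```

An empty list from differing_fields means the parsed fields match. Any non-empty result points at the part of the name worth a closer look.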

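Stacking three Linears: the MLP

In this section we profile a Multi-Layer Perceptron (MLP). To make it more interesting, we profile a feed-forward network with the GeGLU activation variant, which is very common in practice. It is also our way of paying homage to one of the greatest lines in the history of deep learning research (Figure 6).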

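```python
import torch.nn as nn
import torch.nn.functional as F


class SimpleGeGLUMLP(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden, bias=False)
        self.up_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        g = self.gate_proj(x)              # gate projection: dim -> hidden
        u = self.up_proj(x)                # up projection: dim -> hidden
        h = F.gelu(g, approximate="tanh")  # GeLU applied to the gate
        m = h * u                          # elementwise gating
        y = self.down_proj(m)              # project back: hidden -> dim
        return y
```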

The full script is here: 03_simple_mlp.py (https://huggingface.co/datasets/ariG23498/profiling-pytorch/blob/main/03_simple_mlp.py). Run it as follows:

```bash
uv run 03_simple_mlp.py --batch 64 --seq 128 --dim 768 --hidden 3072
uvx trace-util traces -b traces
```

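Before opening the trace, let us think through what we should see. The forward function does a fair amount of computation, but most of it is already familiar.

We should expect three aten::linear dispatches, one per nn.Linear layer. We should also expect two elementwise kernel launches, one for GeLU and one for the multiply. Forming this expectation before looking is the most useful habit in profiling: you read a trace to confirm or refute a guess, not to form one from scratch.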
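The same expectation can be written down as back-of-the-envelope arithmetic before opening the trace. The sketch below uses the sizes from the command above and assumes 2-byte (bf16) activations, which is an assumption about the script rather than something stated here. It counts only the traffic of the two elementwise ops, and the "fused" figure is what a single hypothetical kernel computing gelu(g) * u in one pass would move.

```python
batch, seq, dim, hidden = 64, 128, 768, 3072   # values from the command above
bytes_per_elem = 2                             # assumption: bf16 activations

tokens = batch * seq                           # rows seen by every Linear
expected_kernels = {"gemm": 3, "elementwise": 2}

# g, u, h, and m all have shape (tokens, hidden), so they share one size.
hidden_tensor_bytes = tokens * hidden * bytes_per_elem

# Eager: gelu reads g and writes h (2 tensors); mul reads h and u, writes m (3).
eager_elementwise_bytes = (2 + 3) * hidden_tensor_bytes
# One fused elementwise kernel would read g and u and write m (3 tensors).
fused_elementwise_bytes = 3 * hidden_tensor_bytes

MiB = 2 ** 20
print(tokens, sum(expected_kernels.values()))
print(hidden_tensor_bytes // MiB, "MiB per hidden-sized tensor")
print(eager_elementwise_bytes // MiB, "MiB eager elementwise traffic")
print(fused_elementwise_bytes // MiB, "MiB if the two elementwise ops were fused")
```

Under these assumptions each hidden-sized tensor is 48 MiB, so the GeLU and the multiply together move about 240 MiB through HBM in eager mode. That number is the guess to hold against the trace.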

GeGLU MLP forward-pass profiler trace, with five boxed groups on the CPU track labeled linear, linear, gelu, mul, linear (https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/torch-mlp-fusion/simple-mlp-eager.png)
Figure 7: Profiler trace of the GeGLU MLP

Occupancy query highlighted in the linear projection trace (
