Modern GPU Programming for MLSys

Hacker News Top 2026/06/23 11:38 News

gpu-programming ml-systems blackwell-architecture flash-attention gemm-optimization machine-learning tirx

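Summary

A new book from CMU's machine learning systems course teaches modern GPU programming for ML systems. It covers the Blackwell architecture, GEMM, and FlashAttention, and uses the TIRx Python DSL.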


Cached: 2026/06/26 20:20

# Modern GPU Programming for MLSys

Source: https://mlc.ai/modern-gpu-programming-for-mlsys/

Machine learning systems are at the core of modern AI workloads. In these systems, performance often depends on the quality of a small number of GPU kernels. Attention kernels, LLM prefill and decode kernels, low-precision block-scaled GEMMs, fused MoE layers, and other large fused kernels all directly affect the end-to-end speed of training and inference serving.

Making these kernels fast, however, takes more than a list of optimization tricks. Modern GPUs are no longer simple variations of one design. Recent architectures introduce richer memory spaces, new access patterns, and increasingly specialized execution units. Programming them efficiently requires both a clear mental model of the hardware and a practical understanding of how high-performance kernels are built. This book aims to build both.

The book follows a simple progression: first understand the GPU hardware, then learn the programming model we use, then build state-of-the-art kernels step by step. The main target architecture is the Blackwell generation, and the main running examples are fast matrix multiplication (GEMM) and FlashAttention. Along the way, we also study the core ingredients of GPU optimization: data layout, asynchronous data movement, and asynchronous coordination.

The book grew out of the Machine Learning Systems (https://mlsyscourse.org/) course series at Carnegie Mellon University. To make the concepts easier to learn and run, the book uses the **TIRx** Python DSL to build real GPU kernel examples step by step. TIRx stays close to the hardware, so we can reason about low-level control while running code.

## How This Book Is Organized

- **Part I: Understanding the GPU.** Introduces the overall organization of the GPU, a general approach to writing fast kernels, and key concepts such as data layout, asynchronous memory operations, and coordination. It builds the hardware intuition used in the rest of the book.
- **Part II: TIRx Overview.** Introduces the key elements of TIRx and lays the groundwork for the code examples throughout the book.
- **Part III: GEMM: From Tiling to SOTA.** Walks through the full optimization of a tiled GEMM using TMA pipelining, persistent scheduling, warp specialization, and 2-CTA clusters.
- **Part IV: Flash Attention 4.** A complete attention kernel built on the techniques from Part III: two MMAs with softmax, online softmax rescaling, causal masking, and GQA.
- **Reference.** The TIRx language reference and compiler internals.

**Part III: GEMM: From Tiling to SOTA**

- Building a Tiled GEMM (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_basics/index.html)
  - GEMM (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_basics/index.html#gemm)
  - Optimization Path (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_basics/index.html#optimization-path)
  - Step 1: Sequential Single-Tile GEMM (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_basics/index.html#step-1-sequential-single-tile-gemm)
  - Step 2: K-Loop Accumulation (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_basics/index.html#step-2-k-loop-accumulation)
  - Step 3: Spatial Tiling (Multi-CTA) (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_basics/index.html#step-3-spatial-tiling-multi-cta)
  - Exercises (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_basics/index.html#exercises)
- Pipelining GEMM with TMA (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_async/index.html)
  - Step 4: TMA Async Load (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_async/index.html#step-4-tma-async-load)
  - Step 5: Software Pipeline (PIPE_DEPTH=2) (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_async/index.html#step-5-software-pipeline-pipe-depth-2)
  - Step 6: Persistent Kernel + Tile Scheduler (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_async/index.html#step-6-persistent-kernel-tile-scheduler)
  - Exercises (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_async/index.html#exercises)
- Scaling GEMM with Warp Specialization and Clusters (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_advanced/index.html)
  - Step 7: Warp Specialization + Pipeline (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_advanced/index.html#step-7-warp-specialization-pipeline)
  - Step 8: 2-CTA Cluster (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_advanced/index.html#step-8-2-cta-cluster)
  - Step 9: Multi-Consumer Warp Specialization (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_advanced/index.html#step-9-multi-consumer-warp-specialization)
  - End-to-End Result (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_advanced/index.html#end-to-end-result)
  - Exercises (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_gemm_advanced/index.html#exercises)

**Part IV: Flash Attention 4**

- Flash Attention 4 (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html)
  - Algorithm Shape (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#algorithm-shape)
  - Tile-Primitive Graph (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#tile-primitive-graph)
  - Warp Roles and Scopes (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#warp-roles-and-scopes)
  - Reading the Fragments (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#reading-the-fragments)
  - The Two MMA Phases (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#the-two-mma-phases)
  - TMEM Layout and Reuse (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#tmem-layout-and-reuse)
  - How Barriers Connect the Roles (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#how-barriers-connect-the-roles)
  - Pipelining Structure (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#pipelining-structure)
  - Rescaling and Writeback (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#rescaling-and-writeback)
  - Causal Masking (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#causal-masking)
  - GQA Support (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#gqa-support)
  - Tile Scheduling (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#tile-scheduling)
  - Compile and Verify (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#compile-and-verify)
  - Differences from GEMM (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#differences-from-gemm)
  - Exercises (https://mlc.ai/modern-gpu-programming-for-mlsys/chapter_flash_attention/index.html#exercises)
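
The Part IV outline mentions online softmax rescaling. As a rough illustration of that idea only, here is a minimal sketch in plain Python. It is not TIRx code and not the book's kernel. It handles a single query vector on the CPU, and the function names `attention_reference` and `attention_online` are hypothetical, chosen for this example. The sketch processes keys and values one block at a time and rescales the partial results whenever a new block raises the running maximum score. The output matches ordinary softmax attention up to floating-point rounding.

```python
import math


def attention_reference(q, keys, values):
    """Plain softmax attention for one query: needs all scores before normalizing."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_online(q, keys, values, block_size=2):
    """Same result, computed one key/value block at a time with online softmax."""
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    acc = [0.0] * dim            # unnormalized output, relative to running_max

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results so they are expressed against new_max.
        # On the first block this factor is exp(-inf) == 0.0, which is harmless
        # because running_sum and acc are still zero.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]

        # Fold the current block into the running sum and accumulator.
        for s, v in zip(scores, v_blk):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * vd for a, vd in zip(acc, v)]
        running_max = new_max

    # A single normalization at the end replaces the full-row softmax.
    return [a / running_sum for a in acc]
```

Calling `attention_online(q, keys, values, block_size=b)` with any block size should agree with `attention_reference(q, keys, values)` up to rounding. Because each block is consumed once and only the running maximum, running sum, and accumulator are kept, attention can be computed tile by tile without materializing the full score row.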


