@0x0SojalSec: Fuck paid courses, master GPU engineering for AI systems. From foundational books and CUDA/ROCm programming to low-level…

X AI KOLs Timeline 2026/07/02 20:27 Tools

gpu-engineering cuda rocm ai-acceleration distributed-training high-performance-computing awesome-list

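Summary

A curated list of resources for mastering GPU engineering for AI systems, covering CUDA, ROCm, optimization tools, multi-GPU orchestration, and distributed training.

Fuck paid courses, master GPU engineering for AI systems. From foundational books and CUDA/ROCm programming to low-level optimization, Nsight tools, multi-GPU orchestration, distributed training, and AI acceleration techniques. A solid reference for embedded GPU work or large-scale AI infrastructure. The curated collection covers:

- CUDA & ROCm programming
- Kernel optimization and performance tools
- Multi-GPU systems and distributed training
- Architecture deep dives, Triton, CUTLASS, and more

For anyone working on high-performance AI infrastructure, kernel development, or system-level GPU work, this is a goldmine.

http://github.com/goabiaryan/awesome-gpu-engineering…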

Cached: 2026/07/03 08:32

goabiaryan/awesome-gpu-engineering

Source: https://github.com/goabiaryan/awesome-gpu-engineering

Awesome GPU Engineering Awesome (https://awesome.re)

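A curated list of resources to help you master GPU engineering, from architecture and kernel programming to large-scale distributed systems and AI acceleration.

📘 Foundational Books

Programming Massively Parallel Processors: A Hands-on Approach - David B. Kirk & Wen-mei W. Hwu. The definitive introduction to CUDA, memory hierarchies, and parallel patterns. Amazon (https://www.amazon.com/Programming-Massively-Parallel-Processors-Hands/dp/0323912311), notes: Abi's condensed notes (https://github.com/goabiaryan/awesome-gpu-engineering/blob/main/notes/Abi’s%20PMPP%20Notes.pdf)
CUDA by Example - Jason Sanders & Edward Kandrot. A practical, beginner-friendly introduction to CUDA. Amazon (https://www.amazon.com/CUDA-Example-Introduction-General-Purpose-Programming/dp/0131387685)
The Ultra-Scale Playbook: Training LLMs on GPU Clusters - Hugging Face. Online version (https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=high-level_overview)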

💻 GPU Programming Frameworks

CUDA (https://developer.nvidia.com/cuda-toolkit) - NVIDIA's proprietary GPU programming platform.
- Libraries: cuBLAS (https://developer.nvidia.com/cublas), cuDNN (https://developer.nvidia.com/cudnn)
ROCm (https://github.com/RadeonOpenCompute/ROCm) - AMD's open-source compute stack.
OpenCL (https://www.khronos.org/opencl/) - Cross-platform standard for parallel computing.
SYCL / oneAPI (https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html) - C++ abstraction layer for heterogeneous computing (SYCL is a Khronos standard; oneAPI is Intel's implementation of it).
Vulkan Compute (https://www.khronos.org/vulkan/) - Low-level GPU compute API.
Kompute - High-level, general-purpose GPU compute framework built on Vulkan.
Metal Performance Shaders (https://developer.apple.com/metal/) - Apple's GPU framework.
Mojo🔥 (https://mojolang.org) - Write like Python, run like C++.

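🧩 Optimization & Performance

NVIDIA Nsight Systems (https://developer.nvidia.com/nsight-systems) - System-wide GPU profiler.
Nsight Compute (https://developer.nvidia.com/nsight-compute) - Kernel-level performance analysis.
Occupancy Calculator - NVIDIA's spreadsheet for choosing kernel launch configurations.
CUTLASS (https://github.com/NVIDIA/cutlass) - CUDA templates for linear algebra subroutines.
TensorRT (https://developer.nvidia.com/tensorrt) - High-performance deep learning inference engine.
OpenAI Triton (https://triton-lang.org/) - Python DSL for writing high-performance GPU kernels.
Helion (https://helionlang.com) - A Python-embedded DSL for writing fast, scalable ML kernels with minimal boilerplate.
Roofline Model (https://jax-ml.github.io/scaling-book/) - Analytical model for identifying compute and memory bottlenecks.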
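
The roofline model in the last entry can be illustrated with a few lines of arithmetic. This is a minimal sketch, not taken from the linked resource: the hardware figures (100 TFLOP/s peak, 2 TB/s bandwidth) are hypothetical round numbers, and the function names are introduced here only for illustration. The model bounds a kernel's throughput by the lower of the compute roof and memory bandwidth times arithmetic intensity (FLOPs per byte moved).

```python
# Hypothetical hardware figures, chosen for round numbers (assumptions,
# not taken from any datasheet).
PEAK_FLOPS = 100e12      # compute roof: 100 TFLOP/s
MEM_BANDWIDTH = 2e12     # memory roof: 2 TB/s


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, bandwidth=MEM_BANDWIDTH):
    """Roofline bound in FLOP/s for a kernel with the given FLOP/byte intensity."""
    return min(peak_flops, bandwidth * intensity)


def ridge_point(peak_flops=PEAK_FLOPS, bandwidth=MEM_BANDWIDTH):
    """Intensity (FLOP/byte) at which a kernel stops being memory-bound."""
    return peak_flops / bandwidth


def vector_add_intensity(bytes_per_element=4):
    # c[i] = a[i] + b[i]: 1 FLOP per element, 2 loads + 1 store.
    return 1 / (3 * bytes_per_element)


def matmul_intensity(n, bytes_per_element=4):
    # Idealized n x n matmul: 2*n^3 FLOPs, each matrix moved exactly once.
    return (2 * n**3) / (3 * n * n * bytes_per_element)


if __name__ == "__main__":
    kernels = [
        ("vector add", vector_add_intensity()),
        ("matmul n=4096", matmul_intensity(4096)),
    ]
    for name, ai in kernels:
        bound = "memory" if ai < ridge_point() else "compute"
        tflops = attainable_flops(ai) / 1e12
        print(f"{name}: {ai:.2f} FLOP/byte, at most {tflops:.2f} TFLOP/s ({bound}-bound)")
```

Under these assumed numbers, an element-wise kernel sits far left of the ridge point and is limited by bandwidth, while a large matrix multiply sits to the right and is limited by compute. Real kernels rarely reach the idealized intensity, so the bound is an upper limit, not a prediction.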

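🧠 Architecture & Low-Level Design

NVIDIA Ampere Whitepaper (https://developer.nvidia.com/ampere-architecture)
AMD RDNA and CDNA Architectures (https://gpuopen.com/learn/)
SIMT execution and warp scheduling
Memory hierarchy and coalesced access
Shared memory and cache optimization
Warp divergence and thread occupancy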
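
To make the coalesced-access topic above concrete, the sketch below counts how many aligned memory segments one warp touches for a given access stride. It is a simplified model under stated assumptions: 32 threads per warp, 4-byte elements, and 128-byte aligned segments. Actual transaction granularity differs between GPU architectures, and `segments_touched` is a hypothetical helper, not a CUDA API.

```python
def segments_touched(stride_elems, elem_bytes=4, warp_size=32,
                     segment_bytes=128, base_addr=0):
    """Count distinct aligned segments hit when thread t reads element t * stride.

    Fewer segments means fewer memory transactions for the same useful data.
    All sizes are assumptions of this simplified model.
    """
    addresses = [base_addr + t * stride_elems * elem_bytes for t in range(warp_size)]
    return len({addr // segment_bytes for addr in addresses})


if __name__ == "__main__":
    for stride in (1, 2, 8, 32):
        print(f"stride {stride:>2}: {segments_touched(stride)} segment(s)")
    # Same unit stride, but starting 4 bytes past a segment boundary.
    print(f"misaligned: {segments_touched(1, base_addr=4)} segment(s)")
```

In this model, consecutive threads reading consecutive 4-byte elements fall into a single segment, while a stride of 32 elements spreads the warp over 32 segments, each fetched to use only 4 bytes of it. This is the intuition behind preferring structure-of-arrays layouts and unit-stride indexing in kernels.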

⚙️ Systems & Multi-GPU Engineering

NCCL (https://developer.nvidia.com/nccl) - Multi-GPU communication primitives.
vLLM (https://github.com/vllm-project/vllm) - Inference and serving engine for LLMs.
Hugging Face Accelerate (https://github.com/huggingface/accelerate) - Abstraction layer that simplifies distributed training.
SGLang (https://github.com/sgl-project/sglang)
Prime Intellect (https://github.com/PrimeIntellect-ai/prime-cli)
TensorRT-LLM (https://github.com/NVIDIA/TensorRT-LLM)
TGI by Hugging Face (https://huggingface.co/docs/text-generation-inference/en/index)
Horovod (https://github.com/horovod/horovod) - Distributed deep learning across GPUs.
NVLink and PCIe topology - GPU interconnects and bandwidth optimization.
GPUDirect RDMA (https://developer.nvidia.com/gpudirect) - Zero-copy GPU networking.
Ray Train (https://docs.ray.io/en/latest/train/index.html), DeepSpeed (https://github.com/microsoft/DeepSpeed), Megatron-LM (https://github.com/NVIDIA/Megatron-LM) - Frameworks for large-scale GPU orchestration.
AMD Iris (https://github.com/ROCm/iris) - Open-source multi-GPU programming framework built for compiler-visible performance and optimized multi-GPU execution.
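
All-reduce is the collective that libraries such as NCCL and Horovod use to sum gradients across GPUs. The sketch below simulates one well-known way to implement it, the ring algorithm, in plain Python. It does not use or mirror the NCCL API, and real libraries choose among several algorithms internally. The function name and the list-of-lists representation of per-rank buffers are assumptions made for illustration.

```python
def ring_all_reduce(buffers):
    """Simulate a ring all-reduce (sum) over len(buffers) ranks.

    buffers[r] is rank r's local vector. Every vector must have the same
    length, divisible by the number of ranks. Returns the per-rank results.
    """
    n = len(buffers)
    length = len(buffers[0])
    assert all(len(b) == length for b in buffers), "buffers must match in length"
    assert length % n == 0, "vector length must be divisible by rank count"
    chunk = length // n
    data = [list(b) for b in buffers]

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    def exchange(chunk_for_rank, combine):
        # Snapshot all outgoing chunks first so the sends of one step
        # behave as if they happened simultaneously.
        sends = []
        for r in range(n):
            c = chunk_for_rank(r)
            sends.append((c, [data[r][i] for i in span(c)]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n  # each rank sends to its right neighbour
            for i, value in zip(span(c), payload):
                data[dst][i] = combine(data[dst][i], value)

    # Phase 1, reduce-scatter: after n-1 steps rank r owns the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(lambda r: (r - step) % n, lambda old, new: old + new)

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        exchange(lambda r: (r + 1 - step) % n, lambda old, new: new)

    return data


if __name__ == "__main__":
    ranks = [[1, 2, 3, 4, 5, 6],
             [10, 20, 30, 40, 50, 60],
             [100, 200, 300, 400, 500, 600]]
    for r, result in enumerate(ring_all_reduce(ranks)):
        print(f"rank {r}: {result}")
```

Each rank sends and receives only one chunk per step, so the per-rank traffic is roughly 2 * (n - 1) / n times the vector size regardless of how many ranks participate. That property is why the ring pattern is a common choice for bandwidth-bound gradient synchronization.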

🧪 Tutorials & Courses

CUDA C++ Programming Guide (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)
Triton Tutorials (OpenAI) (https://triton-lang.org/main/getting-started/tutorials/index.html)
FreeCodeCamp's 12-hour CUDA course (https://www.youtube.com/watch?v=86FAWCzIe_4) and its companion repository (https://github.com/infatoshi/cuda-course)
Stanford CS149: Parallel Computing, Fall 2025 (https://gfxcourses.stanford.edu/cs149/fall25/)
CMU 15-418/618: Parallel Computer Architecture and Programming (https://www.cs.cmu.edu/~418/)
MIT 6.5940: TinyML and Efficient Deep Learning Computing (https://hanlab.mit.edu/courses/2024-fall-65940)
GPU MODE video lecture series (https://www.youtube.com/@GPUMODE/videos)
Red Hat vLLM Office Hours video series (https://www.youtube.com/playlist?list=PLbMP1JcGBmSHxp4-lubU5WYmJ9YgAQcf3)
Course by the authors of Programming Massively Parallel Processors (https://www.youtube.com/@pmpp-book/playlists)

📄 Research Papers & Articles

Optimization Techniques for GPU Programming (https://dl.acm.org/doi/pdf/10.1145/3570638) - Hijma, Pieter, et al.
Efficient Multi-GPU Programming in Python: Reducing Synchronization and Access Overheads (https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11186485) - Oden, Lena, and Klaus Nölp
GPU Architecture Evolution (https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9623445&casa_token=Zknb-Go77Y4AAAAA:03tRVI5oLoyDZMx-UZZiWp9h7JRTc-UHNmiHykq2MZWBKNFBwjxEUpuddkX54Z246I6gjDUpdw&tag=1) - Kirk & Hwu
Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision (https://arxiv.org/abs/2205.11913) - Wei Gao, et al.
Optimizing Machine Learning Models with CUDA: A Comprehensive Performance Analysis (https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11064558) - Niteesh, L., and M. B. Ampareeshan
NVIDIA research papers on model parallelism (https://dl.acm.org/doi/pdf/10.1145/3458817.3476209?casa_token=p3epEa_Z4xEAAAAA:fZgVzYD2uMH5NcafdBN9g7EgIbESqB7WsHjL0X6LU2zdm6EdgQkMyIFk0yZAfWGl1o3PeUSB4xhg) and Megatron-LM (https://arxiv.org/pdf/1909.08053)
GPU Virtualization and Multi-Tenant Scheduling (https://dl.acm.org/doi/pdf/10.1145/3068281?casa_token=bbU9Dvrt3vsAAAAA:jxP-NNGr8GEmjOng-EFlb1Rd6wVSQAXg65GTK1jDPlGIkGjNIirMWkDZcjnTw0xDZmLGZ489LwHX)
A Survey of Multi-Tenant Deep Learning Inference on GPU (https://arxiv.org/abs/2203.09040)
Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception (https://www.youtube.com/watch?v=e54BVwcdJ4Y)

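🧰 Tools & Utilities

nvprof and nvvp (legacy), Nsight Systems / Compute - NVIDIA profiling tools.
cuda-memcheck (deprecated in favor of compute-sanitizer), compute-sanitizer - Memory and correctness tools.
GPGPU-Sim (https://github.com/gpgpu-sim/gpgpu-sim), Accel-Sim (https://accel-sim.github.io/) - GPU simulation frameworks.
Ingero (https://github.com/ingero-io/ingero) - eBPF-based causal observability agent for GPUs. Traces CUDA runtime/driver API calls and host kernel events to build causal chains that explain GPU latency. Under 2% overhead, suitable for production.
Perfetto, Nsight UI - Visual analyzers for tracing GPU workloads.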

Learning Tools

Tensara (https://gpuengineering.com)
LeetGPU (https://leetgpu.com/)
GPU MODE Discord (https://discord.gg/FnjEVAhW)
GPU Glossary (https://modal.com/gpu-glossary) - A dictionary of terms related to GPU programming.
Mojo🔥 GPU Puzzles (https://puzzles.modular.com)

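🧑‍🔬 GPUs for AI and ML

PyTorch CUDA extensions - Custom kernels for PyTorch.
JAX + XLA - Compiler-based GPU vectorization.
TensorFlow XLA compiler - Ahead-of-time compilation of GPU graphs.
FlashAttention, FlashConv - Kernel optimization techniques for Transformers.
DeepSpeed, FSDP, Megatron-LM - Distributed training systems.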
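
The key numerical idea behind FlashAttention, listed above, is an online softmax: attention is computed one block of keys at a time while a running maximum and normalizer are maintained, so the full score matrix is never materialized. The sketch below shows that idea for a single query vector in plain Python. It is an illustration under simplifying assumptions (one query, no masking, no GPU tiling), not the actual FlashAttention kernel, and all function names are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: softmax(q.K^T / sqrt(d)) @ V with all scores held at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [_dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, but keys/values are visited block by block (online softmax)."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized weighted sum of values seen so far

    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [_dot(q, k) * scale for k in k_block]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first block this factor is exp(-inf) == 0.0.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[i] for w, v in zip(weights, v_block))
               for i, a in enumerate(acc)]
        running_max = new_max

    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [0.5, -1.0, 2.0]
    keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5],
            [-1.0, 0.5, 2.0], [0.2, 0.2, 0.2]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
    print(naive_attention(q, keys, values))
    print(blockwise_attention(q, keys, values, block_size=2))
```

Because only one block of scores exists at a time, the working set fits in fast on-chip memory in a real kernel, which is where the speedup comes from. The two functions should agree up to floating-point rounding.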

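🧱 Interview Prep: GPU System Design Topics

FlashAttention and PagedAttention
Matrix multiplication operations
GPU scheduling algorithms and runtime systems
Memory oversubscription and unified memory models
GPU cluster resource allocation
GPU virtualization
Kernel fusion and graph execution
Dataflow optimization
Persistent threads model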
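
For the PagedAttention topic above, a common interview exercise is to explain how a KV cache can be split into fixed-size blocks that are mapped through a per-sequence block table, much like virtual memory pages. The sketch below models only that bookkeeping. It is a toy under stated assumptions: the class name, methods, and first-fit allocation policy are hypothetical and do not reflect vLLM's actual implementation or API.

```python
class PagedKVCache:
    """Toy block-table allocator for a paged KV cache (bookkeeping only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for the next token; return (physical_block, offset)."""
        pos = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if pos % self.block_size == 0:
            # Current block is full (or the sequence is new): grab another.
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = pos + 1
        return self.locate(seq_id, pos)

    def locate(self, seq_id, pos):
        """Translate a logical token position to (physical_block, offset)."""
        if pos >= self.seq_lens.get(seq_id, 0):
            raise IndexError("token position not allocated")
        table = self.block_tables[seq_id]
        return table[pos // self.block_size], pos % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4, block_size=2)
    for _ in range(3):
        print("seq a ->", cache.append_token("a"))
    print("seq b ->", cache.append_token("b"))
    cache.free("a")
    print("free blocks:", cache.free_blocks)
```

The design point to discuss is that a sequence's blocks need not be contiguous, so memory is wasted only in the last partially filled block of each sequence, and blocks from finished sequences can be reused immediately.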

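🧑‍💻 Contributors

Contributions are welcome! Please read the contributing guidelines before submitting a pull request.

🧾 License

CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/) - Free to share and adapt, with attribution.

⭐ Acknowledgments

Inspired by: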

Awesome HPC (https://github.com/trevor-vincent/awesome-high-performance-computing)
Awesome Computer Architecture (https://github.com/aalhour/awesome-computer-architecture)
Awesome CUDA (https://github.com/coderonion/awesome-cuda-and-hpc)

"GPU engineering isn't just about writing kernels. It's about understanding how the system works." - Model Craft (https://modelcraft.substack.com/p/fundamentals-of-gpu-engineering)