@0x0SojalSec: Fuck your paid courses, Master GPU engineering for AI systems. From foundational books and CUDA/ROCm programming to low…

X AI KOLs Timeline 07/02/26, 08:27 PM Tools

gpu-engineering cuda rocm ai-acceleration distributed-training high-performance-computing awesome-list

Summary

A curated list of resources for mastering GPU engineering for AI systems, covering CUDA, ROCm, optimization tools, multi-GPU orchestration, and distributed training.

Fuck your paid courses. Master GPU engineering for AI systems: from foundational books and CUDA/ROCm programming to low-level optimization, Nsight tools, multi-GPU orchestration, distributed training, and AI acceleration techniques.

An excellent reference for embedded GPU work or large-scale AI infrastructure. The curated collection covers:

- CUDA & ROCm programming
- Kernel optimization & performance tools
- Multi-GPU systems & distributed training
- Architecture deep dives, Triton, CUTLASS, and more

A goldmine for anyone working on high-performance AI infrastructure, kernel development, or systems-level GPU work.

http://github.com/goabiaryan/awesome-gpu-engineering…

Cached at: 07/03/26, 08:32 AM

goabiaryan/awesome-gpu-engineering

Source: https://github.com/goabiaryan/awesome-gpu-engineering

Awesome GPU Engineering

A curated list of resources for mastering GPU engineering from architecture and kernel programming to large-scale distributed systems and AI acceleration.

📘 Foundational Books

Programming Massively Parallel Processors: A Hands-on Approach - David B. Kirk & Wen-mei W. Hwu. The canonical introduction to CUDA, memory hierarchies, and parallel patterns. Amazon; notes: Abi’s Concise Notes
CUDA by Example - Jason Sanders & Edward Kandrot. A practical introduction to CUDA for beginners. Amazon
The Ultra-Scale Playbook: Training LLMs on GPU Clusters - Hugging Face Web Version

💻 GPU Programming Frameworks

CUDA — NVIDIA’s proprietary GPU programming platform.
- Libraries: cuBLAS, cuDNN
ROCm — AMD’s open compute stack.
OpenCL — Cross-platform parallel computing standard.
SYCL / oneAPI - SYCL is the Khronos Group’s C++ standard for heterogeneous compute; oneAPI is Intel’s toolkit built around it.
Vulkan Compute — Low-level GPU compute API.
Kompute — Higher level general purpose GPU compute framework built on Vulkan.
Metal Performance Shaders - Apple’s library of optimized GPU compute and graphics kernels, built on Metal.
Mojo🔥 - Write like Python, run like C++.

🧩 Optimization and Performance

NVIDIA Nsight Systems — System-wide GPU profiler.
Nsight Compute — Kernel-level performance analysis.
Occupancy Calculator — NVIDIA spreadsheet for kernel configuration.
CUTLASS — CUDA templates for linear algebra subroutines.
TensorRT — High-performance deep learning inference.
OpenAI Triton — Python DSL for writing high-performance GPU kernels.
Helion - A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
Roofline Model — Analytical model to reason about compute/memory bottlenecks.
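As an illustration of the roofline model listed above, here is a minimal Python sketch. It bounds a kernel's attainable throughput by the smaller of peak compute and arithmetic intensity times memory bandwidth. The hardware numbers (10,000 GFLOP/s peak, 1,000 GB/s bandwidth) and the function name `attainable_gflops` are hypothetical values chosen for the example, not figures for any real GPU.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops, bytes_moved):
    """Roofline bound: min(peak compute, arithmetic intensity * bandwidth)."""
    intensity = flops / bytes_moved  # FLOPs per byte of memory traffic
    return min(peak_gflops, intensity * bandwidth_gbs)


# Hypothetical device, chosen only for illustration.
PEAK_GFLOPS = 10_000.0
BANDWIDTH_GBS = 1_000.0

# SAXPY (y = a*x + y) in fp32: 2 FLOPs per element,
# 12 bytes per element (read x, read y, write y).
saxpy_bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, flops=2, bytes_moved=12)

# Ridge point: the intensity above which a kernel becomes compute-bound.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GBS

print(f"SAXPY bound: {saxpy_bound:.1f} GFLOP/s, ridge point: {ridge_point:.1f} FLOP/byte")
```

Under these assumed numbers SAXPY sits far left of the ridge point, so it is memory-bound: faster math units would not help, but moving fewer bytes would.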

🧠 Architecture and Low-Level Design

NVIDIA Ampere Whitepaper
AMD RDNA & CDNA Architectures
SIMT execution and warp scheduling
Memory hierarchy and coalescing
Shared memory and cache optimization
Warp divergence and thread occupancy
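To make the memory coalescing topic above concrete, the following is a simplified Python model, not a hardware simulator. It counts how many memory sectors one warp touches when each thread loads a 4-byte element at a given stride. The 32-thread warp and 32-byte sector size are assumptions of this model, and `sectors_touched` is a hypothetical helper name.

```python
def sectors_touched(stride_elems, warp_size=32, elem_bytes=4, sector_bytes=32):
    """Count distinct memory sectors hit by one warp-wide load.

    Thread `tid` reads the element at index tid * stride_elems.
    Fewer sectors means better-coalesced access.
    """
    addresses = [tid * stride_elems * elem_bytes for tid in range(warp_size)]
    return len({addr // sector_bytes for addr in addresses})


for stride in (1, 2, 8):
    print(f"stride {stride}: {sectors_touched(stride)} sectors")
```

In this model, unit stride packs the warp's 128 bytes into the minimum number of sectors, while a stride of 8 elements puts every thread in its own sector, so most of the fetched bytes are wasted.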

⚙️ Systems and Multi-GPU Engineering

NCCL — Multi-GPU communication primitives.
vLLM - Inference and serving engine for LLMs
Hugging Face Accelerate - Simple abstractions for distributed training
SGLang
Prime Intellect
TensorRT-LLM
TGI by Hugging Face
Horovod — Distributed deep learning across GPUs.
NVLink & PCIe Topology — GPU interconnects and bandwidth optimization.
GPUDirect RDMA — Zero-copy GPU networking.
Ray Train, DeepSpeed, Megatron-LM — Large-scale GPU orchestration frameworks.
Iris by AMD - open-source multi-GPU programming framework built for compiler-visible performance and optimized multi-GPU execution.
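The communication primitives mentioned above include all-reduce, which sums a buffer across every GPU. The sketch below simulates one well-known algorithm for it, ring all-reduce, in pure Python. It does not use or mirror the NCCL API; the function `ring_all_reduce` and the list-of-lists representation of per-rank buffers are assumptions made for illustration.

```python
def ring_all_reduce(rank_data):
    """Simulate ring all-reduce (sum) over a list of per-rank vectors."""
    n = len(rank_data)
    length = len(rank_data[0])
    assert length % n == 0, "vector length must divide evenly into n chunks"
    size = length // n
    data = [list(v) for v in rank_data]  # work on copies

    def chunk(c):
        return slice(c * size, (c + 1) * size)

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the
    # fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [((r - step) % n, data[r][chunk((r - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            data[dst][chunk(c)] = [a + b for a, b in zip(data[dst][chunk(c)], payload)]

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, data[r][chunk((r + 1 - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            data[(r + 1) % n][chunk(c)] = payload

    return data


inputs = [[rank * 10 + i for i in range(8)] for rank in range(4)]
result = ring_all_reduce(inputs)
expected = [sum(column) for column in zip(*inputs)]
print(result[0] == expected)
```

Each rank only ever talks to its neighbor and sends one chunk per step, which is why the per-rank traffic stays roughly constant as the ring grows.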

🧪 Tutorials and Courses

📄 Research Papers and Articles

Optimization techniques for GPU programming - Hijma, Pieter, et al.
Efficient Multi-GPU Programming in Python: Reducing Synchronization and Access Overheads - Oden, Lena, and Klaus Nölp
Evolving GPU Architecture — Kirk & Hwu
Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision - Wei Gao et al
Optimizing Machine Learning Models with CUDA: A Comprehensive Performance Analysis - Niteesh, L., and M. B. Ampareeshan
NVIDIA Research Papers on Model Parallelism and Megatron-LM
GPU Virtualization and Multi-Tenant Scheduling
A Survey of Multi-Tenant Deep Learning Inference on GPU
Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception

🧰 Tools and Utilities

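nvprof, nvvp, Nsight Systems / Compute - NVIDIA profiling tools. nvprof and nvvp are legacy tools; Nsight Systems and Nsight Compute are their successors.
cuda-memcheck, compute-sanitizer - Memory and correctness tools. compute-sanitizer replaces the deprecated cuda-memcheck.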
GPGPU-Sim, Accel-Sim — GPU simulation frameworks.
Ingero — eBPF-based GPU causal observability agent. Traces CUDA Runtime/Driver APIs and host kernel events to build causal chains explaining GPU latency. <2% overhead, production-safe.
Perfetto, Nsight UI — Visual profilers for tracing GPU workloads.

Learning Tools

Tensara
LeetGPU
GPU MODE Discord
GPU Glossary - A dictionary of terms related to programming GPUs
Mojo🔥 GPU Puzzles

🧑‍🔬 GPU for AI & ML

PyTorch CUDA Extensions — Custom kernels for PyTorch.
JAX + XLA - JIT compilation of array programs for GPUs through the XLA compiler.
TensorFlow XLA Compiler - Graph compilation and operator fusion for GPU, mainly just-in-time.
FlashAttention, FlashConv — Kernel optimization techniques for transformers.
DeepSpeed, FSDP, Megatron-LM — Distributed training systems.

🧱 GPU Systems Design Topics For Interview Prep

FlashAttention and PagedAttention
Matmul Operations
GPU scheduling algorithms and runtime systems.
Memory oversubscription and unified memory models.
Resource allocation in GPU clusters.
GPU virtualization
Kernel fusion and graph execution
Dataflow optimization
Persistent threads model
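For the FlashAttention topic above, a frequent interview question is how softmax can be computed block by block without first seeing every score. The sketch below shows that online-softmax idea in plain Python for a single query with scalar values. It is a teaching sketch under those simplifying assumptions, not the FlashAttention kernel, and the function names and block size are hypothetical.

```python
import math


def attention_reference(scores, values):
    """Softmax-weighted sum computed in one pass over all scores."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_online(scores, values, block=2):
    """Same result, processing one block at a time with running statistics."""
    running_max = float("-inf")
    denom = 0.0  # running softmax denominator
    acc = 0.0    # running weighted sum of values
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial results to the new maximum.
        scale = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in s_blk]
        denom = denom * scale + sum(probs)
        acc = acc * scale + sum(p * v for p, v in zip(probs, v_blk))
        running_max = new_max
    return acc / denom


scores = [0.5, 2.0, -1.0, 3.5, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(abs(attention_online(scores, values) - attention_reference(scores, values)) < 1e-12)
```

Because only the running maximum, denominator, and accumulator are carried between blocks, the full score vector never needs to be held at once, which is what lets a tiled kernel keep its working set in fast on-chip memory.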

🧑‍💻 Contributors

Contributions welcome! Please read the contribution guidelines before submitting a pull request.

🧾 License

CC BY 4.0 — feel free to share and adapt with attribution.

⭐ Acknowledgements

Inspired by:

“GPU engineering is not just about writing kernels. It’s about understanding how systems work.” — Model Craft