@0x0SojalSec: Fuck your paid courses, Master GPU engineering for AI systems. From foundational books and CUDA/ROCm programming to low…
Summary
A curated list of resources for mastering GPU engineering for AI systems, covering CUDA, ROCm, optimization tools, multi-GPU orchestration, and distributed training.
Cached at: 07/03/26, 08:32 AM
Fuck your paid courses, Master GPU engineering for AI systems.
From foundational books and CUDA/ROCm programming to low-level optimization, Nsight tools, multi-GPU orchestration, distributed training and AI acceleration techniques.
An excellent reference for embedded GPU work or large-scale AI infrastructure. The curated collection covers:
- CUDA & ROCm programming
- Kernel optimization & performance tools
- Multi-GPU systems & distributed training
- Architecture deep dives, Triton, CUTLASS, and more
A goldmine for anyone working on high-performance AI infrastructure, kernel development, or systems-level GPU work.
- http://github.com/goabiaryan/awesome-gpu-engineering…
Source: https://github.com/goabiaryan/awesome-gpu-engineering
Awesome GPU Engineering 
A curated list of resources for mastering GPU engineering from architecture and kernel programming to large-scale distributed systems and AI acceleration.
📘 Foundational Books
- Programming Massively Parallel Processors: A Hands-on Approach - David B. Kirk & Wen-mei W. Hwu
The canonical introduction to CUDA, memory hierarchies, and parallel patterns. Amazon, notes: Abi’s Concise Notes
- CUDA by Example - Jason Sanders & Edward Kandrot
A practical introduction to CUDA for beginners. Amazon
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters - Hugging Face
Web Version
💻 GPU Programming Frameworks
- CUDA — NVIDIA’s proprietary GPU programming platform.
- ROCm — AMD’s open compute stack.
- OpenCL — Cross-platform parallel computing standard.
- SYCL / oneAPI - Khronos’s C++ standard for heterogeneous compute (SYCL) and Intel’s toolkit built around it (oneAPI).
- Vulkan Compute — Low-level GPU compute API.
- Kompute — Higher level general purpose GPU compute framework built on Vulkan.
- Metal Performance Shaders — Apple’s GPU framework.
- Mojo🔥 - Write like Python, run like C++.
🧩 Optimization and Performance
- NVIDIA Nsight Systems — System-wide GPU profiler.
- Nsight Compute — Kernel-level performance analysis.
- Occupancy Calculator — NVIDIA spreadsheet for kernel configuration.
- CUTLASS — CUDA templates for linear algebra subroutines.
- TensorRT — High-performance deep learning inference.
- OpenAI Triton — Python DSL for writing high-performance GPU kernels.
- Helion - A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
- Roofline Model — Analytical model to reason about compute/memory bottlenecks.
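The Roofline Model entry above comes down to one formula: attainable throughput is the smaller of the compute peak and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). Below is a minimal Python sketch of that bound. The device numbers (`PEAK_GFLOPS`, `PEAK_BANDWIDTH_GBS`) are hypothetical and describe no real GPU, the function names are made up for illustration, and the matmul intensity assumes each matrix crosses the memory bus exactly once.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Intensity (FLOP/byte) at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


def vector_add_intensity(elem_bytes: int = 4) -> float:
    """One add per element; two loads and one store per element."""
    return 1.0 / (3 * elem_bytes)


def matmul_intensity(n: int, elem_bytes: int = 4) -> float:
    """N x N matmul: 2*N^3 FLOPs over 3*N^2 elements moved.

    Assumption: ideal reuse, so A, B and C each cross the bus once.
    """
    return (2.0 * n**3) / (3.0 * n**2 * elem_bytes)


# Hypothetical device, chosen only to make the arithmetic easy to follow.
PEAK_GFLOPS = 10_000.0
PEAK_BANDWIDTH_GBS = 1_000.0

if __name__ == "__main__":
    ridge = ridge_point(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
    for name, ai in [("vector add", vector_add_intensity()),
                     ("matmul N=1024", matmul_intensity(1024))]:
        bound = attainable_gflops(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, ai)
        regime = "memory-bound" if ai < ridge else "compute-bound"
        print(f"{name}: {ai:.3f} FLOP/byte, bound {bound:.1f} GFLOP/s ({regime})")
```

Kernels whose intensity falls left of the ridge point gain more from reducing memory traffic (fusion, tiling, lower precision) than from adding arithmetic throughput.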
🧠 Architecture and Low-Level Design
- NVIDIA Ampere Whitepaper
- AMD RDNA & CDNA Architectures
- SIMT execution and warp scheduling
- Memory hierarchy and coalescing
- Shared memory and cache optimization
- Warp divergence and thread occupancy
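To make the memory coalescing topic above concrete, the following Python sketch counts how many memory sectors one warp touches when each thread loads a single element at a given stride. It is a simplified model, not a hardware simulator: the 32-thread warp, 32-byte sector and 4-byte element sizes are assumptions, real behavior varies by architecture and cache configuration, and the function names are hypothetical.

```python
WARP_SIZE = 32      # assumed threads per warp
SECTOR_BYTES = 32   # assumed memory transaction granularity
ELEM_BYTES = 4      # float32 elements


def sectors_touched(stride_elems: int, base_addr: int = 0) -> int:
    """Distinct sectors hit when thread t loads element t * stride_elems."""
    addresses = [base_addr + tid * stride_elems * ELEM_BYTES
                 for tid in range(WARP_SIZE)]
    return len({addr // SECTOR_BYTES for addr in addresses})


def load_efficiency(stride_elems: int) -> float:
    """Useful bytes requested divided by bytes actually moved."""
    useful = WARP_SIZE * ELEM_BYTES
    moved = sectors_touched(stride_elems) * SECTOR_BYTES
    return useful / moved


if __name__ == "__main__":
    for stride in (1, 2, 8, 32):
        print(f"stride {stride:2d}: {sectors_touched(stride):2d} sectors, "
              f"efficiency {load_efficiency(stride):.3f}")
```

Under these assumptions a unit-stride access needs 4 sectors for the whole warp, while any stride of 8 elements or more needs one sector per thread, so most of the bytes fetched are discarded. This is why structure-of-arrays layouts are often preferred over array-of-structures in GPU kernels.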
⚙️ Systems and Multi-GPU Engineering
- NCCL — Multi-GPU communication primitives.
- vLLM - Inference and serving engine for LLMs
- Hugging Face Accelerate - Simple abstractions for distributed training
- SGLang
- Prime Intellect
- TensorRT-LLM
- TGI by Hugging Face
- Horovod — Distributed deep learning across GPUs.
- NVLink & PCIe Topology — GPU interconnects and bandwidth optimization.
- GPUDirect RDMA — Zero-copy GPU networking.
- Ray Train, DeepSpeed, Megatron-LM — Large-scale GPU orchestration frameworks.
- Iris by AMD - Open-source multi-GPU programming framework built for compiler-visible performance and optimized multi-GPU execution.
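Collective communication libraries such as NCCL expose an all-reduce primitive, and one classic way to implement it is the ring algorithm: a reduce-scatter pass followed by an all-gather pass around a ring of ranks. The sketch below simulates that pattern with plain Python lists. It does not call NCCL or touch a GPU, the function name `ring_all_reduce` is hypothetical, and it assumes the vector length divides evenly by the number of ranks.

```python
def ring_all_reduce(rank_data):
    """Simulate a sum all-reduce over a ring; returns each rank's final buffer."""
    p = len(rank_data)
    n = len(rank_data[0])
    assert n % p == 0, "sketch assumes length divisible by number of ranks"
    chunk = n // p
    bufs = [list(v) for v in rank_data]  # work on copies

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod p
    # to rank r + 1, which adds it into its own copy of that chunk.
    for step in range(p - 1):
        # Snapshot outgoing chunks so all sends in a step happen "at once".
        outgoing = [bufs[r][span((r - step) % p)] for r in range(p)]
        for r in range(p):
            dst = (r + 1) % p
            s = span((r - step) % p)
            bufs[dst][s] = [a + b for a, b in zip(bufs[dst][s], outgoing[r])]

    # Rank r now holds the fully reduced chunk (r + 1) mod p.
    # Phase 2: all-gather. In step s, rank r forwards chunk (r + 1 - s) mod p.
    for step in range(p - 1):
        outgoing = [bufs[r][span((r + 1 - step) % p)] for r in range(p)]
        for r in range(p):
            dst = (r + 1) % p
            bufs[dst][span((r + 1 - step) % p)] = outgoing[r]

    return bufs


if __name__ == "__main__":
    data = [[1, 2, 3, 4], [10, 20, 30, 40],
            [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
    for rank, buf in enumerate(ring_all_reduce(data)):
        print(rank, buf)
```

Each rank sends 2(P-1) chunks of N/P elements, so per-rank traffic stays close to 2N regardless of the number of ranks. That property is what makes the ring pattern bandwidth-efficient on links such as NVLink or PCIe.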
🧪 Tutorials and Courses
- CUDA C++ Programming Guide
- Triton Tutorials (OpenAI)
- CUDA in 12 hours by FreeCodeCamp and Video Repo
- Stanford CS149: Parallel Computing, Fall 2025
- CMU 15-418/618: Parallel Computer Architecture & Programming
- MIT 6.5940: TinyML and Efficient Deep Learning Computing
- GPU MODE video lecture series
- Red Hat vLLM Office Hours video series
- Courses taught by the authors of Programming Massively Parallel Processors
📄 Research Papers and Articles
- Optimization techniques for GPU programming - Hijma, Pieter, et al.
- Efficient Multi-GPU Programming in Python: Reducing Synchronization and Access Overheads - Oden, Lena, and Klaus Nölp
- Evolving GPU Architecture — Kirk & Hwu
- Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision - Wei Gao et al
- Optimizing Machine Learning Models with CUDA: A Comprehensive Performance Analysis - Niteesh, L., and M. B. Ampareeshan
- NVIDIA Research Papers on Model Parallelism and Megatron-LM
- GPU Virtualization and Multi-Tenant Scheduling
- A Survey of Multi-Tenant Deep Learning Inference on GPU
- Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception
🧰 Tools and Utilities
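- nvprof, nvvp, Nsight Systems / Compute - NVIDIA profiling tools (nvprof and nvvp are legacy, superseded by the Nsight tools).
- cuda-memcheck, compute-sanitizer - Memory and correctness tools (cuda-memcheck is deprecated in favor of compute-sanitizer).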
- GPGPU-Sim, Accel-Sim — GPU simulation frameworks.
- Ingero — eBPF-based GPU causal observability agent. Traces CUDA Runtime/Driver APIs and host kernel events to build causal chains explaining GPU latency. <2% overhead, production-safe.
- Perfetto, Nsight UI — Visual profilers for tracing GPU workloads.
Learning Tools
- Tensara
- LeetGPU
- GPU MODE Discord
- GPU Glossary - A dictionary of terms related to programming GPUs
- Mojo🔥 GPU Puzzles
🧑‍🔬 GPU for AI & ML
- PyTorch CUDA Extensions — Custom kernels for PyTorch.
- JAX + XLA - Traced array programs, JIT-compiled by XLA into fused GPU kernels.
- TensorFlow XLA Compiler - Compiles TensorFlow graphs into fused GPU kernels, typically just-in-time.
- FlashAttention, FlashConv — Kernel optimization techniques for transformers.
- DeepSpeed, FSDP, Megatron-LM — Distributed training systems.
🧱 GPU Systems Design Topics For Interview Prep
- FlashAttention and PagedAttention
- Matmul Operations
- GPU scheduling algorithms and runtime systems
- Memory oversubscription and unified memory models
- Resource allocation in GPU clusters
- GPU virtualization
- Kernel fusion and graph execution
- Dataflow optimization
- Persistent threads model
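For the FlashAttention topic above, a common interview point is how softmax can be computed block by block without materializing a full row of attention scores. The trick is an "online" softmax: keep a running maximum and a running normalizer, and rescale the partial output whenever the maximum changes. The sketch below shows this for a single query vector in plain Python. It omits the 1/sqrt(d) scaling, masking, and all GPU tiling details, and the function names are hypothetical.

```python
import math


def attention_reference(q, keys, values):
    """Ordinary softmax attention for one query (full score row in memory)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def attention_streaming(q, keys, values, block_size=2):
    """Same result, visiting keys/values one block at a time."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first block this factor is exp(-inf) == 0.0.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [0.5, -1.0]
    keys = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [2.0, 2.0], [-1.0, 0.5]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
    print(attention_reference(q, keys, values))
    print(attention_streaming(q, keys, values, block_size=2))
```

Because only one block of scores exists at a time, the working set can stay in fast on-chip memory. That is the motivation usually given for the FlashAttention family of kernels.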
🧑‍💻 Contributors
Contributions welcome! Please read the contribution guidelines before submitting a pull request.
🧾 License
CC BY 4.0 — feel free to share and adapt with attribution.
⭐ Acknowledgements
Inspired by:
“GPU engineering is not just about writing kernels. It’s about understanding how systems work.” — Model Craft