A curated list of resources for mastering GPU engineering for AI systems, covering CUDA, ROCm, optimization tools, multi-GPU orchestration, and distributed training.
A curated GitHub list of resources for learning GPU engineering, covering architecture, kernel programming, optimization, distributed systems, and AI acceleration with books, frameworks, profilers, and interview prep.
A setup guide for using a custom Docker/Podman toolbox with ROCm/RCCL RDMA support to cluster two AMD Strix Halo nodes, enabling vLLM with tensor parallelism across 256GB unified memory.
A discussion questioning why LLMs haven't helped ROCm and Intel's software ecosystems catch up to CUDA, highlighting NVIDIA's premium pricing and the need for genuine market competition.
The NPU on AMD Strix Halo devices is now usable for AI inference, enabling a hybrid mode that combines the NPU and iGPU for faster prompt processing. Tools like Lemonade and AMD's ROCm software make this possible.
A comparison of the ROCm and Vulkan backends and vLLM on dual AMD Radeon 9700 GPUs, likely benchmarking performance for large language models.
A user benchmarks a modded AMD V620 GPU flashed with W6800 firmware and a custom blower fan for running LLMs via Vulkan and ROCm backends, comparing performance on Qwen2.5-27B at various quantization levels.
vLLM v0.22.0 has been released with 459 commits, featuring DeepSeek V4 hardening, an experimental Rust frontend, and batch-invariant Cutlass FP8, reducing end-to-end latency by 28.9%.
llama.cpp version b9387 introduces MFMA support for AMD CDNA architecture (MI100, MI200, MI300 series), improving processing pipeline performance on datacenter AMD GPUs.
A rejected PR for llama.cpp provides up to 30% faster prompt processing for MoE models on AMD Strix Halo hardware, with gains diminishing at higher context lengths.
A conversion of the Qwen 3.6 35B A3B model to ROCmfp4 format, leveraging MTP for improved performance on AMD hardware.
A fork of llama.cpp that integrates TurboQuant+ for advanced KV-cache and weight quantization, with cross-backend kernel support (Apple Silicon, NVIDIA CUDA, AMD ROCm, Vulkan). It is used in production by LocalAI, Chronara, and AtomicChat.
This repository provides practical testing profiles and benchmarks for running local LLMs on 16GB AMD Radeon GPUs using llama.cpp with ROCm/HIP, focusing on real-world performance metrics like context length and KV cache settings.
A custom binary workaround enables flash attention on AMD RDNA2 GPUs for llama.cpp, doubling inference speed (70-80 tok/s vs stock crash). It is only confirmed to work with Qwen3.6 35B/27B.
Lemonade v10.5.1 adds MTP support and ROCm 7.13 quick start for Strix Halo, along with a Fedora 43 fix.
AMD's ROCm 7.13 tech preview adds optimizations for Strix Halo (Ryzen AI Max 300) and open-sources the ROCprof Trace Decoder.
A technical benchmark comparing the ROCm and Vulkan backends for LLM inference on Strix Halo hardware after MTP was merged into llama.cpp, revealing that ROCm suffers severe performance drops at full context while Vulkan remains stable.
vLLM releases version 0.21.1rc0 with a focus on ROCm CI gating improvements.
A user reports that llama.cpp with ROCm consumes significantly more VRAM for the KV cache than the Vulkan backend, despite identical model and settings, prompting investigation into potential causes.
A developer gets TurboQuant TBQ4 KV cache and Multi-Token Prediction working on AMD ROCm for RDNA3 GPUs in llama.cpp, enabling 64k context on 24 GB VRAM with competitive token rates.