rocm

#rocm

@0x0SojalSec: Fuck your paid courses, Master GPU engineering for AI systems. From foundational books and CUDA/ROCm programming to low…

X AI KOLs Timeline ↗ · 2d ago Cached

A curated list of resources for mastering GPU engineering for AI systems, covering CUDA, ROCm, optimization tools, multi-GPU orchestration, and distributed training.

@DanKornas: GPU engineering is too broad to learn from random tabs. Awesome GPU Engineering is a curated GitHub list of resources f…

X AI KOLs Timeline ↗ · 6d ago Cached

A curated GitHub list of resources for learning GPU engineering, covering architecture, kernel programming, optimization, distributed systems, and AI acceleration with books, frameworks, profilers, and interview prep.

AMD Strix Halo RDMA Cluster Setup Guide

Hacker News Top ↗ · 2026-06-28 Cached

A setup guide for using a custom Docker/Podman toolbox with ROCm/RCCL RDMA support to cluster two AMD Strix Halo nodes, enabling vLLM with tensor parallelism across 256 GB of unified memory.

If LLMs are so good at coding…

Reddit r/LocalLLaMA ↗ · 2026-06-25

A discussion questioning why LLMs haven't helped ROCm and Intel's software ecosystems catch up to CUDA, highlighting NVIDIA's premium pricing and the need for genuine market competition.

Big News for AMD / Strix Halo+ Owners

Reddit r/LocalLLaMA ↗ · 2026-06-24

The NPU on AMD Strix Halo devices is now usable for AI inference, enabling hybrid mode that combines NPU and iGPU for faster prompt processing. Tools like Lemonade and AMD's ROCm software make this possible.

ROCm vs Vulkan vs vLLM on Dual R9700's

Reddit r/LocalLLaMA ↗ · 2026-06-21

A comparison of the ROCm and Vulkan backends and vLLM for AI inference on dual AMD R9700 GPUs, likely benchmarking performance for large language models.

Benchmarks from the latest eBay special: W6800 (modded V620)

Reddit r/LocalLLaMA ↗ · 2026-06-17

A user benchmarks a modded AMD V620 GPU flashed with W6800 firmware and a custom blower fan for running LLMs via Vulkan and ROCm backends, comparing performance on Qwen2.5-27B at various quantization levels.

@vllm_project: vLLM v0.22.0 is out! 459 commits from 230 contributors (63 new). Highlights: DeepSeek V4 hardening (NVFP4 fused MoE, fu…

X AI KOLs Timeline ↗ · 2026-05-30 Cached

vLLM v0.22.0 released with 459 commits, featuring DeepSeek V4 hardening, experimental Rust frontend, and batch-invariant Cutlass FP8, reducing end-to-end latency by 28.9%.

llama.cpp B9387 Significant AMD/ROCm PP Update

Reddit r/LocalLLaMA ↗ · 2026-05-29

llama.cpp build b9387 introduces MFMA support for the AMD CDNA architecture (MI100, MI200, MI300 series), improving prompt processing (PP) performance on datacenter AMD GPUs.

Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.

Reddit r/LocalLLaMA ↗ · 2026-05-26

A rejected llama.cpp PR provides up to 30% faster prompt processing for MoE models on AMD Strix Halo hardware, with gains diminishing at higher context lengths.

@Italianclownz: Converted Qwen 3.6 35b a3b to ROCmfp4 and this is flying. Used the mtp version bc this ROCmfp4 can also incorporate the…

X AI KOLs Timeline ↗ · 2026-05-24 Cached

The author converted the MTP version of the Qwen 3.6 35b a3b model to ROCmfp4 format and reports fast performance on AMD hardware.

@no_stp_on_snek: got it here if ya want to try it out:

X AI KOLs Following ↗ · 2026-05-23 Cached

A fork of llama.cpp integrating TurboQuant+ for advanced KV-cache and weight quantization, with cross-backend kernel support (Apple Silicon, NVIDIA CUDA, AMD ROCm, Vulkan) and used in production by LocalAI, Chronara, and AtomicChat.

club-rdna16: practical 16GB AMD/Radeon local LLM testing repo

Reddit r/LocalLLaMA ↗ · 2026-05-23

This repository provides practical testing profiles and benchmarks for running local LLMs on 16GB AMD Radeon GPUs using llama.cpp with ROCm/HIP, focusing on real-world performance metrics like context length and KV cache settings.

RDNA2 flash attention isn’t enabled stock, I enabled it with this build and doubled my speed

Reddit r/LocalLLaMA ↗ · 2026-05-19

A custom llama.cpp build enables flash attention on AMD RDNA2 GPUs, where it is not enabled by default, reportedly doubling inference speed (70-80 tok/s, versus a crash on the stock build). Only confirmed working with Qwen3.6 35B/27B.

Lemonade v10.5.1: an MTP + ROCm 7.13 quick start for Strix Halo

Reddit r/LocalLLaMA ↗ · 2026-05-18

Lemonade v10.5.1 adds MTP support and a ROCm 7.13 quick start for Strix Halo, along with a Fedora 43 fix.

ROCm 7.13 nightly adds strix halo optimizations

Reddit r/LocalLLaMA ↗ · 2026-05-17

AMD's ROCm 7.13 tech preview adds optimizations for Strix Halo (Ryzen AI Max 300) and open-sources the ROCprof Trace Decoder.

Strix Halo ROCm + MTP Notes (May 2026)

Reddit r/LocalLLaMA ↗ · 2026-05-17

Technical benchmark comparing ROCm and Vulkan backends for LLM inference on Strix Halo hardware after MTP merged into llama.cpp, revealing ROCm suffers severe performance drops at full context while Vulkan remains stable.

vllm-project/vllm v0.21.1rc0: [ROCm][CI] Stage B gating (#42025)

GitHub Releases Watchlist ↗ · 2026-05-15 Cached

vLLM releases version 0.21.1rc0 with a focus on ROCm CI gating improvements.

Linux - Why does llama.cpp ROCm consume SO much VRAM for KV cache compared to Vulkan?

Reddit r/LocalLLaMA ↗ · 2026-05-14

A user reports that llama.cpp with ROCm consumes significantly more VRAM for the KV cache than the Vulkan backend, despite identical model and settings, prompting investigation into potential causes.

Turboquant+MTP for ROCm(Llama CPP)

Reddit r/LocalLLaMA ↗ · 2026-05-14

A developer gets TurboQuant TBQ4 KV cache and Multi-Token Prediction working on AMD ROCm for RDNA3 GPUs in llama.cpp, enabling 64k context on 24 GB VRAM with competitive token rates.
