llama.cpp b9158 has been released with a fix for Flash Attention on RDNA3 GPUs, improving performance for AMD users.
A user shares a fix for performance bottlenecks when running AI models on AMD GPUs under Windows 11: disabling Windows memory compression with the PowerShell command `Disable-MMAgent -mc`.
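A minimal sketch of that workaround, assuming Windows 11 and an elevated (Administrator) PowerShell session; `-mc` is the short form of the cmdlet's `-MemoryCompression` parameter:

```powershell
# Check whether memory compression is currently enabled (look at the
# MemoryCompression field in the output):
Get-MMAgent

# Disable the memory-compression store (-mc abbreviates -MemoryCompression):
Disable-MMAgent -MemoryCompression

# A reboot is typically needed for the change to take full effect.
# To revert the change later:
# Enable-MMAgent -MemoryCompression
```

Disabling memory compression is a system-wide setting, so weigh the inference speedup against potentially higher memory pressure for other workloads.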
A user posts benchmark results for Qwen 3.6 27B on AMD MI50 GPUs using a custom vLLM fork, achieving 52.8 tokens/s token generation (TG) and 1569 tokens/s prompt processing (PP) without quantization or MTP, showing that 2018-era hardware remains usable for agentic tasks.
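For context, TG and PP throughput figures like those above are simply token counts divided by the wall-clock time of each phase. A small sketch with hypothetical raw values (the token counts and timings below are illustrative assumptions, not the benchmark's actual measurements):

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput for one phase of inference: tokens processed per second."""
    return tokens / seconds

# Prompt processing (PP): e.g. an 8192-token prompt ingested in ~5.22 s
pp = tokens_per_second(8192, 5.22)

# Token generation (TG): e.g. 512 output tokens produced in ~9.7 s
tg = tokens_per_second(512, 9.7)

print(f"PP: {pp:.0f} tok/s, TG: {tg:.1f} tok/s")
# prints "PP: 1569 tok/s, TG: 52.8 tok/s"
```

PP is typically far higher than TG because prompt tokens are processed in parallel batches, while generation is sequential, one token at a time.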
A user discusses the trade-offs between using vLLM and llama.cpp for local, single-user inference on AMD hardware, questioning if vLLM's performance benefits justify the complexity in non-enterprise settings.
A new toolset (DFlash + PFlash) achieves 2.5x faster inference than llama.cpp on the AMD Ryzen AI MAX+ 395 iGPU, demonstrating significant speedups for Qwen3.6-27B with 128 GiB of unified memory.
The author asks about the current viability of AMD's ROCm ecosystem for AI training in mid-2026, comparing it to NVIDIA's CUDA and asking if it has reached a 'just works' stage for PyTorch.
A user demonstrates Qwen 3.6 running autonomously on an AMD 7900 XTX GPU, building an Android app entirely locally, which the poster describes as science fiction made real.