Tag
vLLM v0.22.0 released with 459 commits, featuring DeepSeek V4 hardening, experimental Rust frontend, and batch-invariant Cutlass FP8, reducing end-to-end latency by 28.9%.
llama.cpp version b9387 introduces MFMA support for AMD CDNA architecture (MI100, MI200, MI300 series), improving processing pipeline performance on datacenter AMD GPUs.
A rejected PR for llama.cpp provides up to 30% faster prompt processing for MOE models on AMD Strix Halo hardware, with gains diminishing at higher context lengths.
Converted the Qwen 3.6 35b a3b model to ROCmfp4 format, leveraging MTP benefits for improved performance on AMD hardware.
A fork of llama.cpp integrating TurboQuant+ for advanced KV-cache and weight quantization, with cross-backend kernel support (Apple Silicon, NVIDIA CUDA, AMD ROCm, Vulkan) and used in production by LocalAI, Chronara, and AtomicChat.
This repository provides practical testing profiles and benchmarks for running local LLMs on 16GB AMD Radeon GPUs using llama.cpp with ROCm/HIP, focusing on real-world performance metrics like context length and KV cache settings.
Custom binary workaround enables flash attention on AMD RDNA2 GPUs for llama.cpp, doubling inference speed (70-80 tok/s vs stock crash). Only confirmed working with Qwen3.6 35B/27B.
Lemonade v10.5.1 adds MTP support and ROCm 7.13 quick start for Strix Halo, along with a Fedora 43 fix.
AMD's ROCm 7.13 tech preview adds optimizations for Strix Halo (Ryzen AI Max 300) and open-sources the ROCprof Trace Decoder.
Technical benchmark comparing ROCm and Vulkan backends for LLM inference on Strix Halo hardware after MTP merged into llama.cpp, revealing ROCm suffers severe performance drops at full context while Vulkan remains stable.
vLLM releases version 0.21.1rc0 with a focus on ROCm CI gating improvements.
A user reports that llama.cpp with ROCm consumes significantly more VRAM for the KV cache than the Vulkan backend, despite identical model and settings, prompting investigation into potential causes.
A developer gets TurboQuant TBQ4 KV cache and Multi-Token Prediction working on AMD ROCm for RDNA3 GPUs in llama.cpp, enabling 64k context on 24 GB VRAM with competitive token rates.
This article provides a tutorial on fine-tuning Large Language Models (LLMs) using AMD Strix Halo hardware, covering both Linux and native Windows environments with SFT and LoRA methods.
Lemonade has added an experimental ROCm backend for vLLM, allowing users to easily run safetensors LLMs on AMD GPUs with a simple command.
The author asks about the current viability of AMD's ROCm ecosystem for AI training in mid-2026, comparing it to NVIDIA's CUDA and asking if it has reached a 'just works' stage for PyTorch.