Tag
A new packed16 K technique for llama.cpp on RDNA3 GPUs reduces KV cache VRAM use by 47% compared to Vulkan fp16. It uses int8 packing and native dot4 instructions to keep K values at fp16 quality with minimal KL-divergence (KLD) degradation.
hipEngine is a new open-source, ROCm-native LLM inference engine for AMD RDNA3 GPUs, with prefill and decode performance on Qwen 3.6 models that is competitive with llama.cpp.
llama.cpp b9158 has been released with a fix for Flash Attention on RDNA3 GPUs, improving performance for AMD users.
A developer has gotten the TurboQuant TBQ4 KV cache and Multi-Token Prediction working in llama.cpp on AMD ROCm for RDNA3 GPUs, enabling 64k context in 24 GB of VRAM at competitive token rates.