Tag
An Android phone is repurposed as a portable GGUF inference server with Vulkan acceleration, exposing an OpenAI-compatible endpoint via LiteLLM and Tailscale mesh for integration into a self-hosted AI cluster.
A fork of llama.cpp that integrates TurboQuant+ for advanced KV-cache and weight quantization, with kernel support across backends (Apple Silicon, NVIDIA CUDA, AMD ROCm, Vulkan). It is used in production by LocalAI, Chronara, and AtomicChat.
A user benchmarks the APEX-quantized version of the Gemma4 26B A4B model on an AMD RX 9060 XT, reporting 38 tokens per second at 90k context with no quality degradation, and finds it better than previous quantizations.
A user successfully set up a dual-GPU llama.cpp server with 48 GB of VRAM, using an AMD Radeon PRO and a 7800 XT via Vulkan in Docker on Kubuntu 24.04.
A technical benchmark compares the ROCm and Vulkan backends for LLM inference on Strix Halo hardware after MTP was merged into llama.cpp. It finds that ROCm suffers severe performance drops at full context, while Vulkan remains stable.
A user reports that llama.cpp with ROCm consumes significantly more VRAM for the KV cache than the Vulkan backend, despite identical model and settings, prompting investigation into potential causes.
User @binsquares reports that GPU acceleration on smolvm achieves nearly 90% of host performance when running llama.cpp via the Vulkan backend.
Neon Sovereign is a native C++20/Vulkan autonomous software development workstation that uses a multi-agent swarm to execute software briefs end-to-end, running local LLM weights via Ollama/GGUF with no cloud dependency. The creator is seeking systems engineers and early testers as it enters Active Alpha.
A community benchmark shows the Intel Arc Pro B70 averaging ~71% slower prompt processing and ~54% slower token generation than an RTX 3090 under llama.cpp, with the SYCL backend sometimes beating Vulkan on the same card.