Can't believe I got it working! Dual GPU - 48gb VRAM llama-cpp server - R7900 + 7800XT
Summary
A user set up a dual-GPU llama-cpp server with 48GB of combined VRAM, pairing an AMD Radeon PRO card with a Radeon RX 7800 XT and running it via Vulkan in Docker on Kubuntu 24.04.
Similar Articles
@leopardracer: https://x.com/leopardracer/status/2055341758523883631
A user shares their experience setting up a dual-GPU local AI lab with an RTX 4080 Super and a 5060 Ti, running Qwen 3.6 models via llama.cpp and llama-swap to reduce API costs and enable unrestricted experimentation.
2 old RTX 2080 Ti with 22GB vram each Qwen3.6 27B at 38 token/s with f16 kv cache
A user shares their setup using two modded RTX 2080 Ti GPUs with 22GB VRAM each to run Qwen 3.6 27B at 38 tokens/s with llama.cpp, including tips on power limiting, tensor split mode, and KV cache settings.
Dual GPU llama.cpp speedup
A fork of llama.cpp fixes the --split-mode tensor issue with quantized KV caches, achieving up to 40% speed improvement on dual GPU setups without quality loss.
we really all are going to make it, aren't we? 2x3090 setup.
A user shares their experience setting up a dual 3090 system to run the Qwen 3.6 27B model locally, achieving over 100 tokens/second after switching to Ubuntu and using the club-3090 tool with custom patches. They express excitement about the future of local AI.
club-5060ti: practical RTX 5060 Ti local LLM notes and configs
A GitHub repository providing practical configurations and benchmarks for running local LLMs (like Qwen3.6 27B) on dual RTX 5060 Ti 16GB cards using vLLM and llama.cpp.