Can't believe I got it working! Dual GPU - 48GB VRAM llama-cpp server - R9700 + 7800XT

Reddit r/LocalLLaMA 05/22/26, 07:52 PM News

dual-gpu amd vulkan llama-cpp vram docker ai-inference

Summary

A user set up a dual-GPU llama-cpp server with 48GB of combined VRAM, using an AMD R9700 AI PRO (32GB) and a 7800 XT (16GB) via the Vulkan backend in Docker on Kubuntu 24.04.

Setup:
- Kubuntu 24.04
- AMD cards: R9700 AI PRO and 7800xt (32GB + 16GB)
- llama-cpp server
- stack setup in docker
- vulkan image

I tried with ROCm, but it wouldn't play nice with the RDNA4 + RDNA3 mix. Vulkan seems to work. I tested a quick prompt; hopefully it's stable, because if so, this gives me 48GB of VRAM to play with. I had to buy a new power supply, but for $300 and the ability to leverage my older 7800xt, well worth it, I think.
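
For anyone trying something similar, here is a rough sketch of how a launch command for this kind of setup could be put together. The post doesn't share the actual config, so this is not the poster's setup. The image tag, model path, port, and the choice of layer split mode are all assumptions. Only the 32GB + 16GB card sizes come from the post. The idea is to pass the two VRAM sizes as `--tensor-split` proportions so the larger card takes roughly two thirds of the model.

```python
import shlex

# Only the VRAM sizes come from the post; everything else is an assumption.
GPU_VRAM_GB = {"R9700 AI PRO": 32, "7800 XT": 16}
IMAGE = "ghcr.io/ggml-org/llama.cpp:server-vulkan"  # assumed image tag
MODEL_DIR = "/path/to/models"   # hypothetical host directory
MODEL_FILE = "model.gguf"       # hypothetical model file


def tensor_split(vram_gb):
    """Return per-GPU proportions as a comma-separated string."""
    return ",".join(str(v) for v in vram_gb)


def build_docker_command(vram_gb, port=8080):
    """Assemble a `docker run` argument list for a Vulkan llama-server."""
    return [
        "docker", "run", "--rm",
        "--device", "/dev/dri",            # expose the GPU render nodes
        "-v", f"{MODEL_DIR}:/models:ro",   # mount models read-only
        "-p", f"{port}:{port}",
        IMAGE,
        "-m", f"/models/{MODEL_FILE}",
        "--host", "0.0.0.0",
        "--port", str(port),
        "--n-gpu-layers", "99",            # offload as many layers as possible
        "--split-mode", "layer",           # assumed; split whole layers across GPUs
        # Order must match how the backend enumerates the devices,
        # which may not be the order used here.
        "--tensor-split", tensor_split(vram_gb),
    ]


if __name__ == "__main__":
    sizes = list(GPU_VRAM_GB.values())
    print(f"Total VRAM: {sum(sizes)} GB")
    print(shlex.join(build_docker_command(sizes)))
```

Running the script only prints the command, so it can be checked before anything starts. If the cards enumerate in the opposite order, the split values would need to be swapped.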


Similar Articles

@leopardracer: https://x.com/leopardracer/status/2055341758523883631

X AI KOLs Timeline

A user shares their experience setting up a dual-GPU local AI lab with RTX 4080 Super and 5060 Ti, running Qwen 3.6 models via llama.cpp and llama-swap to reduce API costs and enable unrestricted experimentation.

2 old RTX 2080 Ti with 22GB vram each Qwen3.6 27B at 38 token/s with f16 kv cache

Reddit r/LocalLLaMA

A user shares their setup using two modded RTX 2080 Ti GPUs with 22GB VRAM each to run Qwen 3.6 27B at 38 tokens/s with llama.cpp, including tips on power limiting, tensor split mode, and KV cache settings.

Dual GPU llama.cpp speedup

Reddit r/LocalLLaMA

A fork of llama.cpp fixes the --split-mode tensor issue with quantized KV caches, achieving up to 40% speed improvement on dual GPU setups without quality loss.

we really all are going to make it, aren't we? 2x3090 setup.

Reddit r/LocalLLaMA

A user shares their experience setting up a dual 3090 GPU system to run the Qwen 3.6 27b model locally, achieving over 100 tokens/second after switching to Ubuntu and using the club-3090 tool with custom patches. They express excitement about the future of local AI.

club-5060ti: practical RTX 5060 Ti local LLM notes and configs

Reddit r/LocalLLaMA

A GitHub repository providing practical configurations and benchmarks for running local LLMs (like Qwen3.6 27B) on dual RTX 5060 Ti 16GB cards using vLLM and llama.cpp.
