I didn't know it was possible to compile llamacpp to run cuda + vulkan at the same time..

Reddit r/LocalLLaMA 06/16/26, 10:10 PM News

llamacpp cuda vulkan multi-gpu compilation optimization llm-inference

Summary

The author found that llama.cpp can be compiled with the CUDA and Vulkan backends enabled at the same time, and reports roughly 10% higher decoding tokens/sec with a MiniMax-M3-UD-IQ2_M GGUF on a two-card setup. They plan to run further benchmarks to assess the real benefit.

cmake -B build -G "Visual Studio 17 2022" -A x64 -DCUDAToolkit_ROOT="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" -DGGML_CUDA=ON -DGGML_VULKAN=ON -DGGML_FLASH_ATTN=ON -DGGML_BLAS=OFF -DGGML_NATIVE=OFF -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DCMAKE_CUDA_ARCHITECTURES="120a" -DGGML_CUDA_FA_ALL_QUANTS=true

I was optimizing Vulkan for the w7800 using ds4 on opencode, and ds4 told me to "compile cuda + vulkan" to take advantage of the two cards. I had already tried cuda + hip but not vulkan. It's late now, but tomorrow I'll try some benchmarks to understand the real benefits. Has anyone tried it yet? I have about +10% tokens/sec (decoding) with MiniMax-M3-UD-IQ2_M-00001-of-00004.gguf
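For anyone wanting to check a figure like the "+10% tokens/sec" above, here is a minimal sketch of the arithmetic in Python. It assumes you have already collected decode throughput from a few runs of each build; the function name and the sample numbers are hypothetical placeholders, not measurements from the post.

```python
from statistics import mean


def relative_speedup(baseline_tps, candidate_tps):
    """Fractional change in mean tokens/sec of the candidate build vs. the baseline."""
    base = mean(baseline_tps)
    cand = mean(candidate_tps)
    if base <= 0:
        raise ValueError("baseline throughput must be positive")
    return (cand - base) / base


# Hypothetical placeholder values (decode tokens/sec), NOT taken from the post.
cuda_only_runs = [20.0, 20.5, 19.5]
cuda_vulkan_runs = [22.0, 22.5, 21.5]

gain = relative_speedup(cuda_only_runs, cuda_vulkan_runs)
print(f"decode speedup: {gain:+.1%}")
```

Averaging several runs per build before comparing helps separate a real gain from run-to-run noise, which matters when the difference is only around 10%.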

Similar Articles

@binsquares: omg, GPU acceleration on smolvm works way better than I thought. can run llama.cpp inside the smol machine with close t…

X AI KOLs Following

User @binsquares reports that GPU acceleration on smolvm achieves nearly 90% of host performance when running llama.cpp via the Vulkan backend.

sycl : port multi-column MMVQ from CUDA backend (~45% speculative decoding speedup on Intel Arc) by masonmilby · Pull Request #21845 · ggml-org/llama.cpp

Reddit r/LocalLLaMA

A pull request for llama.cpp ports multi-column MMVQ from CUDA to SYCL, achieving approximately 45% speculative decoding speedup on Intel Arc GPUs.

Dual GPU llama.cpp speedup

Reddit r/LocalLLaMA

A fork of llama.cpp fixes the --split-mode tensor issue with quantized KV caches, achieving up to 40% speed improvement on dual GPU setups without quality loss.

Pipeline parallelism in llama.cpp may be wasting your VRAM

Reddit r/LocalLLaMA

Testing shows that default pipeline parallelism in llama.cpp wastes VRAM with no speed benefit; compiling with GGML_SCHED_MAX_COPIES=1 saves significant VRAM while maintaining identical inference speed.

@ggerganov: Highlighting recent advances in multi-GPU and tensor parallel support in llama.cpp Over the last few months llama.cpp m…

X AI KOLs Following

llama.cpp maintainers and NVIDIA engineers collaborated to significantly improve multi-GPU performance in ggml, enabling hardware-agnostic tensor parallelism and major performance gains on RTX systems.
