I didn't know it was possible to compile llama.cpp to run CUDA + Vulkan at the same time

Reddit r/LocalLLaMA News

Summary

The author found that llama.cpp can be compiled with the CUDA and Vulkan backends enabled at the same time, which lets it use two different cards together (one of them a W7800). They report roughly a 10% gain in decoding tokens/sec with a MiniMax-M3 IQ2_M GGUF and plan to run further benchmarks to assess the real benefit.

    cmake -B build -G "Visual Studio 17 2022" -A x64 -DCUDAToolkit_ROOT="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" -DGGML_CUDA=ON -DGGML_VULKAN=ON -DGGML_FLASH_ATTN=ON -DGGML_BLAS=OFF -DGGML_NATIVE=OFF -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DCMAKE_CUDA_ARCHITECTURES="120a" -DGGML_CUDA_FA_ALL_QUANTS=true

I was optimizing Vulkan for the W7800 using ds4 on opencode, and ds4 told me to "compile CUDA + Vulkan" to take advantage of the two cards. I had already tried CUDA + HIP, but not Vulkan. It's late now, but tomorrow I'll run some benchmarks to understand the real benefits. Has anyone tried this yet? I'm seeing about +10% tokens/sec (decoding) with MiniMax-M3-UD-IQ2_M-00001-of-00004.gguf.
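For anyone who wants to check a figure like the "+10%" above against their own runs, here is a minimal sketch of the arithmetic. It assumes you have collected decoding throughputs (tokens/sec) from several repeated runs of each build. The function name `relative_speedup` and the sample numbers are hypothetical placeholders, not measurements from this post.

```python
from statistics import mean


def relative_speedup(baseline_tps, candidate_tps):
    """Percent change in mean tokens/sec going from baseline to candidate."""
    base = mean(baseline_tps)
    cand = mean(candidate_tps)
    return (cand - base) / base * 100.0


# Hypothetical placeholder throughputs (tokens/sec), one value per repeated run.
# Replace these with your own measurements for each build.
baseline_runs = [20.0, 20.4, 19.6]      # e.g. the build you had before
cuda_vulkan_runs = [22.0, 22.3, 21.7]   # e.g. the CUDA + Vulkan build

change = relative_speedup(baseline_runs, cuda_vulkan_runs)
print(f"{change:+.1f}% decoding tokens/sec")
```

Averaging a few runs per build, with the same model, prompt, and context size, helps separate a real gain from run-to-run noise.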
