I didn't know it was possible to compile llamacpp to run cuda + vulkan at the same time..
Summary
The author discovered that compiling llama.cpp with both CUDA and Vulkan backends simultaneously is possible, yielding a ~10% improvement in tokens/sec for decoding. They plan to run further benchmarks to assess the benefits.
Similar Articles
@binsquares: omg, GPU acceleration on smolvm works way better than I thought. can run llama.cpp inside the smol machine with close t…
User @binsquares reports that GPU acceleration on smolvm achieves nearly 90% of host performance when running llama.cpp via the Vulkan backend.
sycl : port multi-column MMVQ from CUDA backend (~45% speculative decoding speedup on Intel Arc) by masonmilby · Pull Request #21845 · ggml-org/llama.cpp
A pull request for llama.cpp ports multi-column MMVQ from CUDA to SYCL, achieving approximately 45% speculative decoding speedup on Intel Arc GPUs.
Dual GPU llama.cpp speedup
A fork of llama.cpp fixes the --split-mode tensor issue with quantized KV caches, achieving up to 40% speed improvement on dual GPU setups without quality loss.
Pipeline parallelism in llama.cpp may be wasting your VRAM
Testing shows that default pipeline parallelism in llama.cpp wastes VRAM with no speed benefit; compiling with GGML_SCHED_MAX_COPIES=1 saves significant VRAM while maintaining identical inference speed.
@ggerganov: Highlighting recent advances in multi-GPU and tensor parallel support in llama.cpp Over the last few months llama.cpp m…
llama.cpp maintainers and NVIDIA engineers collaborated to significantly improve multi-GPU performance in ggml, enabling hardware-agnostic tensor parallelism and major performance gains on RTX systems.