@ggerganov: Highlighting recent advances in multi-GPU and tensor parallel support in llama.cpp. Over the last few months llama.cpp m…
Summary
llama.cpp maintainers and NVIDIA engineers collaborated to significantly improve multi-GPU performance in ggml, enabling hardware-agnostic tensor parallelism and major performance gains on RTX systems.
Similar Articles
NCCL-Free Tensor Parallelism on Dual Blackwell PCIe: llama.cpp b9095 released!
llama.cpp build b9095 introduces NCCL-free tensor parallelism for dual Blackwell PCIe GPUs, so multi-GPU inference no longer depends on NCCL.
Comparing dual-GPU inference speed between llama.cpp row/tensor split and ik_llama graph split
A user benchmarks dual-GPU inference speed on two RTX 3080 20GB using llama.cpp (row/tensor split) and ik_llama (graph split) with a Qwen3.6-27B GGUF model, comparing token generation and prompt processing speeds.
CUDA: add fast walsh-hadamard transform by am17an · Pull Request #23615 · ggml-org/llama.cpp
This pull request adds a fast Walsh-Hadamard transform implementation to the CUDA backend of llama.cpp, an open-source LLM inference engine, to speed up operations that rely on the transform on NVIDIA GPUs.
@binsquares: omg, GPU acceleration on smolvm works way better than I thought. can run llama.cpp inside the smol machine with close t…
User @binsquares reports that GPU acceleration on smolvm achieves nearly 90% of host performance when running llama.cpp via the Vulkan backend.
@ggerganov: llama.cpp adds MTP for the Qwen3.6 family. This is a significant milestone for the local AI ecosystem. The performance j…
llama.cpp adds Multi-Token Prediction (MTP) support for the Qwen3.6 family, delivering massive performance improvements for local AI inference on commodity hardware.