NCCL-Free Tensor Parallelism on Dual Blackwell PCIe: llama.cpp b9095 released!
Summary
llama.cpp build b9095 introduces NCCL-free tensor parallelism for dual Blackwell PCIe GPUs, splitting a model's weights across both cards for multi-GPU inference without depending on NVIDIA's NCCL communication library.
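The summary does not spell out the exact invocation, but llama.cpp's long-standing split flags give a rough idea of what a dual-GPU tensor-parallel launch looks like; the sketch below is an illustrative example with a placeholder model path and an equal split, not the b9095 recipe itself.

```bash
# Sketch: serve one model split across two Blackwell PCIe GPUs using
# llama.cpp's row split mode, which splits individual weight tensors
# across devices. Model path and split ratio are placeholders.
CUDA_VISIBLE_DEVICES=0,1 ./llama-server \
  -m models/model-q4_k_m.gguf \
  -ngl 99 \
  --split-mode row \
  --tensor-split 1,1
```

The default --split-mode layer instead assigns whole layers to each GPU; row mode is the one that behaves like tensor parallelism and leans on inter-GPU traffic, which is exactly where avoiding NCCL matters on PCIe-only boxes.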
Similar Articles
Blackwell LLM Toolkit - NVFP4 Config + Wheels + Benchmarks for Blackwell GPUs via TensorRT-LLM - 270 tk/s Nemotron 3 Omni
A developer toolkit providing configurations, wheels, and benchmarks for running large language models with NVFP4 precision on NVIDIA Blackwell GPUs using TensorRT-LLM.
RTX Pro 4500 Blackwell - Qwen 3.6 27B?
A developer shares local inference benchmarks and systemd configurations for running the Qwen3.6-27B model on an NVIDIA RTX Pro 4500 Blackwell GPU using llama.cpp. The post requests optimization tips for throughput and explores potential use cases for larger models.
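Since the post centers on systemd configurations, here is a minimal sketch of what such a unit might look like; all paths, the model file, and the flags are assumptions for illustration, not the author's actual setup.

```bash
# Illustrative only: install a systemd unit that keeps llama-server running.
# Paths, model, and flags are assumed, not taken from the post.
sudo tee /etc/systemd/system/llama-server.service >/dev/null <<'EOF'
[Unit]
Description=llama.cpp server (Qwen3.6-27B)
After=network-online.target

[Service]
ExecStart=/opt/llama.cpp/build/bin/llama-server \
    -m /opt/models/qwen3.6-27b-q4_k_m.gguf \
    -ngl 99 --host 127.0.0.1 --port 8080
Restart=on-failure
User=llama

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
```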
@zcbenz: We have achieved a milestone in MLX that all tests are passing in CUDA backend now.
MLX has reached a milestone where all tests pass on the CUDA backend, indicating improved compatibility with NVIDIA GPUs.
@binsquares: omg, GPU acceleration on smolvm works way better than I thought. can run llama.cpp inside the smol machine with close t…
User @binsquares reports that GPU acceleration on smolvm achieves nearly 90% of host performance when running llama.cpp via the Vulkan backend.
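For reference, the Vulkan backend mentioned here is enabled at build time with a standard upstream flag; the steps below are the stock llama.cpp build, with the VM GPU-passthrough setup out of scope.

```bash
# Build llama.cpp with its Vulkan backend (requires the Vulkan SDK/headers).
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Run fully offloaded to the Vulkan device (model path is a placeholder).
./build/bin/llama-cli -m models/model-q4_k_m.gguf -ngl 99 -p "Hello"
```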
@pupposandro: https://x.com/pupposandro/status/2054241934164492328
The linked post announces support for DFlash and PFlash speculative decoding in llama.cpp on AMD Strix Halo iGPUs, demonstrating significant inference speedups under ROCm.
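DFlash and PFlash specifics aside, speculative decoding in llama.cpp pairs a large target model with a small draft model that proposes tokens for the target to verify in a single pass. The invocation below is a generic sketch with placeholder model paths; it does not reproduce the post's DFlash/PFlash flags.

```bash
# Generic speculative-decoding sketch: the draft model (-md) proposes up to
# --draft-max tokens per step, which the target model (-m) verifies.
# Model paths are placeholders; for an AMD/ROCm build like the Strix Halo
# setup, configure with: cmake -B build -DGGML_HIP=ON
./llama-server \
  -m  models/target-q4_k_m.gguf \
  -md models/draft-q8_0.gguf \
  --draft-max 16 \
  -ngl 99 -ngld 99
```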