Tensor split mode: CUDA error on latest llama.cpp with Qwen-3.6-27b

Reddit r/LocalLLaMA News

Summary

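A user reports a CUDA error (an NCCL `unhandled system error` raised in `ggml_backend_cuda_comm_allreduce_nccl`) during model warmup when using `--split-mode tensor` with the latest llama.cpp and the Qwen-3.6-27b model on dual RTX 3090s, running in Docker on Ubuntu Server 24.04.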

Hi guys, I am running into issues when loading the Unsloth `UD-Q8_K_XL` quant and wanted to check if anyone has run into this. I updated my config to also use `--split-mode tensor`, but wanted to check whether I need to update drivers/CUDA to get it working, since I see that the tensor split mode fixes are merged into llama.cpp.

Running dual 3090s on Ubuntu Server 24.04.

`NVIDIA-SMI 580.159.03 Driver Version: 580.159.03 CUDA Version: 13.0`

This is my config, running in Docker with the latest llama.cpp image:

`-c 32768`
`--flash-attn on`
`--n-gpu-layers 999`
`--split-mode tensor`
`--parallel 1`
`--tensor-split 1,1`
`--jinja`
`--temp 0.6`
`--top-p 0.95`
`--min-p 0.01`
`--top-k 20`
`--presence-penalty 0.0`
`--spec-type draft-mtp`
`--spec-draft-n-max 2`
`--no-mmap`
`-np 1`

This is the error I get when starting up:

`0.01.790.389 I common_init_result: fitting params to device memory ...`
`0.01.790.389 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)`
`0.01.790.459 W common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort`
`0.12.433.663 W llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized`
`0.12.604.320 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)`
`/app/ggml/src/ggml-cuda/ggml-cuda.cu:103: CUDA error`
`0.13.277.104 E CUDA error: unhandled system error (run with NCCL_DEBUG=INFO for details)`
`0.13.277.108 E current device: 0, in function ggml_backend_cuda_comm_allreduce_nccl at /app/ggml/src/ggml-cuda/ggml-cuda.cu:1217`
`0.13.277.108 E ncclGroupEnd()`
`...`
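The error message itself suggests rerunning with `NCCL_DEBUG=INFO` to get more detail. Below is a minimal sketch of how that variable could be passed into the container. Several things here are assumptions rather than details from the post: the image name and model path are hypothetical placeholders, the `-m` flag and Docker's `--gpus all` are not shown in the original config, and the sketch assumes the image's entrypoint accepts llama.cpp flags directly. Sampling and speculative-decoding flags are left out for brevity. The function only prints the command so it can be reviewed before running.

```shell
#!/usr/bin/env bash
# Sketch only: prints a docker command line, it does not run anything.
# The image name and model path passed in are hypothetical placeholders.

build_debug_cmd() {
  local image="$1"              # hypothetical, e.g. your llama.cpp CUDA image
  local model="$2"              # hypothetical path to the GGUF file inside the container
  local split_mode="${3:-tensor}"  # defaults to the mode used in the post

  # NCCL_DEBUG=INFO is the variable named in the error message.
  # --gpus all and -m are assumptions; the remaining flags are copied from the post.
  printf '%s ' docker run --rm --gpus all \
    -e NCCL_DEBUG=INFO \
    "$image" \
    -m "$model" \
    -c 32768 --n-gpu-layers 999 \
    --split-mode "$split_mode" --tensor-split 1,1 \
    --parallel 1 --no-mmap
  printf '\n'
}
```

For example, `build_debug_cmd my-llama-image /models/model.gguf` prints the tensor-mode command with NCCL logging enabled, while passing `layer` as a third argument prints the same command with `--split-mode layer`, which may help confirm whether the failure is specific to tensor mode.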
