User reports a CUDA error when using tensor split mode with the latest llama.cpp and Qwen-3.6-27b model on dual RTX 3090s with Ubuntu Server 24.04 and Docker.
Hi guys, I am running into issues when loading the Unsloth UD-Q8\_K\_XL quant and wanted to check if anyone has ran into this. I updated my config to also use --split-mode tensor but wanted to check if I need to update drivers/CUDA to get it working as I see that the tensor split mode fixes are merged into llama.cpp. Running dual 3090's on Ubuntu Server 24.04. `NVIDIA-SMI 580.159.03 Driver Version: 580.159.03 CUDA Version: 13.0` This is my config running in Docker with the latest llama.cpp image. `-c 32768` `--flash-attn on` `--n-gpu-layers 999` `--split-mode tensor` `--parallel 1` `--tensor-split 1,1` `--jinja` `--temp 0.6` `--top-p 0.95` `--min-p 0.01` `--top-k 20` `--presence-penalty 0.0` `--spec-type draft-mtp` `--spec-draft-n-max 2` `--no-mmap` `-np 1` This is the error I get when starting up `0.01.790.389 I common_init_result: fitting params to device memory ...` `0.01.790.389 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)` `0.01.790.459 W common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort` `0.12.433.663 W llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized` `0.12.604.320 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)` `/app/ggml/src/ggml-cuda/ggml-cuda.cu:103: CUDA error` `0.13.277.104 E CUDA error: unhandled system error (run with NCCL_DEBUG=INFO for details)` `0.13.277.108 E current device: 0, in function ggml_backend_cuda_comm_allreduce_nccl at /app/ggml/src/ggml-cuda/ggml-cuda.cu:1217` `0.13.277.108 E ncclGroupEnd()` `...`
Llama.cpp is expected to receive a fix for split mode tensor crashes on multi-GPU setups, which currently cause VRAM exhaustion every 90-120 minutes. The fix also reportedly brings a ~35% throughput improvement over layer mode.
A fork of llama.cpp fixes the --split-mode tensor issue with quantized KV caches, achieving up to 40% speed improvement on dual GPU setups without quality loss.
A user shares their experience setting up a dual-GPU local AI lab with RTX 4080 Super and 5060 Ti, running Qwen 3.6 models via llama.cpp and llama-swap to reduce API costs and enable unrestricted experimentation.
A developer shares local inference benchmarks and systemd configurations for running the Qwen3.6-27B model on an NVIDIA RTX Pro 4500 Blackwell GPU using llama.cpp. The post requests optimization tips for throughput and explores potential use cases for larger models.
The author shares detailed tuning tips for running the Qwen3.6-35B-A3B MoE model on an 8GB RTX 3070 Ti with up to 262k context using llama.cpp, achieving 30+ tps, and notes a 25% speed boost when switching from Windows to Ubuntu Server.