Tensor split mode: CUDA error with Qwen-3.6-27b on the latest llama.cpp

Reddit r/LocalLLaMA 2026/06/03 12:38 News

llama-cpp cuda-error tensor-split-mode multi-gpu qwen-3-6-27b ubuntu docker

Summary

A user running dual RTX 3090s on Ubuntu Server 24.04 with Docker reports a CUDA error when using tensor split mode with the latest llama.cpp and the Qwen-3.6-27b model.

Hi all, I'm running into a problem loading the Unsloth UD-Q8_K_XL quant and wanted to ask whether anyone else has hit it. I've updated my config to add `--split-mode tensor`, but I'd like to confirm whether I need to update my driver/CUDA for it to work, since I know the tensor split mode fix has already been merged into llama.cpp. I'm currently running dual 3090s on Ubuntu Server 24.04: `NVIDIA-SMI 580.159.03 Driver Version: 580.159.03 CUDA Version: 13.0`

Below is my config, using the latest llama.cpp image in Docker:

```
-c 32768 --flash-attn on --n-gpu-layers 999 --split-mode tensor --parallel 1 --tensor-split 1,1 --jinja --temp 0.6 --top-p 0.95 --min-p 0.01 --top-k 20 --presence-penalty 0.0 --spec-type draft-mtp --spec-draft-n-max 2 --no-mmap -np 1
```

On startup I get the following error:

```
0.01.790.389 I common_init_result: fitting params to device memory ...
0.01.790.389 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.01.790.459 W common_fit_params: failed to fit params to free device memory: llama_params_fit is not implemented for SPLIT_MODE_TENSOR, abort
0.12.433.663 W llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.12.604.320 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/app/ggml/src/ggml-cuda/ggml-cuda.cu:103: CUDA error
0.13.277.104 E CUDA error: unhandled system error (run with NCCL_DEBUG=INFO for details)
0.13.277.108 E current device: 0, in function ggml_backend_cuda_comm_allreduce_nccl at /app/ggml/src/ggml-cuda/ggml-cuda.cu:1217
0.13.277.108 E ncclGroupEnd()
...
```
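The error message itself asks for a run with `NCCL_DEBUG=INFO`. Below is a minimal sketch, in Python, that assembles a `docker run` command relaunching the post's exact flags with NCCL logging enabled. The image tag `ghcr.io/ggml-org/llama.cpp:server-cuda`, the host model directory and filename, and the `--shm-size=1g` setting are assumptions, not taken from the post. The shared-memory setting is included only because NCCL's troubleshooting guide lists Docker's small default `/dev/shm` as a source of failures; whether that is the cause here is unconfirmed. `-fit off` is the flag the log itself suggests when reproducing problems in the fit step.

```python
import shlex

# Flags copied verbatim from the post's llama.cpp configuration.
USER_FLAGS = [
    "-c", "32768", "--flash-attn", "on", "--n-gpu-layers", "999",
    "--split-mode", "tensor", "--parallel", "1", "--tensor-split", "1,1",
    "--jinja", "--temp", "0.6", "--top-p", "0.95", "--min-p", "0.01",
    "--top-k", "20", "--presence-penalty", "0.0",
    "--spec-type", "draft-mtp", "--spec-draft-n-max", "2",
    "--no-mmap", "-np", "1",
]

# Assumptions (not from the post): image tag, host model path, model filename.
IMAGE = "ghcr.io/ggml-org/llama.cpp:server-cuda"
MODEL_DIR = "/srv/models"                    # hypothetical host directory
MODEL_FILE = "Qwen3.6-27B-UD-Q8_K_XL.gguf"   # hypothetical filename


def build_debug_command(shm_size="1g"):
    """Return the argv for a docker run that surfaces NCCL diagnostics."""
    return [
        "docker", "run", "--rm", "--gpus", "all",
        # The CUDA error message asks for NCCL_DEBUG=INFO.
        "-e", "NCCL_DEBUG=INFO",
        # Assumption: NCCL may need more /dev/shm than Docker's default.
        f"--shm-size={shm_size}",
        "-v", f"{MODEL_DIR}:/models",
        IMAGE,
        # Everything after the image is passed to the server entrypoint.
        "-m", f"/models/{MODEL_FILE}",
        *USER_FLAGS,
        # Suggested by the log for reproducing issues in the fit step.
        "-fit", "off",
    ]


if __name__ == "__main__":
    print(shlex.join(build_debug_command()))
```

Running the script prints a single shell command line; the extra NCCL lines in the resulting startup log may show which transport or system call failed inside `ncclGroupEnd()`.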

Similar articles

Llama.cpp: split mode tensor fix coming soon?

Dual-GPU llama.cpp speedup

@leopardracer: https://x.com/leopardracer/status/2055341758523883631

RTX Pro 4500 Blackwell - Qwen 3.6 27B?

Qwen3.6-35B-A3B Q4 with 262k context, +30 tps on an 8GB 3070 Ti

提交意见反馈