qwen3.6 突然中断

Reddit r/LocalLLaMA 2026/05/13 13:30 新闻

qwen3.6 vllm bug-report self-hosting gpu-inference docker ai-infrastructure

摘要

用户报告在使用 vLLM 配合特定 Docker 配置及投机解码（speculative decoding）部署 Qwen 3.6 模型时，模型会在任务中途停止生成。

https://preview.redd.it/74cj1xu9pw0h1.png?width=1229&format=png&auto=webp&s=3ae999cc3530ecb4eccf70e25f1a9eb2aa3f2d7b 有时 qwen 3.6 会在任务进行到一半时突然停止，有什么办法可以避免这种情况吗？我使用的是 qwen-code CLI，但在 opencode 上也出现了同样的问题。使用 Docker Compose 运行 vLLM： services: vllm-qwen36-27b-dual-dflash-noviz: image: vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657 container_name: vllm-qwen36-27b-dual-dflash-noviz restart: on-failure ports: - "${BIND_HOST:-0.0.0.0}:${PORT:-8080}:8000" volumes: - ${MODEL_DIR:-/home/ai/models/vllm}:/root/.cache/huggingface - /home/ai/club-3090/models/qwen3.6-27b/vllm/cache/torch_compile:/root/.cache/vllm/torch_compile_cache - /home/ai/club-3090/models/qwen3.6-27b/vllm/cache/triton:/root/.triton/cache - /home/ai/club-3090/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad/marlin.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/marlin.py:ro - /home/ai/club-3090/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad/MPLinearKernel.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/MPLinearKernel.py:ro environment: - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-} - CUDA_DEVICE_ORDER=PCI_BUS_ID - VLLM_WORKER_MULTIPROC_METHOD=spawn - NCCL_CUMEM_ENABLE=0 - NCCL_P2P_DISABLE=1 - VLLM_NO_USAGE_STATS=1 - VLLM_USE_FLASHINFER_SAMPLER=1 - OMP_NUM_THREADS=1 - PYTORCH_CUDA_ALLOC_CONF=${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True,max_split_size_mb:512} shm_size: "16gb" ipc: host deploy: resources: reservations: devices: - driver: nvidia device_ids: ["0", "2"] capabilities: [gpu] entrypoint: - /bin/bash - -c - | exec vllm serve ${VLLM_ENFORCE_EAGER:+--enforce-eager} "$@" - -- command: - --model - /root/.cache/huggingface/qwen3.6-27b-autoround-int4 - --served-model-name - qwen - --quantization - auto_round - --dtype - bfloat16 - --tensor-parallel-size - "2" - --disable-custom-all-reduce - --max-model-len - "${MAX_MODEL_LEN:-185000}" - --gpu-memory-utilization - "${GPU_MEMORY_UTILIZATION:-0.95}" - --max-num-seqs - "${MAX_NUM_SEQS:-2}" - --max-num-batched-tokens - "8192" - --language-model-only - --trust-remote-code - --reasoning-parser - qwen3 - --default-chat-template-kwargs - '{"enable_thinking": true}' - --enable-auto-tool-choice - --tool-call-parser - qwen3_coder - --enable-prefix-caching - --enable-chunked-prefill - --speculative-config - '{"method":"dflash","model":"/root/.cache/huggingface/qwen3.6-27b-dflash","num_speculative_tokens":5}' - --host - 0.0.0.0 - --port - "8000" 基于 [https://github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090) 有什么改进的建议吗？

查看原文

qwen3.6 突然中断

相似文章

Qwen3.6 27B 在 vLLM 中的表现比在 llama.cpp 中更差

QWEN3.6 + ik_llama 快得离谱

Qwen-3.6-27B + llamacpp 投机解码效果惊艳

Llama.cpp 服务器连续运行约两周后表现失常？

在搭载RTX 4060（8GB）的笔记本电脑上运行Qwen3.6-35B-A3B——哪些有效、哪些无效以及一个令人意外的推测解码结果

提交意见反馈