qwen3.6 突然中断

Reddit r/LocalLLaMA 新闻

摘要

用户报告在使用 vLLM 配合特定 Docker 配置及投机解码(speculative decoding)部署 Qwen 3.6 模型时,模型会在任务中途停止生成。

https://preview.redd.it/74cj1xu9pw0h1.png?width=1229&format=png&auto=webp&s=3ae999cc3530ecb4eccf70e25f1a9eb2aa3f2d7b 有时 qwen 3.6 会在任务进行到一半时突然停止,有什么办法可以避免这种情况吗?我使用的是 qwen-code CLI,但在 opencode 上也出现了同样的问题。使用 Docker Compose 运行 vLLM: services: vllm-qwen36-27b-dual-dflash-noviz: image: vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657 container_name: vllm-qwen36-27b-dual-dflash-noviz restart: on-failure ports: - "${BIND_HOST:-0.0.0.0}:${PORT:-8080}:8000" volumes: - ${MODEL_DIR:-/home/ai/models/vllm}:/root/.cache/huggingface - /home/ai/club-3090/models/qwen3.6-27b/vllm/cache/torch_compile:/root/.cache/vllm/torch_compile_cache - /home/ai/club-3090/models/qwen3.6-27b/vllm/cache/triton:/root/.triton/cache - /home/ai/club-3090/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad/marlin.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/marlin.py:ro - /home/ai/club-3090/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad/MPLinearKernel.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/MPLinearKernel.py:ro environment: - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-} - CUDA_DEVICE_ORDER=PCI_BUS_ID - VLLM_WORKER_MULTIPROC_METHOD=spawn - NCCL_CUMEM_ENABLE=0 - NCCL_P2P_DISABLE=1 - VLLM_NO_USAGE_STATS=1 - VLLM_USE_FLASHINFER_SAMPLER=1 - OMP_NUM_THREADS=1 - PYTORCH_CUDA_ALLOC_CONF=${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True,max_split_size_mb:512} shm_size: "16gb" ipc: host deploy: resources: reservations: devices: - driver: nvidia device_ids: ["0", "2"] capabilities: [gpu] entrypoint: - /bin/bash - -c - | exec vllm serve ${VLLM_ENFORCE_EAGER:+--enforce-eager} "$@" - -- command: - --model - /root/.cache/huggingface/qwen3.6-27b-autoround-int4 - --served-model-name - qwen - --quantization - auto_round - --dtype - bfloat16 - --tensor-parallel-size - "2" - --disable-custom-all-reduce - --max-model-len - "${MAX_MODEL_LEN:-185000}" - --gpu-memory-utilization - "${GPU_MEMORY_UTILIZATION:-0.95}" - --max-num-seqs - "${MAX_NUM_SEQS:-2}" - --max-num-batched-tokens - "8192" - --language-model-only - --trust-remote-code - --reasoning-parser - qwen3 - --default-chat-template-kwargs - '{"enable_thinking": true}' - --enable-auto-tool-choice - --tool-call-parser - qwen3_coder - --enable-prefix-caching - --enable-chunked-prefill - --speculative-config - '{"method":"dflash","model":"/root/.cache/huggingface/qwen3.6-27b-dflash","num_speculative_tokens":5}' - --host - 0.0.0.0 - --port - "8000" 基于 [https://github.com/noonghunna/club-3090](https://github.com/noonghunna/club-3090) 有什么改进的建议吗?
查看原文

相似文章

Qwen3.6 27B 在 vLLM 中的表现比在 llama.cpp 中更差

Reddit r/LocalLLaMA

一名用户报告称,Qwen3.6-27B 模型在使用 llama.cpp 时比使用 vLLM 表现更好且更可靠,并指出尽管进行了大量配置,vLLM 仍出现工具调用错误和“被切除脑叶”的行为。

QWEN3.6 + ik_llama 快得离谱

Reddit r/LocalLLaMA

用户报告成功部署 Qwen 3.6 与 ik_llama 量化,在消费级硬件(16GB VRAM、32GB RAM)上实现 200k 上下文窗口下 50+ token/秒。