DeepSeek V4 Flash performance on DGX Spark

Reddit r/LocalLLaMA news

Summary

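A Reddit user shares their experience running DeepSeek V4 Flash on a dual ASUS GX10 (DGX Spark) setup, covering performance figures, configuration, and power draw, along with throughput benchmarks at several context lengths.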

Hi all, I've spent the past week trying to get DeepSeek V4 running on DGX Spark. Yesterday, thanks to the hard work of the [local-inference-lab](https://github.com/local-inference-lab) team, I finally got it working.

I have the ASUS GX10 version. Two GX10s are linked through their ConnectX-7 ports and run in Docker with a fairly rough setup. The KV cache can safely hold about 1M tokens of context at most. For stability under concurrency I usually set it to 256k.

I'm currently running the original MXFP8 x MXFP4 checkpoint of DeepSeek V4 Flash. There are NVFP4 variants out there too, but I haven't tested them. Once software support matures, my guess is that the NVFP4 variants will perform better on the Spark under high concurrency.

Throughput is decent. I'm the only one using the Sparks, so concurrency isn't critical for me. At most I run 3–4 requests in a batch, but usually it's 1–2. The Sparks handle that without any trouble, although TTFT naturally gets a bit longer.

I'm not using llama.cpp or LM Studio style stacks, because performance matters more to me than compatibility. So for now I'm on vLLM.

Here is the performance at concurrency = 1 across different context windows:

|Context|Prefill T/S|Decode T/S (MTP = 2)|
|:-|:-|:-|
|4k|2050|49.4|
|16k|2150|43.0|
|32k|2130|37.9|
|128k|1920|42.5|
|256k|1680|39.8|

As you can see, performance is very stable and degrades only slightly with context. The dip at 32k is probably caused by cold kernels at that size. At C=4 and 128k I usually see about 40–42 t/s aggregate, so roughly 10 t/s per request. Not very fast, but that's also not a common case for me.

The model is also very smart. In my private high-context retrieval and reasoning benchmark, V4 Flash easily beat M2.7 and Stepfun 3.7 (high reasoning mode, not max). Its world knowledge is behind denser models like V4 Pro or Kimi K2.6, but in raw intelligence it's very good. It may be the best model I've used.

I'm very happy with my DGX Sparks. DeepSeek has clearly done an excellent job, and a lot of the new techniques they used in V4 will get adopted elsewhere.

Overall, the Spark has impressed me. It doesn't do well on dense models (such as Gemma4 27B or Qwen3.6 26B), but its performance on MoE models is excellent, especially when the active parameters are under 15B. Power draw is very low: about 280 W total at full load, and it runs very stably under heavy load for long periods.

If you want to run DSV4-Flash on DGX Spark, here is a docker compose config. It's built on [local-inference-lab/vllm at dev/unholy-fusion](https://github.com/local-inference-lab/vllm/tree/dev/unholy-fusion), mainly with fixes for some prefix caching and crash problems I ran into. Those fixes will be pushed to my own fork: [aidendle94 (Aiden Le)](https://github.com/aidendle94).

```yaml
# DeepSeek-V4-Flash on DGX Spark GB10 (arm64 / sm_121a), TP=2 over RoCE.
#
# IMPORTANT NOTES BEFORE YOU RUN:
# 1. arm64 ONLY (GB10/Spark). Will NOT run on x86.
# 2. The MODEL WEIGHTS are NOT in the image. They live in the mounted HF cache
#    (~148 GB). Download deepseek-ai/DeepSeek-V4-Flash into ${HF_CACHE} first.
# 3. This is a 2-NODE setup. docker compose is single-host, so you run this SAME
#    file on EACH node with different env (NODE_RANK / HEADLESS). Start the WORKER
#    (rank 1) first, then the HEAD (rank 0).
# 4. The NCCL_* values (NCCL_IB_HCA, NCCL_SOCKET_IFNAME) and MASTER_ADDR are
#    SITE-SPECIFIC: edit them to match YOUR NICs and head-node IP.
# 5. For a SINGLE GPU / single node: set TP=1 and delete the --nnodes/--node-rank/
#    --master-addr lines + the multi-node env, and drop /dev/infiniband + NCCL_IB_*.
#
# Per-node launch (set via a .env file or inline):
#   HEAD  (node 0): NODE_RANK=0 HEADLESS=  MASTER_ADDR=<head-ip> docker compose up
#   WORKER(node 1): NODE_RANK=1 HEADLESS=1 MASTER_ADDR=<head-ip> docker compose up   # start this FIRST

services:
  vllm:
    image: aidendle94/sparkrun-vllm-ds4-gb10:production-ready
    network_mode: host          # NCCL bootstrap + RoCE need the host network
    ipc: host
    shm_size: "10gb"
    gpus: all                   # all local GPUs (1 per GB10 node). If your compose
                                # is older and rejects this, use the deploy: block below instead.
    devices:
      - /dev/infiniband:/dev/infiniband   # RoCE / IB verbs (omit on single-node)
    volumes:
      - ${HF_CACHE:-${HOME}/.cache/huggingface}:/cache/huggingface   # model + JIT caches
      - /etc/passwd:/etc/passwd:ro
      - /etc/group:/etc/group:ro
    environment:
      # --- model / cache / vLLM ---
      HF_HOME: /cache/huggingface
      HF_HUB_OFFLINE: "1"
      VLLM_CACHE_ROOT: /cache/huggingface/vllm-cache
      VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
      VLLM_USE_B12X_MOE: "1"
      VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: "256"
      VLLM_NCCL_SO_PATH: /opt/env/lib/python3.12/site-packages/nvidia/nccl/lib/libnccl.so.2
      # --- GB10 arch ---
      TORCH_CUDA_ARCH_LIST: "12.1a"
      FLASHINFER_CUDA_ARCH_LIST: "12.1a"
      # --- NCCL / RoCE (SITE-SPECIFIC: edit for your NICs) ---
      NCCL_NET: IB
      NCCL_IB_DISABLE: "0"
      NCCL_IB_HCA: "rocep1s0f0,roceP2p1s0f0"
      NCCL_SOCKET_IFNAME: "enP7s7,enp1s0f0np0,enP2p1s0f0np0"
      NCCL_IB_GID_INDEX: "3"
      NCCL_CROSS_NIC: "1"
      NCCL_CUMEM_ENABLE: "0"
      NCCL_IGNORE_CPU_AFFINITY: "1"
      NCCL_DEBUG: WARN
      # --- per-node (CHANGE PER HOST) ---
      NODE_RANK: "${NODE_RANK:?set 0 on head, 1 on worker}"
      HEADLESS: "${HEADLESS:-}"            # empty on head, "1" on worker
      MASTER_ADDR: "${MASTER_ADDR:?head-node IP}"
    # The image bakes /usr/local/bin/dsv4-vllm-entrypoint. We wrap in bash so the
    # ${HEADLESS:+--headless} flag is only added on the worker.
    command:
      - bash
      - -lc
      - >
        exec /usr/local/bin/dsv4-vllm-entrypoint serve deepseek-ai/DeepSeek-V4-Flash
        --served-model-name ChatGPTN
        --host 0.0.0.0 --port 8000
        --trust-remote-code
        --tensor-parallel-size 2
        --pipeline-parallel-size 1
        --kv-cache-dtype fp8
        --block-size 256
        --max-model-len 262144
        --max-num-seqs 4
        --max-num-batched-tokens 8192
        --gpu-memory-utilization 0.8
        --enable-prefix-caching
        --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
        --tokenizer-mode deepseek_v4
        --distributed-executor-backend mp
        --tool-call-parser deepseek_v4
        --enable-auto-tool-choice
        --reasoning-parser deepseek_v4
        --default-chat-template-kwargs.thinking=true
        --default-chat-template-kwargs.reasoning_effort=high
        --enable-flashinfer-autotune
        --nnodes 2
        --node-rank ${NODE_RANK}
        --master-addr ${MASTER_ADDR}
        --master-port 25000
        ${HEADLESS:+--headless}

    # If `gpus: all` isn't supported by your compose version, remove it and use:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
```
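To put the benchmark numbers in practical terms, here is a rough sketch that turns the posted throughput figures into wait-time estimates. It is not from the original post. It assumes that "4k" means 4096 tokens (and so on for the other rows), that the prefill rate is an average over the whole prompt, and that nothing is served from the prefix cache. The function names are made up for illustration.

```python
# Figures copied from the post's table (concurrency = 1, MTP = 2).
# Keys are context lengths in tokens (ASSUMPTION: "4k" = 4096, etc.);
# values are (prefill tokens/s, decode tokens/s).
BENCH = {
    4_096: (2050, 49.4),
    16_384: (2150, 43.0),
    32_768: (2130, 37.9),
    131_072: (1920, 42.5),
    262_144: (1680, 39.8),
}


def prefill_seconds(context_tokens: int, prefill_tps: float) -> float:
    """Approximate time to first token for a cold prompt (no prefix cache hit)."""
    return context_tokens / prefill_tps


def per_request_tps(aggregate_tps: float, concurrency: int) -> float:
    """Split an aggregate decode rate evenly across concurrent requests."""
    return aggregate_tps / concurrency


def request_seconds(context_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Prefill time plus decode time for one request."""
    return context_tokens / prefill_tps + output_tokens / decode_tps


for ctx, (prefill, decode) in BENCH.items():
    ttft = prefill_seconds(ctx, prefill)
    total = request_seconds(ctx, 1000, prefill, decode)
    print(f"{ctx:>7} tokens: ~{ttft:6.1f} s prefill, "
          f"~{total:6.1f} s for a 1000-token reply")

# The post's C=4, 128k case: 40-42 t/s aggregate shared by 4 requests.
print(per_request_tps(40, 4), per_request_tps(42, 4))
```

Under these assumptions, a cold 128k prompt spends a little over a minute in prefill before the first token appears, which is where prefix caching (enabled in the compose file via `--enable-prefix-caching`) would matter most for repeated long contexts.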
