40+tok/s - optimized recipe for Qwen 3.5 122B Int4 on a single DGX Spark with vLLM

Reddit r/LocalLLaMA 05/20/26, 04:03 PM Tools

optimization inference speed qwen-3.5 vllm dgx-spark int4

Summary

User shares an optimized recipe for running Qwen 3.5 122B Int4 on a single DGX Spark with vLLM, achieving over 40 tokens per second. They invite others to try and further optimize it.

Hello guys, two days ago I ran the spark-arena benchmark for my Qwen 3.5 122B recipe on a single DGX Spark, and it got the highest speed score at every context length and concurrency level among all Qwen 3.5 122B Int4 recipes. I just wanted to share it in case somebody wants to try it, play around with it, and optimize it further.

Benchmark results: https://spark-arena.com/benchmark/sub1779146508448

Screenshot: https://preview.redd.it/pz2dr3n4fb2h1.png?width=1099&format=png&auto=webp&s=40f078ae3df597545d08ed3df008f84873acca6a
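For anyone who wants a quick sanity check of their own tok/s number before submitting to the arena, here is a minimal sketch. It is not the spark-arena methodology and not part of the recipe. It assumes a vLLM server is already running with its OpenAI-compatible API reachable at a base URL you supply, and that the response carries a `usage.completion_tokens` field. The function names, the base URL, and the model name in the usage note are placeholders made up for illustration.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Return generation throughput as tokens divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure_once(base_url: str, model: str, prompt: str, max_tokens: int = 256) -> float:
    """Send one completion request and return end-to-end tokens per second.

    Assumption: base_url points at an OpenAI-compatible server (for example a
    local vLLM instance). The timing includes prompt processing, so the result
    is a lower bound on pure decode speed.
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    # The server reports how many tokens it actually generated.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

A call would look something like `measure_once("http://localhost:8000", "your-model-name", "Write a short story.")`, with the model name replaced by whatever your server is serving. A single request at concurrency 1 with a short prompt will not match arena numbers measured at other context lengths or concurrency levels, so treat it only as a rough check.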


Similar Articles

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Reddit r/LocalLLaMA

A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.

125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM

Qwen-3.6-27B, llamacpp, speculative decoding - appreciation post

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context
