inference-benchmark

#inference-benchmark

[Benchmark] DFlash Speculative Decoding + KV Cache Compression on RTX 5090 — 3.26x Speedup

Reddit r/LocalLLaMA ↗ · 2026-06-08

Benchmarks of DFlash speculative decoding combined with KV cache compression on an RTX 5090 show up to a 3.26x speedup on Qwen3.6-27B with minimal perplexity degradation; q4_0/turbo4 provides the best balance.

#inference-benchmark

Weird to get near linear scaling by adding another GPU?

Reddit r/LocalLLaMA ↗ · 2026-06-08

A user reports near-linear performance scaling when adding a second RTX 3090 for inference with a Qwen model, achieving roughly 1.8x decode TPS improvement without NVLink.

#inference-benchmark

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.

Reddit r/LocalLLaMA ↗ · 2026-05-29

Benchmarks of Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B using vLLM and llama.cpp show up to 3.34x faster inference, with optimal speculative token counts varying by model and engine.

#inference-benchmark

1000 tps generation on Qwen3.6 27B with V100s

Reddit r/LocalLLaMA ↗ · 2026-05-25

A post reports generation at 1000 tokens per second on Qwen3.6 27B using V100 GPUs with 128 concurrent requests, and 80 tokens per second for a single user.
