A new implementation of Multi-Token Prediction (MTP) in llama.cpp achieves a 40% speedup for Gemma 4 models, tested on a MacBook Pro M5 Max. The post provides links to quantized GGUF models and the patched source code.
A single RTX 3090 pushes 134 tok/s on the newly released Qwen 3.5 Dense 27B and 73 tok/s on Qwen 3.6-27B via fused kernels plus speculative decoding, with GGUF drops landing the same evening.
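The speculative-decoding gain in the post can be reasoned about with the standard expected-accepted-tokens formula; the acceptance rate and draft length below are illustrative assumptions, not figures from the post:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Expected number of tokens emitted per target-model forward pass
    # when a k-token draft is verified and each draft token is accepted
    # independently with probability alpha. Geometric series:
    # 1 + alpha + alpha^2 + ... + alpha^k = (1 - alpha^(k+1)) / (1 - alpha).
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Hypothetical numbers: 80% acceptance with a 4-token draft yields
# roughly 3.4 tokens per verification pass instead of 1.
print(round(expected_tokens_per_step(0.8, 4), 2))  # → 3.36
```

The net speedup is smaller than this factor because the draft model's own forward passes are not free; the formula only bounds how many target-model passes are saved.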
Author shares experience hitting diminishing returns with FP16 + ONNX + pruning on a 162 MB transformer, and seeks advice on the next best step among quantization, distillation, low-rank factorization, or hardware-specific tricks.
User benchmarks Qwen3.6-27B-Q8_0 at ~13 tokens/sec on 3 mixed GPUs with 128k context via llama.cpp, and asks whether that performance is typical.