speedup

Tag

Cards List
#speedup

Using Gemma 4 E4B with the LiteRT engine - ~2.4x speedup over Q4 GGUF in text generation, image processing roughly the same

Reddit r/LocalLLaMA · 6d ago

A developer benchmarks Gemma 4 E4B using Google's LiteRT engine against a Q4 GGUF quant, finding ~2.4x speedup in text generation due to multi-token prediction (MTP), but only 1.1x in image captioning. The post provides a Python wrapper for an OpenAI-compatible endpoint, though with limitations like deterministic output and single-session engine.

0 favorites 0 likes
#speedup

@atomic_chat_hq: MTP speedup Qwen by 2.5x in Atomic Chat Dense vs MoE models on 2x RTX 5090 Qwen3.6 27B: 51 → 117 tps +137% Qwen3.6 35B-…

X AI KOLs Timeline · 2026-05-20 Cached

Atomic Chat's MTP technique speeds up Qwen dense models by 2.5x and MoE models by 25% on 2x RTX 5090 with zero accuracy loss and ~1 GB extra VRAM, using speculative decoding to draft and verify multiple tokens in one pass.

0 favorites 0 likes
#speedup

Dual GPU llama.cpp speedup

Reddit r/LocalLLaMA · 2026-05-17

A fork of llama.cpp fixes the --split-mode tensor issue with quantized KV caches, achieving up to 40% speed improvement on dual GPU setups without quality loss.

0 favorites 0 likes
#speedup

@NousResearch: Today we release Lighthouse Attention, a selection-based hierarchical attention for long-context pre-training that deli…

X AI KOLs Following · 2026-05-15

NousResearch releases Lighthouse Attention, a selection-based hierarchical attention that achieves 1.4-1.7x wall-clock speedup at 98K context and ~17x faster forward/backward pass than standard attention at 512K context on a single B200, validated on 530M-parameter Llama-3 models across 50B tokens.

0 favorites 0 likes
← Back to home

Submit Feedback