A developer benchmarks Gemma 4 E4B on Google's LiteRT engine against a Q4 GGUF quant and finds a ~2.4x speedup in text generation, attributed to multi-token prediction (MTP), but only a 1.1x speedup in image captioning. The post provides a Python wrapper that exposes an OpenAI-compatible endpoint, with limitations such as deterministic output and a single-session engine.
Atomic Chat's MTP technique speeds up Qwen dense models by 2.5x and MoE models by 25% on 2x RTX 5090 with zero accuracy loss and ~1 GB extra VRAM, using speculative decoding to draft and verify multiple tokens in one pass.
A fork of llama.cpp fixes the --split-mode tensor issue with quantized KV caches, achieving a speed improvement of up to 40% on dual-GPU setups without quality loss.
NousResearch releases Lighthouse Attention, a selection-based hierarchical attention that achieves 1.4-1.7x wall-clock speedup at 98K context and ~17x faster forward/backward pass than standard attention at 512K context on a single B200, validated on 530M-parameter Llama-3 models across 50B tokens.
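The draft-and-verify loop behind the MTP items above can be illustrated with a minimal greedy sketch. Everything in it is an assumption for illustration: `target_next` and `draft_next` are toy stand-ins for a large model and a cheap drafter, not the Atomic Chat or LiteRT implementation, and the verification step is written as a plain loop where a real engine would score all drafted positions in one batched forward pass. What it does show is why the output can match ordinary decoding exactly: a drafted token is kept only if the target model would have produced the same token.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]

def target_next(seq: List[int]) -> int:
    # Hypothetical "large" model: a deterministic toy next-token rule.
    return (31 * sum(seq) + len(seq)) % 50

def draft_next(seq: List[int]) -> int:
    # Hypothetical cheap drafter: agrees with the target except when the
    # context length is a multiple of 4, where it deliberately guesses wrong.
    t = target_next(seq)
    return (t + 1) % 50 if len(seq) % 4 == 0 else t

def greedy_generate(model: NextToken, prompt: List[int], n: int) -> List[int]:
    # Baseline: one model call per generated token.
    seq = list(prompt)
    for _ in range(n):
        seq.append(model(seq))
    return seq

def speculative_generate(target: NextToken, draft: NextToken,
                         prompt: List[int], n: int, k: int = 3
                         ) -> Tuple[List[int], int]:
    seq = list(prompt)
    goal = len(prompt) + n
    verify_passes = 0
    while len(seq) < goal:
        # 1) Draft k tokens with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2) Verify. Counted as one pass here; a real engine would batch
        #    these target evaluations into a single forward pass.
        verify_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # replace first mismatch, stop
                break
            accepted.append(tok)
        else:
            # Every draft matched: the target supplies one bonus token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], verify_passes

prompt = [1, 2, 3]
baseline = greedy_generate(target_next, prompt, 20)
fast, passes = speculative_generate(target_next, draft_next, prompt, 20)
print(fast == baseline, passes)
```

In this toy setup the speculative output is identical to the baseline while needing fewer verification passes than generated tokens. The real-world speedup depends on how often the drafter agrees with the target, which is one plausible reason reported gains differ between workloads.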