Tag
This paper systematically studies variants of QKV projection sharing in transformers. It finds that sharing the key and value projections (Q-K=V) cuts the KV cache by 50% at a cost of only 3.1% perplexity degradation, and that combining this sharing with GQA or MQA can reach up to 96.9% cache reduction, which would make on-device inference practical with minimal quality loss.
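The cache-reduction figures can be sanity-checked with simple arithmetic. The sketch below is not from the paper: it assumes the KV cache size is proportional to the number of cached tensors per token per layer (two per KV head normally, one per KV head when K=V), and the function name `kv_cache_reduction` and the 16-head configuration are hypothetical choices for illustration. Under those assumptions, K=V sharing alone gives 50%, and K=V plus MQA on a 16-head model gives 96.875%, which rounds to the quoted 96.9%; the summary does not state which head count the paper actually used.

```python
def kv_cache_reduction(n_heads: int, n_kv_heads: int, share_kv: bool) -> float:
    """Fraction of KV cache saved relative to standard multi-head attention.

    Assumption (illustrative only): cache size scales with the number of
    cached tensors per token per layer; head dimension and dtype cancel out.
    """
    if n_heads <= 0 or not 0 < n_kv_heads <= n_heads:
        raise ValueError("need 0 < n_kv_heads <= n_heads")
    # Baseline MHA caches two tensors (K and V) for each of n_heads heads.
    baseline = 2 * n_heads
    # GQA/MQA caches only n_kv_heads heads; K=V sharing caches one tensor, not two.
    tensors_per_kv_head = 1 if share_kv else 2
    reduced = tensors_per_kv_head * n_kv_heads
    return 1.0 - reduced / baseline


if __name__ == "__main__":
    # Hypothetical 16-head model.
    print(f"K=V only:        {kv_cache_reduction(16, 16, True):.1%}")
    print(f"K=V + GQA (4):   {kv_cache_reduction(16, 4, True):.1%}")
    print(f"K=V + MQA (1):   {kv_cache_reduction(16, 1, True):.1%}")
```

The two savings multiply: K=V halves the tensors per head, while GQA/MQA shrinks the number of heads that are cached at all.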
After extensive experimentation, a developer ran the 284B-parameter DeepSeek-V4-Flash model on a Raspberry Pi 5 at over 1 tok/s, using an unmodified GGUF file from antirez.
A tweet demonstrates that Multi-Token Prediction (MTP) achieves significant speedups for Qwen models on dual RTX 5090 hardware, suggesting that local inference can now rival cloud-model performance.
A developer built a fully offline suitcase robot named Sparky using a Jetson Orin NX and the Gemma 4 E4B model. It achieves ~200 ms cached TTFT and 14-15 tok/s, with readings from 30+ sensors fed into the prompt as natural language, all without network connectivity.
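One way to picture "sensors feeding into the prompt as natural language" is a small function that renders each reading as a plain sentence. This is only a sketch of the general idea; the `Reading` type, the `sensors_to_prompt` function, and the sensor names are hypothetical and are not taken from the Sparky project.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One sensor sample (hypothetical structure for illustration)."""
    name: str
    value: float
    unit: str


def sensors_to_prompt(readings: list[Reading]) -> str:
    """Render sensor readings as short natural-language lines for an LLM prompt."""
    # ":g" drops trailing zeros, so 87.0 prints as "87" and 21.5 stays "21.5".
    lines = [f"- {r.name} is {r.value:g} {r.unit}".rstrip() for r in readings]
    return "Current sensor readings:\n" + "\n".join(lines)


if __name__ == "__main__":
    sample = [
        Reading("battery level", 87, "%"),
        Reading("ambient temperature", 21.5, "C"),
    ]
    print(sensors_to_prompt(sample))
```

Keeping the leading part of the prompt stable and appending only the changing readings at the end is one plausible way to benefit from prefix caching, though the summary does not say how the reported cached TTFT was obtained.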
A developer ran 10 concurrent agents of the 35B-parameter Qwen3.6 model on a single 74W GB10 GPU at 436 tok/s total using vLLM, demonstrating high-efficiency edge deployment.