edge-inference

Do transformers need three projections? Systematic study of QKV variants

Hacker News Top · yesterday

This paper systematically studies variants of QKV projection sharing in transformers. It finds that sharing the key and value projections (Q-K=V) cuts the KV cache by 50% at a cost of only 3.1% in perplexity, and that combining this with GQA/MQA can reach up to 96.9% cache reduction, enabling practical on-device inference with minimal quality loss.
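
To make the cache arithmetic concrete, here is a minimal sketch of how the two savings would multiply. It assumes the KV cache scales linearly with the number of KV heads and that sharing K and V halves the tensors stored per head. The function names and the 16-head configuration are hypothetical and chosen for illustration; the summary does not state which model configuration the paper used to reach 96.9%.

```python
def kv_cache_fraction(n_heads: int, n_kv_heads: int, share_kv: bool) -> float:
    """Fraction of the baseline cache (MHA, separate K and V) that remains.

    Assumption: cache size is proportional to the number of KV heads and to
    the number of tensors stored per head (2 for K and V, 1 if shared).
    """
    if n_kv_heads <= 0 or n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a positive multiple of n_kv_heads")
    head_fraction = n_kv_heads / n_heads        # GQA/MQA saving
    tensor_fraction = 0.5 if share_kv else 1.0  # K=V sharing saving
    return head_fraction * tensor_fraction


def reduction_percent(n_heads: int, n_kv_heads: int, share_kv: bool) -> float:
    """Cache reduction relative to the baseline, in percent."""
    return 100.0 * (1.0 - kv_cache_fraction(n_heads, n_kv_heads, share_kv))


# Hypothetical 16-head model, used only to illustrate the arithmetic.
print(reduction_percent(16, 16, share_kv=True))   # K=V sharing alone
print(reduction_percent(16, 4, share_kv=False))   # GQA with 4 KV heads alone
print(reduction_percent(16, 1, share_kv=True))    # MQA combined with K=V sharing
```

Under these assumptions, K=V sharing alone gives the 50% figure, and a remaining fraction of 1/32 (for example, MQA on 16 heads plus sharing) gives 96.875%, which rounds to the quoted 96.9%. Other head counts could produce the same ratio, so this is only one possible reading.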

@danveloper: I can't believe this works, but I got DeepSeek-V4-Flash (284B params) running on a Raspberry Pi 5 (8GB edition) at >1to…

X AI KOLs Timeline · 4d ago

A developer successfully ran the 284B-parameter DeepSeek-V4-Flash model on a Raspberry Pi 5 at over 1 tok/s, using an untouched GGUF file from antirez after extensive experimentation.

@danyurkin: i don't think i need cloud models anymore

X AI KOLs Following · 2026-05-20

A tweet demonstrates that Multi-Token Prediction (MTP) achieves significant speedups for Qwen models on dual RTX 5090 hardware, suggesting that local inference can now rival cloud-model performance.

Built a fully offline suitcase robot around a Jetson Orin NX SUPER 16GB. Gemma 4 E4B, ~200ms cached TTFT, 30+ sensors, no WiFi/BT/cellular. He has opinions.

Reddit r/LocalLLaMA · 2026-05-15

A developer built a fully offline suitcase robot named Sparky around a Jetson Orin NX SUPER 16GB running the Gemma 4 E4B model. It achieves ~200ms cached TTFT and 14-15 tok/s, with 30+ sensors feeding into the prompt as natural language, all without network connectivity.

@iotcoi: Ran Google’s cookbook with 10 agents on my tiny GB10 GPU. 436 tok/s / 43.6 per agent Qwen3.6-35B + Dflash + DDTree on v…

X AI KOLs Timeline · 2026-04-22

A developer ran 10 concurrent agents on the 35B-parameter Qwen3.6 model on a single 74W GB10 GPU using vLLM, reaching 436 tok/s in total (43.6 tok/s per agent) and demonstrating high-efficiency edge deployment.
