model-efficiency


Do transformers need three projections? Systematic study of QKV variants

Hacker News Top · 18h ago

This paper systematically studies QKV projection sharing in transformers. It finds that sharing the key and value projections (Q-K=V) cuts the KV cache by 50% at a cost of only 3.1% in perplexity, and that combining this sharing with GQA/MQA can reduce the cache by up to 96.9%, enabling practical on-device inference with minimal quality loss.
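The cache figures in this summary can be sanity-checked with simple arithmetic. The sketch below assumes the cache size scales with the number of KV heads times the number of tensors stored per head (two for separate K and V, one when K=V). The function names and the 16-head MQA configuration are hypothetical, chosen only because they land near the quoted 96.9%; the summary does not state the paper's actual configuration.

```python
def kv_cache_fraction(num_heads: int, num_kv_heads: int, share_kv: bool) -> float:
    """Fraction of the baseline multi-head-attention KV cache that remains.

    Baseline: every query head has its own K and V tensors.
    num_kv_heads < num_heads models GQA; num_kv_heads == 1 models MQA.
    share_kv=True models storing a single tensor used as both K and V.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    head_fraction = num_kv_heads / num_heads   # saving from fewer KV heads
    tensor_fraction = 0.5 if share_kv else 1.0  # saving from K=V sharing
    return head_fraction * tensor_fraction


def reduction_percent(num_heads: int, num_kv_heads: int, share_kv: bool) -> float:
    """Cache reduction relative to the baseline, in percent."""
    return 100.0 * (1.0 - kv_cache_fraction(num_heads, num_kv_heads, share_kv))


# K=V sharing alone, all heads kept: half the cache.
print(reduction_percent(16, 16, share_kv=True))  # 50.0

# Hypothetical 16-head model with MQA plus K=V sharing: 1/32 of the cache remains.
print(reduction_percent(16, 1, share_kv=True))   # 96.875
```

The two savings multiply, which is why sharing stacks well with GQA/MQA: each technique shrinks a different dimension of the cache.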


Complexity-Balanced Diffusion Splitting

Hugging Face Daily Papers · yesterday

Complexity-Balanced Splitting (CBS) partitions the diffusion timeline into segments of equal approximation burden using local complexity measures, improving synthesis quality by ~35% in FID without increasing inference cost.
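As a rough illustration of splitting a timeline into segments of equal burden, the sketch below greedily places boundaries so that each segment carries about the same total of a per-step complexity score. This is a generic equal-mass partition, not the CBS algorithm itself: the function name, the scores, and the greedy rule are all assumptions, and the summary does not say how CBS measures local complexity.

```python
from typing import List, Sequence


def balanced_boundaries(complexity: Sequence[float], num_segments: int) -> List[int]:
    """Return exclusive end indices of segments with roughly equal total complexity.

    Greedy sketch: a segment is closed when the running sum reaches the next
    multiple of total / num_segments. At most one segment is closed per step,
    so a single very large step can yield fewer segments than requested.
    """
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    target = sum(complexity) / num_segments
    boundaries: List[int] = []
    running = 0.0
    for i, c in enumerate(complexity):
        running += c
        next_threshold = target * (len(boundaries) + 1)
        if len(boundaries) < num_segments - 1 and running >= next_threshold:
            boundaries.append(i + 1)
    boundaries.append(len(complexity))  # the last segment ends at the final step
    return boundaries


# Hypothetical per-timestep complexity: the first two steps are the hardest.
scores = [4.0, 4.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0]
print(balanced_boundaries(scores, 2))  # [2, 8]
```

A uniform split of these eight steps would cut at index 4. The balanced split cuts at index 2, giving the two hard steps a segment of their own.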


@rohanpaul_ai: A large MoE model may be wasting half its expert compute on tokens that barely need expert help. In this paper 50% of e…

X AI KOLs Timeline · 2026-05-24

A new method called Zero-Expert Self-Distillation Adaptation (ZEDA) allows MoE models like Qwen3 and GLM to skip half their expert computations on easy tokens with minimal accuracy loss, achieving ~20% inference speedup by adding dummy experts that output nothing.


Q-ARVD: Quantizing Autoregressive Video Diffusion Models

Hugging Face Daily Papers · 2026-05-20

Q-ARVD is a quantization framework that reduces the inference cost of autoregressive video diffusion models by addressing frame-wise sensitivity imbalance and weight outlier patterns.


Post-Trained MoE Can Skip Half Experts via Self-Distillation

Hugging Face Daily Papers · 2026-05-18

ZEDA is a low-cost framework that converts post-trained static MoE models into dynamic ones by injecting zero-output experts and using self-distillation, achieving over 50% expert FLOP reduction with marginal accuracy loss on benchmarks.
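To make the idea of zero-output experts concrete, here is a minimal sketch of top-k routing in which some expert slots are placeholders that contribute nothing and cost nothing. It is a generic gating loop, not ZEDA's implementation: `moe_forward`, the toy experts, and the hand-picked router scores are hypothetical, and the self-distillation step that would teach the router when to pick a zero expert is not shown.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
# None marks a zero-output expert: selecting it skips the computation entirely.
Expert = Optional[Callable[[Vector], Vector]]


def moe_forward(
    x: Vector,
    router_scores: Sequence[float],
    experts: Sequence[Expert],
    top_k: int,
) -> Tuple[Vector, int]:
    """Route one token to its top_k experts.

    Returns the weighted sum of the selected experts' outputs and the number
    of real (non-zero) experts that were actually executed.
    """
    ranked = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)
    out = [0.0] * len(x)
    executed = 0
    for i in ranked[:top_k]:
        expert = experts[i]
        if expert is None:
            continue  # zero expert: no FLOPs spent, nothing added to the output
        y = expert(x)
        executed += 1
        out = [o + router_scores[i] * v for o, v in zip(out, y)]
    return out, executed


# Two toy real experts plus two zero experts (all hypothetical).
experts = [lambda v: [2.0 * a for a in v], lambda v: [-a for a in v], None, None]

# An "easy" token whose second-highest score falls on a zero expert:
# only one of the two selected slots does any work.
output, executed = moe_forward([1.0, 2.0], [0.6, 0.1, 0.3, 0.0], experts, top_k=2)
print(executed)  # 1
```

With top_k fixed at 2, every slot the router assigns to a zero expert halves the expert compute for that token, which is the mechanism behind the reported FLOP reduction.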


Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

arXiv cs.CL · 2026-05-13

This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.


SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

Hugging Face Daily Papers · 2026-05-11

SlimSpec introduces a low-rank parameterization for drafter LM-heads to accelerate speculative decoding in LLMs, achieving 4-5x speedup while maintaining full vocabulary support.
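The appeal of a low-rank LM-head is easiest to see by counting parameters. The sketch below compares a dense head of shape hidden_dim × vocab_size with a factorized one that projects to a small rank first and then to the full vocabulary. The dimensions and the rank are hypothetical, and the summary does not give SlimSpec's actual parameterization.

```python
def full_head_params(hidden_dim: int, vocab_size: int) -> int:
    """Parameters in a dense LM-head: one hidden_dim x vocab_size matrix."""
    return hidden_dim * vocab_size


def low_rank_head_params(hidden_dim: int, vocab_size: int, rank: int) -> int:
    """Parameters in a factorized head: hidden_dim x rank, then rank x vocab_size.

    The output still has one logit per vocabulary entry, so the full
    vocabulary stays addressable.
    """
    return hidden_dim * rank + rank * vocab_size


# Hypothetical drafter dimensions, for illustration only.
hidden_dim, vocab_size, rank = 4096, 128_000, 256

dense = full_head_params(hidden_dim, vocab_size)
factored = low_rank_head_params(hidden_dim, vocab_size, rank)
print(dense)     # 524288000
print(factored)  # 33816576
print(f"{dense / factored:.1f}x fewer parameters in the head")
```

This counts only the head's size and matmul cost. End-to-end decoding speedup also depends on acceptance rate and on the rest of the drafter, so the ratio here is not comparable to the 4-5x figure.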


Large Vision-Language Models Get Lost in Attention

arXiv cs.AI · 2026-05-08

This research paper analyzes the internal mechanics of Large Vision-Language Models (LVLMs) using information theory, revealing that attention mechanisms may be redundant while Feed-Forward Networks drive semantic innovation. The authors demonstrate that replacing learned attention weights with random values can yield comparable performance, suggesting current models 'get lost in attention'.
