efficient-inference

#efficient-inference

Qwen3.7 Preview lands on Arena (1 minute read)

TLDR AI · 2026-05-19

Alibaba Qwen announces two major model releases: Qwen3-Omni, the first natively end-to-end omni-modal AI unifying text, image, audio and video, and Qwen3-Next-80B-A3B, an ultra-efficient MoE model with 3B activated parameters per token, achieving SOTA performance and 10x faster inference than Qwen3-32B.

Context Memorization for Efficient Long Context Generation

Hugging Face Daily Papers · 2026-05-18

Proposes attention-state memory, a training-free approach that stores precomputed attention states in lightweight memory to improve accuracy and reduce latency for long prefix inference, outperforming traditional methods on benchmarks.

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

Hugging Face Daily Papers · 2026-05-16

RTPurbo leverages intrinsic sparsity in full-attention LLMs to achieve efficient long-context inference with minimal training overhead, enabling significant speedups while maintaining near-lossless accuracy.

BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

arXiv cs.AI · 2026-05-15

BEAM introduces binary expert activation masking for dynamic routing in Mixture-of-Experts LLMs, achieving up to 85% FLOPs reduction with minimal performance loss and up to 2.5× faster decoding.
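
To make the idea of masking expert activations concrete, here is a minimal sketch for a single token. It is not BEAM's actual method: the fixed probability `threshold`, the function name `masked_top_k`, and the rule that the top-1 expert is always kept are all assumptions chosen for illustration. The blurb states only that a binary mask decides which routed experts run.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def masked_top_k(router_logits, k=2, threshold=0.2):
    """Pick up to k experts for one token, then apply a binary keep/skip mask.

    Hypothetical rule (an assumption, not taken from the paper): an expert in
    the top-k is skipped when its router probability is below `threshold`,
    except the top-1 expert, which is always kept.
    Returns (expert_index, renormalized_weight) pairs for the kept experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Binary mask over the top-k candidates: 1 = run the expert, 0 = skip it.
    mask = [1 if (rank == 0 or probs[i] >= threshold) else 0
            for rank, i in enumerate(top)]
    kept = [i for i, keep in zip(top, mask) if keep]
    norm = sum(probs[i] for i in kept)
    return [(i, probs[i] / norm) for i in kept]

# A confident router needs one expert; an undecided one keeps both.
confident = masked_top_k([4.0, 0.0, 0.0, 0.0])
undecided = masked_top_k([1.0, 1.0, 0.0, 0.0])
```

Each skipped expert saves one expert feed-forward pass for that token, which is where a FLOPs reduction would come from in a scheme of this kind.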

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

arXiv cs.AI · 2026-05-14

The paper proposes EGRSD and CL-EGRSD, on-policy self-distillation methods that weight token-level supervision by teacher entropy to improve the accuracy-length tradeoff of LLM reasoning, evaluated on Qwen3-4B and Qwen3-8B.
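
Weighting token-level supervision by teacher entropy can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's loss: the weight `1 - H / log(V)` (which down-weights tokens where the teacher is uncertain), the forward KL term, and the name `entropy_weighted_distill_loss` are all hypothetical choices. The sketch also assumes a vocabulary of more than one token.

```python
import math

def entropy(p):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(q * math.log(q) for q in p if q > 0)

def kl(p, q, eps=1e-12):
    # Forward KL(p || q), clamping q away from zero to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def entropy_weighted_distill_loss(teacher, student):
    """Weighted mean of per-token KL terms.

    teacher, student: lists of per-token probability distributions over the
    same vocabulary. Assumed weighting (not from the paper): tokens where the
    teacher distribution is near-uniform get weight near 0, tokens where it
    is near one-hot get weight near 1.
    """
    max_h = math.log(len(teacher[0]))
    num = den = 0.0
    for t, s in zip(teacher, student):
        w = 1.0 - entropy(t) / max_h
        num += w * kl(t, s)
        den += w
    return num / den if den > 0 else 0.0

teacher = [[1.0, 0.0], [0.5, 0.5]]   # confident token, then maximally uncertain token
student = [[0.5, 0.5], [0.9, 0.1]]
loss = entropy_weighted_distill_loss(teacher, student)
```

In this toy input the second token receives a weight of about zero, so the loss is driven almost entirely by the token on which the teacher is confident.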

GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs

Hugging Face Daily Papers · 2026-05-11

GridProbe is a training-free inference paradigm for Long-Video VLMs that adaptively selects relevant frames using posterior probing, achieving sub-quadratic attention costs with minimal accuracy loss.

NVIDIA AI Releases Star Elastic: One Checkpoint that Contains 30B, 23B, and 12B Reasoning Models with Zero-Shot Slicing

Reddit r/LocalLLaMA · 2026-05-10

NVIDIA releases Star Elastic, a novel AI architecture allowing a single checkpoint to function as 30B, 23B, and 12B models via zero-shot slicing. This approach enables dynamic budget control for reasoning tasks, significantly reducing latency and compute costs while maintaining accuracy.

Adaptive Computation Depth via Learned Token Routing in Transformers

arXiv cs.LG · 2026-05-08

This paper presents Token-Selective Attention (TSA), a differentiable token routing mechanism that learns to skip unnecessary computations per token in transformer layers, reducing token-layer operations by 14–23% with minimal quality loss on language modeling tasks.

DeepSeek-V4: a million-token context that agents can actually use

Hugging Face Blog · 2026-04-24

DeepSeek releases V4, a MoE model with a 1M-token context window optimized for agentic tasks through hybrid attention and reduced KV cache requirements.

GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens

Hugging Face Daily Papers · 2026-04-16

GlobalSplat introduces an efficient feed-forward framework for 3D Gaussian splatting that achieves compact and consistent scene reconstruction using global scene tokens, reducing computational overhead and bringing inference time under 78 ms. The method uses a coarse-to-fine training approach to prevent representation bloat while maintaining competitive novel-view synthesis performance with far fewer Gaussians (16K) than dense baselines.

Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

Hugging Face Daily Papers · 2026-04-16

Switch-KD proposes a visual-switch knowledge distillation framework for efficiently compressing vision-language models by unifying multimodal knowledge transfer within a shared text-probability space. The method achieves a 3.6-point average improvement across 10 multimodal benchmarks when distilling a 0.5B TinyLLaVA student from a 3B teacher model.

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Papers with Code Trending · 2025-09-29

SANA-Video is a small diffusion model that efficiently generates high-resolution, long videos using linear attention and a constant-memory KV cache, achieving competitive performance at dramatically lower cost and faster speed compared to existing models.

lyogavin/airllm

GitHub Trending (daily) · 2026-06-03

AirLLM is an open-source library that enables running large language models (up to 405B) on a single 4GB GPU without quantization, distillation, or pruning, significantly lowering the hardware barrier for local LLM inference.
