Alibaba Qwen announces two major model releases: Qwen3-Omni, the first natively end-to-end omni-modal AI unifying text, image, audio and video, and Qwen3-Next-80B-A3B, an ultra-efficient MoE model with 3B activated parameters per token, achieving SOTA performance and 10x faster inference than Qwen3-32B.
This paper proposes attention-state memory, a training-free approach that stores precomputed attention states in a lightweight memory to improve accuracy and reduce latency for long-prefix inference, outperforming traditional methods on benchmarks.
RTPurbo leverages intrinsic sparsity in full-attention LLMs to achieve efficient long-context inference with minimal training overhead, enabling significant speedups while maintaining near-lossless accuracy.
BEAM introduces binary expert activation masking for dynamic routing in Mixture-of-Experts LLMs, achieving up to 85% FLOPs reduction with minimal performance loss and up to 2.5× faster decoding.
The paper proposes EGRSD and CL-EGRSD, on-policy self-distillation methods that weight token-level supervision by teacher entropy to improve the tradeoff between reasoning accuracy and response length in LLMs, evaluated on Qwen3-4B and Qwen3-8B.
GridProbe is a training-free inference paradigm for long-video VLMs that adaptively selects relevant frames using posterior probing, achieving sub-quadratic attention costs with minimal accuracy loss.
NVIDIA releases Star Elastic, a novel AI architecture allowing a single checkpoint to function as 30B, 23B, and 12B models via zero-shot slicing. This approach enables dynamic budget control for reasoning tasks, significantly reducing latency and compute costs while maintaining accuracy.
This paper presents Token-Selective Attention (TSA), a differentiable token routing mechanism that learns to skip unnecessary computations per token in transformer layers, reducing token-layer operations by 14–23% with minimal quality loss on language modeling tasks.
DeepSeek releases V4, a MoE model with a 1M-token context window optimized for agentic tasks through hybrid attention and reduced KV cache requirements.
GlobalSplat introduces an efficient feed-forward framework for 3D Gaussian splatting that achieves compact and consistent scene reconstruction using global scene tokens, reducing computational overhead and bringing inference time under 78 ms. The method uses a coarse-to-fine training approach to prevent representation bloat while maintaining competitive novel-view synthesis performance with significantly fewer Gaussians (16K) than dense baselines.
Switch-KD proposes a novel visual-switch knowledge distillation framework for efficiently compressing vision-language models by unifying multimodal knowledge transfer within a shared text-probability space. The method achieves a 3.6-point average improvement across 10 multimodal benchmarks when distilling a 0.5B TinyLLaVA student from a 3B teacher model.
SANA-Video is a small diffusion model that efficiently generates high-resolution, long videos using linear attention and a constant-memory KV cache, achieving competitive performance at much lower cost and higher speed than existing models.
AirLLM is an open-source library that enables running large language models (up to 405B parameters) on a single 4GB GPU without quantization, distillation, or pruning, significantly lowering the hardware barrier for local LLM inference.