NVIDIA releases Star Elastic, an elastic model architecture that lets a single checkpoint run as 30B, 23B, or 12B models via zero-shot slicing. This enables dynamic compute-budget control for reasoning tasks, significantly reducing latency and compute cost while preserving accuracy.
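The mechanics of "slicing" can be pictured in a few lines of PyTorch. This is a hypothetical sketch, not NVIDIA's implementation: `slice_linear` is an invented helper, and real elastic checkpoints are trained so the nested weight prefixes stay accurate, whereas slicing an ordinary checkpoint this way would destroy quality.

```python
import torch
import torch.nn as nn

def slice_linear(layer: nn.Linear, in_dim: int, out_dim: int) -> nn.Linear:
    """Build a smaller Linear from the top-left block of a larger one.

    Illustrative only: elastic architectures jointly train the nested
    sub-networks so that these prefixes form a usable smaller model.
    """
    sliced = nn.Linear(in_dim, out_dim, bias=layer.bias is not None)
    with torch.no_grad():
        sliced.weight.copy_(layer.weight[:out_dim, :in_dim])
        if layer.bias is not None:
            sliced.bias.copy_(layer.bias[:out_dim])
    return sliced

# Example: derive a 2048-wide projection from a 4096-wide checkpoint,
# with no extra training ("zero-shot").
full = nn.Linear(4096, 4096)
small = slice_linear(full, in_dim=2048, out_dim=2048)
print(small.weight.shape)  # torch.Size([2048, 2048])
```

Applied across every layer (and to the layer count itself), the same checkpoint yields a family of smaller models whose size can be chosen per request.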
This paper presents Token-Selective Attention (TSA), a differentiable token routing mechanism that learns to skip unnecessary computations per token in transformer layers, reducing token-layer operations by 14–23% with minimal quality loss on language modeling tasks.
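A minimal sketch of the gating idea, assuming a sigmoid router on the residual path (the paper's actual TSA routing may differ): a learned per-token score decides how much of the expensive sublayer's output each token receives, and tokens with near-zero scores effectively skip the computation.

```python
import torch
import torch.nn as nn

class TokenSkipBlock(nn.Module):
    """Per-token layer skipping via a differentiable gate (assumed mechanism)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.router = nn.Linear(d_model, 1)  # per-token skip score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.router(x))  # (B, T, 1), soft gate in [0, 1]
        attn_out, _ = self.attn(x, x, x)
        # Gated residual: gate near 0 means the token bypasses the update.
        # At inference, a hard threshold could skip the compute entirely.
        return x + gate * attn_out

block = TokenSkipBlock(d_model=256, n_heads=4)
x = torch.randn(2, 16, 256)
print(block(x).shape)  # torch.Size([2, 16, 256])
```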
DeepSeek releases V4, a MoE model with a 1M-token context window optimized for agentic tasks through hybrid attention and reduced KV cache requirements.
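The KV-cache saving from hybrid attention is easy to estimate with back-of-envelope arithmetic. All figures below (layer counts, head counts, window size, precision) are illustrative assumptions, not DeepSeek V4's published configuration:

```python
# Why hybrid (local + global) attention shrinks the KV cache at 1M context.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bytes_per=2):
    # K and V tensors, per layer, per KV head, per cached position (fp16).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per / 2**30

ctx = 1_000_000
full = kv_cache_gib(n_layers=60, n_kv_heads=8, head_dim=128, seq_len=ctx)
# Suppose 3 of every 4 layers use an 8K sliding window, the rest stay global:
hybrid = (kv_cache_gib(15, 8, 128, ctx)        # global layers: full-length cache
          + kv_cache_gib(45, 8, 128, 8_192))   # windowed layers: 8K cache each
print(f"full: {full:.1f} GiB, hybrid: {hybrid:.1f} GiB")
# full: ~228.9 GiB, hybrid: ~58.6 GiB under these assumed settings
```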
GlobalSplat introduces an efficient feed-forward framework for 3D Gaussian splatting that achieves compact, consistent scene reconstruction using global scene tokens, cutting computational overhead and bringing inference time under 78 ms. The method uses a coarse-to-fine training schedule to prevent representation bloat while maintaining competitive novel-view synthesis quality with significantly fewer Gaussians (16K) than dense baselines.
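One way to picture "global scene tokens" is as a fixed budget of learned queries decoded into Gaussian parameters in a single forward pass, which caps the representation size by construction. The sketch below is an assumed architecture (class names and the 14-value parameter layout are invented for illustration), not GlobalSplat's actual design:

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Decode a fixed budget of scene tokens into 3D Gaussian parameters."""

    def __init__(self, n_gaussians: int = 16_384, d_token: int = 256):
        super().__init__()
        # One learned global token per output Gaussian: a hard budget.
        self.scene_tokens = nn.Parameter(torch.randn(n_gaussians, d_token))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_token, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Per Gaussian: 3 mean + 3 scale + 4 rotation quat + 1 opacity + 3 color.
        self.to_params = nn.Linear(d_token, 14)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        B = image_feats.shape[0]
        queries = self.scene_tokens.unsqueeze(0).expand(B, -1, -1)
        decoded = self.decoder(queries, image_feats)  # cross-attend to image features
        return self.to_params(decoded)                # (B, n_gaussians, 14)

head = GaussianHead(n_gaussians=2_048)  # small budget for a quick test; paper uses 16K
feats = torch.randn(1, 1024, 256)       # e.g. patch features from an image encoder
print(head(feats).shape)                # torch.Size([1, 2048, 14])
```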
Switch-KD proposes a visual-switch knowledge distillation framework for efficiently compressing vision-language models, unifying multimodal knowledge transfer within a shared text-probability space. The method achieves a 3.6-point average improvement across 10 multimodal benchmarks when distilling a 0.5B TinyLLaVA student from a 3B teacher.
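The core of distilling in a shared text-probability space is a token-level KL loss between teacher and student vocabulary distributions: because both models emit logits over the same text vocabulary, the student can match the teacher regardless of how each encodes the image. A minimal sketch (temperature and shapes are illustrative; Switch-KD's visual-switch mechanism adds more on top):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL(teacher || student) over the shared vocabulary, with temperature T."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    # batchmean + T^2 keeps gradient scale comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * T * T

vocab = 32_000
student = torch.randn(4, 16, vocab)  # 0.5B student logits (B, T, V)
teacher = torch.randn(4, 16, vocab)  # 3B teacher logits, same vocab
print(distill_loss(student, teacher).item())
```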