Lite3R is a model-agnostic framework that improves the efficiency of transformer-based 3D reconstruction using sparse linear attention and FP8-aware quantization. It reduces latency and memory usage by up to 2.4x while maintaining geometric accuracy on backbones like VGGT and DA3-Large.
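To make the recipe concrete, here is a minimal sketch of linear attention combined with simulated FP8 quantization, in the spirit of Lite3R's efficiency claims. The feature map, the crude E4M3-style quantizer, and all function names are my assumptions for illustration; the paper's actual kernels, sparsity pattern, and FP8 calibration are not reproduced here.

```python
import numpy as np

def fp8_sim(x, m_bits=3, max_val=448.0):
    # Crude stand-in for E4M3 quantization: scale into the representable
    # range, then snap each value to a power-of-two grid with m_bits of
    # mantissa. Not Lite3R's actual scheme.
    scale = np.abs(x).max() / max_val + 1e-12
    y = x / scale
    exp = np.floor(np.log2(np.abs(y) + 1e-12))
    step = 2.0 ** (exp - m_bits)
    return np.round(y / step) * step * scale

def linear_attention(Q, K, V):
    # O(N*d^2) attention with an elu(x)+1 feature map in place of the
    # O(N^2) softmax; Q and K are quantized before the matmuls.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0)))
    Qf, Kf = phi(fp8_sim(Q)), phi(fp8_sim(K))
    KV = Kf.T @ V                        # (d, d_v) global key-value summary
    Z = Qf @ Kf.sum(axis=0)[:, None]     # (N, 1) per-query normalizer
    return (Qf @ KV) / (Z + 1e-6)

Q, K, V = (np.random.randn(512, 64) for _ in range(3))
out = linear_attention(Q, K, V)          # (512, 64); no N x N matrix formed
```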
This paper introduces Louver, a novel index structure for KV cache retrieval that reformulates sparse attention as a range searching problem, guaranteeing zero false negatives and improving efficiency over existing methods.
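A toy version of the range-searching idea, under my own assumptions rather than Louver's actual structure: keys are grouped into blocks, each block keeps coordinate-wise [min, max] bounds, and a query fetches every block whose dot-product upper bound clears a threshold. Because the bound over-approximates every key inside the block, no qualifying key can be missed, which is the zero-false-negative guarantee in miniature.

```python
import numpy as np

class RangeKVIndex:
    # Illustrative range-search index over KV-cache key blocks; not
    # Louver's actual data structure.
    def __init__(self, keys, block=64):
        self.blocks = [keys[i:i + block] for i in range(0, len(keys), block)]
        self.lo = np.stack([b.min(axis=0) for b in self.blocks])  # (B, d)
        self.hi = np.stack([b.max(axis=0) for b in self.blocks])  # (B, d)

    def retrieve(self, q, tau):
        # Upper bound of q . k over the box [lo, hi]: per dimension, pick
        # whichever corner maximizes the product. The bound never
        # underestimates, so no qualifying block is skipped.
        ub = np.where(q >= 0, self.hi * q, self.lo * q).sum(axis=1)
        return [i for i, u in enumerate(ub) if u >= tau]

keys = np.random.randn(4096, 64)
idx = RangeKVIndex(keys)
hits = idx.retrieve(np.random.randn(64), tau=10.0)  # candidate block ids
```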
NVIDIA applies late interaction, a form of sparse attention, to an attention-based encoder-decoder, enabling retrieval directly from the model's internal representations.
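Late interaction is commonly realized as ColBERT-style MaxSim scoring: both sides keep per-token embeddings, and each query token contributes the score of its best-matching document token. The sketch below illustrates that generic operator, not NVIDIA's specific retriever.

```python
import numpy as np

def maxsim(q_tok, d_tok):
    # Late interaction (MaxSim): token-level similarity matrix, then the
    # max over document tokens per query token, summed over the query.
    sims = q_tok @ d_tok.T          # (n_q, n_d)
    return sims.max(axis=1).sum()

# Rank documents by late-interaction score against one query.
q = np.random.randn(8, 128)                          # 8 query-token embeddings
docs = [np.random.randn(n, 128) for n in (40, 60)]   # per-token doc embeddings
ranked = sorted(range(len(docs)), key=lambda i: -maxsim(q, docs[i]))
```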
The paper introduces MISA, a method that applies a mixture-of-experts approach to the indexer heads in sparse attention mechanisms, significantly reducing computational costs for long-context LLM inference while maintaining performance.
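The shape of the idea, sketched with hypothetical names and shapes (W_idx, W_route, and the routing scheme are my placeholders, not MISA's API): a lightweight router activates only a few indexer heads per query, and just those heads score the cached keys to pick the tokens that receive full attention.

```python
import numpy as np

def moe_indexer(q, keys, W_idx, W_route, k_experts=2, keep=8):
    # q: (d,) query; keys: (N, d) KV-cache keys.
    # W_idx: (E, d, d) per-expert indexer projections (hypothetical).
    # W_route: (d, E) router weights (hypothetical).
    gate = q @ W_route                          # (E,) routing logits
    active = np.argsort(gate)[-k_experts:]      # only top-k indexer heads fire
    w = np.exp(gate[active] - gate[active].max())
    w /= w.sum()                                # softmax over active heads
    scores = np.zeros(len(keys))
    for wi, e in zip(w, active):
        scores += wi * (keys @ (W_idx[e] @ q))  # only active heads score keys
    return np.argsort(scores)[-keep:]           # token ids kept for attention

d, E, N = 64, 8, 1024
rng = np.random.default_rng(0)
sel = moe_indexer(rng.standard_normal(d), rng.standard_normal((N, d)),
                  rng.standard_normal((E, d, d)), rng.standard_normal((d, E)))
```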
This paper introduces In-context Sparse Attention (ISA), a framework that significantly reduces computational costs in video editing by pruning redundant context and using dynamic query grouping. The authors demonstrate the method's effectiveness with LIVEditor, achieving near-lossless acceleration and state-of-the-art results on multiple video editing benchmarks.
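Both ingredients can be sketched in a few lines; the similarity threshold, the greedy dedup, and the anchor-based grouping are my simplifications, not ISA's actual procedure. Redundant context tokens are pruned by similarity to already-kept tokens, and queries are clustered so each group shares one pass over the pruned context.

```python
import numpy as np

def prune_context(keys, thresh=0.95):
    # Greedy dedup: keep a key only if no already-kept key is nearly
    # identical (cosine similarity above thresh). A stand-in for ISA's
    # redundant-context pruning.
    u = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    kept = []
    for i in range(len(u)):
        if all(u[i] @ u[j] < thresh for j in kept):
            kept.append(i)
    return kept

def group_queries(queries, n_groups=4, seed=0):
    # Dynamic query grouping (schematic): assign each query to its nearest
    # of n_groups anchor queries; each group then runs one shared
    # attention pass over the pruned context.
    u = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    anchors = u[rng.choice(len(u), n_groups, replace=False)]
    return (u @ anchors.T).argmax(axis=1)       # group id per query

keys, queries = np.random.randn(2048, 64), np.random.randn(256, 64)
kept = prune_context(keys)                      # surviving context token ids
gids = group_queries(queries)                   # group assignment per query
```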