Proposes Reroute, a training-free plug-in for vision-language models that replaces irreversible visual-token pruning with recoverable routing, allowing tokens to re-enter the pipeline later to improve grounding under aggressive token reduction while maintaining VQA performance.
Introduces Efficient Operator Search (EOS), a unified differentiable framework that generalizes token reduction methods (pruning, merging, pooling, adaptive reweighting) into a shared operator space, automatically searching for optimal operator compositions under budget constraints. The method achieves competitive results across benchmarks and reveals consistent operator patterns.
Vulpine is a compiler that transforms human-readable Python code into a compressed macro representation optimized for LLMs, reducing token count by 13.8% on average while enabling exact structural reconstruction.
AQuaUI is a training-free inference-time token reduction method for GUI agent models that uses adaptive quadtrees to reduce spatial redundancy in screenshots, achieving up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of performance.
This paper introduces PUMA, a plug-and-play framework that detects semantic redundancy in chain-of-thought reasoning to enable early exit, achieving 26.2% average token reduction across multiple models and benchmarks while preserving accuracy and reasoning quality.
The LOOP Skill Engine achieves a 99% success rate and 99% token reduction on periodic AI agent tasks by recording a single LLM-driven execution and replaying it deterministically as a parameterized, branch-free skill, eliminating stochastic failures and high costs.
Tencent AI has open-sourced an agent memory system that improves token efficiency and agent consistency in long dialogues through three methods: real-time context compression, Mermaid task maps, and Persona memory. Token consumption is reduced by 61%, and persona consistency rises from 48% to 76%.
This paper introduces 'Hint Tuning,' a data-efficient method that reduces token usage in reasoning models by calibrating reasoning depth based on problem difficulty. It achieves significant token reduction (24–66%) on models like Qwen3-Thinking and DeepSeek-R1-Distill using only 1K self-annotated samples.
AVR is an adaptive visual reasoning framework that dynamically selects optimal reasoning formats to reduce token usage by 50–90% while maintaining accuracy in visual reasoning tasks. The method addresses reasoning path redundancy by decomposing visual reasoning into three cognitive functions and using FS-GRPO training to encourage efficient format selection.
RTK is a high-performance CLI proxy that filters and compresses command outputs before they reach LLM context, reducing token consumption by 60–90% with minimal overhead.
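The idea in the last entry, filtering command output before it reaches a model's context, can be illustrated with a minimal Python sketch. This is not RTK's implementation or interface: the function name `compress_output`, the `max_lines` parameter, and the three filtering rules (strip ANSI color codes, collapse consecutive duplicate lines, elide the middle of long output) are assumptions chosen only to show the general technique.

```python
import re

# Matches common ANSI CSI escape sequences such as "\x1b[32m" (colors, cursor moves).
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def compress_output(text: str, max_lines: int = 20) -> str:
    """Hypothetical filter: shrink command output before sending it to an LLM.

    Steps (all illustrative assumptions, not a real tool's behavior):
      1. strip ANSI escape sequences and trailing whitespace, drop blank lines
      2. collapse runs of identical consecutive lines into "line  (xN)"
      3. keep only the head and tail if more than max_lines remain
    """
    if max_lines < 2:
        raise ValueError("max_lines must be at least 2")

    # Step 1: clean each line and skip empty ones.
    cleaned = []
    for raw in text.splitlines():
        line = ANSI_RE.sub("", raw).rstrip()
        if line:
            cleaned.append(line)

    # Step 2: collapse consecutive duplicates, remembering the repeat count.
    runs = []  # list of [line, count]
    for line in cleaned:
        if runs and runs[-1][0] == line:
            runs[-1][1] += 1
        else:
            runs.append([line, 1])
    out = [line if n == 1 else f"{line}  (x{n})" for line, n in runs]

    # Step 3: elide the middle of long output, keeping head and tail.
    if len(out) > max_lines:
        head = max_lines // 2
        tail = max_lines - head
        omitted = len(out) - max_lines
        out = out[:head] + [f"... [{omitted} lines omitted] ..."] + out[-tail:]

    return "\n".join(out)


if __name__ == "__main__":
    sample = "\x1b[32mok\x1b[0m\n\nwarn\nwarn\nwarn\ndone\n"
    print(compress_output(sample))
```

Character or line counts are only a rough proxy for tokens; how much a filter like this saves depends on the tokenizer and on how repetitive the command output is.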