token-reduction

#token-reduction

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Hugging Face Daily Papers · 5d ago

Proposes Reroute, a training-free plug-in for vision-language models that replaces irreversible visual-token pruning with recoverable routing. Routed tokens can re-enter the pipeline later, which improves grounding under aggressive token reduction while maintaining VQA performance.

Differentiable Efficient Operator Search

arXiv cs.LG · 2026-06-05

Introduces Efficient Operator Search (EOS), a unified differentiable framework that generalizes token reduction methods (pruning, merging, pooling, adaptive reweighting) into a shared operator space, automatically searching for optimal operator compositions under budget constraints. The method achieves competitive results across benchmarks and reveals consistent operator patterns.

I built a compiler that rewrites Python into a model-facing representation

Reddit r/LocalLLaMA · 2026-06-03

Vulpine is a compiler that transforms human-readable Python code into a compressed macro representation optimized for LLMs, reducing token count by 13.8% on average while enabling exact structural reconstruction.

AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

arXiv cs.AI · 2026-05-20

AQuaUI is a training-free inference-time token reduction method for GUI agent models that uses adaptive quadtrees to reduce spatial redundancy in screenshots, achieving up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of performance.
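
As a rough illustration of how a quadtree can cut spatial redundancy, the sketch below recursively splits a square grayscale image and stops wherever a region is nearly uniform, so flat areas collapse into one large cell. This is a generic quadtree, not AQuaUI's actual algorithm: the function name `quadtree_leaves`, the intensity-range split criterion, and the `threshold` parameter are all assumptions made for illustration.

```python
def quadtree_leaves(img, x, y, size, threshold, min_size=1):
    """Return (x, y, size) leaf cells covering a square region of a grayscale image.

    Hypothetical sketch: a region is split only when its intensity range
    exceeds `threshold`; this criterion is an assumption, not the paper's.
    """
    pixels = [img[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    # Stop splitting when the region is nearly uniform or cannot shrink further.
    if size <= min_size or max(pixels) - min(pixels) <= threshold:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_leaves(img, x + dx, y + dy, half, threshold, min_size)
    return leaves


# 4x4 toy "screenshot": flat background with one bright pixel in the top-left corner.
image = [
    [255, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
leaves = quadtree_leaves(image, 0, 0, 4, threshold=10)
print(len(leaves), "cells instead of", 4 * 4, "pixels")
```

Only the quadrant containing the bright pixel is subdivided; the three uniform quadrants each stay a single cell, which is the kind of saving a per-cell token budget would exploit.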

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

Hugging Face Daily Papers · 2026-05-17

This paper introduces PUMA, a plug-and-play framework that detects semantic redundancy in chain-of-thought reasoning to enable early exit, achieving 26.2% average token reduction across multiple models and benchmarks while preserving accuracy and reasoning quality.

Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay

arXiv cs.AI · 2026-05-15

The LOOP Skill Engine achieves 99% success and 99% token reduction for periodic AI agent tasks by recording a single LLM-driven execution and replaying it deterministically via a parameterized, branch-free skill, eliminating stochastic failures and high costs.

@berryxia: Agent memory is incredibly competitive! I have to say, the more people join this track, the better it gets! The Tencent AI team spent a full 6 months tackling just one problem: AI agents frequently dropping context in long conversations. They ended up building a complete memory system and open-sourced it directly. After reading their sharing, my biggest takeaway is...

X AI KOLs Timeline · 2026-05-14

Tencent AI has open-sourced an Agent memory system that significantly improves token efficiency and agent consistency in long dialogues through three methods: real-time context compression, Mermaid task maps, and Persona memory. Token consumption is reduced by 61%, and persona consistency jumps from 48% to 76%.

Hint Tuning: Less Data Makes Better Reasoners

arXiv cs.CL · 2026-05-12

This paper introduces 'Hint Tuning,' a data-efficient method that reduces token usage in reasoning models by calibrating reasoning depth based on problem difficulty. It achieves significant token reduction (24–66%) on models like Qwen3-Thinking and DeepSeek-R1-Distill using only 1K self-annotated samples.

Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

Hugging Face Daily Papers · 2026-04-16

AVR is an adaptive visual reasoning framework that dynamically selects optimal reasoning formats to reduce token usage by 50-90% while maintaining accuracy in visual reasoning tasks. The method addresses reasoning path redundancy by decomposing visual reasoning into three cognitive functions and using FS-GRPO training to encourage efficient format selection.

rtk-ai/rtk

GitHub Trending (daily) · 2026-05-19

RTK is a high-performance CLI proxy that filters and compresses command outputs before they reach LLM context, reducing token consumption by 60-90% with minimal overhead.
