INAR-VL: Input-Aware Routing for Edge-Cloud Vision-Language Inference
Summary
INAR-VL proposes a lightweight routing system for edge-cloud vision-language inference that dynamically routes each query to an edge or a cloud model based on the query's complexity, significantly reducing latency and energy use while keeping accuracy close to that of the cloud model.
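The routing idea in the summary can be illustrated with a minimal sketch. It is not INAR-VL's actual method: the `Query` fields, the `complexity_score` heuristic, the `0.5` threshold, and the stand-in model callables are all hypothetical, chosen only to show how a per-query complexity estimate could decide between an edge model and a cloud model.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Query:
    question: str
    image_pixels: int  # hypothetical proxy for the size of the visual input


def complexity_score(q: Query) -> float:
    """Toy heuristic in [0, 1]; a stand-in for whatever router the paper trains."""
    text_term = min(len(q.question.split()) / 50.0, 1.0)
    image_term = min(q.image_pixels / (1024 * 1024), 1.0)
    return 0.5 * text_term + 0.5 * image_term


def route(
    q: Query,
    edge_model: Callable[[Query], str],
    cloud_model: Callable[[Query], str],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[str, str]:
    """Send low-complexity queries to the edge model, the rest to the cloud."""
    if complexity_score(q) < threshold:
        return "edge", edge_model(q)
    return "cloud", cloud_model(q)


# Stand-in models so the sketch runs without any real inference backend.
def edge_model(q: Query) -> str:
    return "edge answer"


def cloud_model(q: Query) -> str:
    return "cloud answer"


simple = Query("What color is the car?", 224 * 224)
hard = Query(" ".join(["word"] * 60), 2048 * 2048)

simple_target, simple_answer = route(simple, edge_model, cloud_model)
hard_target, hard_answer = route(hard, edge_model, cloud_model)
```

In this sketch the threshold sets the trade-off: raising it keeps more queries on the edge device, which lowers latency and energy use but risks losing accuracy on harder inputs.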
Similar Articles
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.
Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models
Proposes Reroute, a training-free plug-in for vision-language models that replaces irreversible visual-token pruning with recoverable routing, allowing tokens to re-enter the pipeline later to improve grounding under aggressive token reduction while maintaining VQA performance.
Learning Agent Routing From Early Experience
This paper introduces BoundaryRouter, a training-free framework that optimizes LLM agent usage by routing queries to either lightweight inference or full agent execution based on early experience. It also presents RouteBench, a benchmark for evaluating routing performance, showing significant improvements in speed and accuracy.
StableVLA: Towards Robust Vision-Language-Action Models without Extra Data
This paper introduces an Information Bottleneck Adapter (IB-Adapter) for Vision-Language-Action (VLA) models to improve robustness against unseen visual disturbances without requiring extra data, achieving up to 30% improvement with minimal parameter overhead.
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL is a unified vision-language-action framework that compresses chain-of-thought (CoT) reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at the same inference latency as answer-only decoding. It is the first latent CoT method to surpass explicit CoT across four benchmarks.