INAR-VL: Input-Aware Routing for Edge-Cloud Vision-Language Inference

arXiv cs.LG 05/20/26, 04:00 AM Papers

edge-computing vision-language-models inference-routing cloud-edge multimodal latency-energy

Summary

INAR-VL proposes a lightweight routing system for edge-cloud vision-language inference that dynamically selects between edge and cloud models based on query complexity, achieving significant latency and energy reductions while preserving near-cloud accuracy.

arXiv:2605.18853v1 Announce Type: new Abstract: Edge deployment of Vision-Language Models (VLMs) faces a tradeoff between latency and accuracy: cloud execution provides high-quality predictions but incurs communication delay and energy cost, while edge-only execution is faster but less accurate due to limited model capacity. This trade-off is further complicated by heterogeneity in image quality and reasoning complexity, making static placement suboptimal. We present INAR-VL, a lightweight edge-cloud routing system for multimodal inference in a two-tier deployment. INAR-VL maintains complementary VLMs across edge and cloud and uses lightweight image and text complexity signals to guide routing and model selection, executing simple queries locally while offloading complex ones when beneficial. Evaluation on visual question answering shows that INAR-VL executes 36% of requests on the edge, reduces latency by 24%, lowers energy by 26%, and preserves 97% of cloud-level accuracy.

Original Article

Similar Articles

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Hugging Face Daily Papers

VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Hugging Face Daily Papers

Proposes Reroute, a training-free plug-in for vision-language models that replaces irreversible visual-token pruning with recoverable routing, allowing tokens to re-enter the pipeline later to improve grounding under aggressive token reduction while maintaining VQA performance.

Learning Agent Routing From Early Experience

arXiv cs.CL

This paper introduces BoundaryRouter, a training-free framework that optimizes LLM agent usage by routing queries to either lightweight inference or full agent execution based on early experience. It also presents RouteBench, a benchmark for evaluating routing performance, showing significant improvements in speed and accuracy.

StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

Hugging Face Daily Papers

This paper introduces an Information Bottleneck Adapter (IB-Adapter) for Vision-Language-Action (VLA) models to improve robustness against unseen visual disturbances without requiring extra data, achieving up to 30% improvement with minimal parameter overhead.

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Hugging Face Daily Papers

OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.

Similar Articles

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Learning Agent Routing From Early Experience

StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Submit Feedback