INAR-VL: Input-Aware Routing for Edge-Cloud Vision-Language Inference

arXiv cs.LG Papers

Summary

INAR-VL is a lightweight routing system for edge-cloud vision-language inference that chooses between edge and cloud models per query, based on image and text complexity. On visual question answering it runs 36% of requests on the edge, cuts latency by 24% and energy by 26%, and preserves 97% of cloud-level accuracy.

arXiv:2605.18853v1

Abstract: Edge deployment of Vision-Language Models (VLMs) faces a trade-off between latency and accuracy: cloud execution provides high-quality predictions but incurs communication delay and energy cost, while edge-only execution is faster but less accurate due to limited model capacity. This trade-off is further complicated by heterogeneity in image quality and reasoning complexity, making static placement suboptimal. We present INAR-VL, a lightweight edge-cloud routing system for multimodal inference in a two-tier deployment. INAR-VL maintains complementary VLMs across edge and cloud and uses lightweight image and text complexity signals to guide routing and model selection, executing simple queries locally while offloading complex ones when beneficial. Evaluation on visual question answering shows that INAR-VL executes 36% of requests on the edge, reduces latency by 24%, lowers energy by 26%, and preserves 97% of cloud-level accuracy.
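The abstract does not say how the complexity signals are computed or combined, so the following is only a minimal sketch of input-aware routing in general, not the INAR-VL method. Every name, weight, and threshold in it (Query, image_sharpness, REASONING_CUES, the 0.5 weights, the 0.5 threshold) is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """A VQA request. Both fields are hypothetical stand-ins."""
    question: str
    image_sharpness: float  # assumed precomputed, 0.0 (degraded) to 1.0 (clean)


# Assumed keyword list hinting at multi-step reasoning; not from the paper.
REASONING_CUES = ("why", "how many", "compare", "explain", "count")


def text_complexity(question: str) -> float:
    """Score in [0, 1] from question length and reasoning cues (assumed heuristic)."""
    lowered = question.lower()
    length_score = min(len(lowered.split()) / 20.0, 1.0)
    cue_score = 1.0 if any(cue in lowered for cue in REASONING_CUES) else 0.0
    return 0.5 * length_score + 0.5 * cue_score


def image_complexity(sharpness: float) -> float:
    """Score in [0, 1]; a lower-quality image is treated as harder (assumption)."""
    clamped = max(0.0, min(sharpness, 1.0))
    return 1.0 - clamped


def route(query: Query, threshold: float = 0.5) -> str:
    """Return "edge" for simple queries and "cloud" for complex ones."""
    score = (0.5 * text_complexity(query.question)
             + 0.5 * image_complexity(query.image_sharpness))
    return "cloud" if score > threshold else "edge"
```

With these assumed settings, route(Query("What color is the car?", 0.9)) stays on the edge, while a long counting question over a blurry image is offloaded. A real system would also have to weigh network conditions and energy cost, which this sketch leaves out.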

Similar Articles

Learning Agent Routing From Early Experience

arXiv cs.CL

This paper introduces BoundaryRouter, a training-free framework that optimizes LLM agent usage by routing queries to either lightweight inference or full agent execution based on early experience. It also presents RouteBench, a benchmark for evaluating routing performance, showing significant improvements in speed and accuracy.

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Hugging Face Daily Papers

OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.