This paper investigates seemingly contradictory findings on whether large vision-language models (LVLMs) can coordinate efficient referring expressions. The authors show that models can achieve efficiency when explicitly prompted, but fail to infer the need for efficiency from implicit prompts, revealing key differences between human and AI communication.
UniDoc-RL is a reinforcement learning framework for large vision-language models (LVLMs) that optimizes retrieval, reranking, and visual reasoning through hierarchical decision-making and dense multi-reward supervision, achieving improvements of up to 17.7% over prior RL-based methods on visual RAG tasks.