Tag
This paper introduces MODE-RAG, a multi-agent system that uses Variational Free Energy and Monte Carlo Tree Search to dynamically gate interventions that mitigate hallucinations in multimodal retrieval-augmented generation (RAG) systems. It also presents the ModeVent evaluation dataset.
This paper analyzes thinking-answer inconsistency in multimodal reinforcement learning with verifiable rewards (RLVR) for large vision-language models. It proposes CORA, a method that introduces a consistency reward model and hybrid reward advantage splitting to improve faithfulness and task performance.
VaaWIT is an end-to-end framework that enhances large vision-language models for multilingual web image translation through dual-stream attention and visual-aware adapters, outperforming state-of-the-art baselines.
VideoSeeker introduces a paradigm for instance-level video understanding that integrates agentic reasoning with visual prompts. Using automated data synthesis and reinforcement learning, it outperforms GPT-4o and Gemini-2.5-Pro.
This paper investigates applying large vision-language models to remote sensing imagery for built-environment reasoning tasks such as design suggestions and risk identification. It evaluates models such as InternVL and Qwen, highlighting their potential to support smart-city decision-making and quantitative reasoning.