large-vision-language-models

#large-vision-language-models

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

arXiv cs.CL · yesterday

Introduces MODE-RAG, a multi-agent system using Variational Free Energy and Monte Carlo Tree Search to dynamically gate interventions for mitigating hallucinations in Multimodal Retrieval-Augmented Generation systems, along with the ModeVent evaluation dataset.

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

arXiv cs.CL · 3d ago

This paper analyzes the thinking-answer inconsistency in multimodal reinforcement learning with verifiable rewards (RLVR) for large vision-language models and proposes CORA, a method that introduces a consistency reward model and hybrid reward advantage splitting to improve faithfulness and task performance.

VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation

Hugging Face Daily Papers · 2026-05-23

VaaWIT is an end-to-end framework that enhances large vision-language models for multilingual web image translation using dual-stream attention and visual-aware adapters; it is reported to outperform state-of-the-art baselines.

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Hugging Face Daily Papers · 2026-05-15

VideoSeeker introduces a paradigm for instance-level video understanding that integrates agentic reasoning with visual prompts. Trained with automated data synthesis and reinforcement learning, it is reported to outperform GPT-4o and Gemini-2.5-Pro.

Built Environment Reasoning from Remote Sensing Imagery Using Large Vision-Language Models

arXiv cs.CL · 2026-05-12

This paper investigates applying large vision-language models to remote sensing imagery for built environment reasoning tasks such as design suggestions and risk identification. It evaluates models like InternVL and Qwen, highlighting their potential for supporting smart city decision-making and quantitative reasoning.
