ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering
Summary
Proposes ProMSA, a progressive multimodal search agent for knowledge-based visual question answering that adaptively selects search strategies and optimizes through sequence-level reinforcement learning, achieving consistent gains on E-VQA and InfoSeek.
Cached at: 06/29/26, 06:01 AM
Source: https://huggingface.co/papers/2606.27974
Abstract
A progressive multimodal search agent for knowledge-based visual question answering that adaptively selects search strategies and optimizes through sequence-level reinforcement learning.
Knowledge-based Visual Question Answering (KB-VQA) requires models to combine image understanding with external knowledge. Most prior methods use a fixed retrieve-then-generate pipeline with a pre-selected retriever and a static top-k setting, which is not adaptive during reasoning. We propose ProMSA, a progressive multimodal search agent for KB-VQA. Given an image-question pair, the agent iteratively chooses image search, text search, or stop, under explicit tool-call budgets and with deduplication to avoid redundant retrieval. For training, we first use rejection-sampling SFT to learn valid tool-use formats, then optimize the agent with TN-GSPO, a sequence-level RL objective that normalizes updates by both generation length and tool-interaction depth. Experiments on E-VQA and InfoSeek show consistent gains over strong RAG and agent baselines, and improved retrieval and end-to-end accuracy. The code is available at https://github.com/DingWu1021/Promsa.
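The search loop described in the abstract (choose image search, text search, or stop, subject to per-tool budgets and deduplication) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the action names, the `SearchState` container, the `run_agent` function, the `max_steps` cap, and the choice to end the episode when a budget is exhausted are all assumptions made for illustration, and the policy and tools are toy stand-ins for a trained model and real retrievers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical action names; the paper's actual tool-call format is not shown here.
IMAGE_SEARCH, TEXT_SEARCH, STOP = "image_search", "text_search", "stop"


@dataclass
class SearchState:
    """Evidence gathered so far plus per-tool call counts (illustrative container)."""
    evidence: List[str] = field(default_factory=list)
    seen: Set[str] = field(default_factory=set)
    calls: Dict[str, int] = field(default_factory=dict)


Policy = Callable[[SearchState], Tuple[str, str]]  # returns (action, query)
Tool = Callable[[str], List[str]]                  # returns retrieved documents


def run_agent(policy: Policy, tools: Dict[str, Tool],
              budgets: Dict[str, int], max_steps: int = 8) -> SearchState:
    """Iterate until the policy stops, a budget runs out, or max_steps is hit."""
    state = SearchState(calls={name: 0 for name in tools})
    for _ in range(max_steps):
        action, query = policy(state)
        if action == STOP:
            break
        if action not in tools:
            raise ValueError(f"unknown action: {action}")
        # Assumption: an exhausted budget ends the episode rather than
        # forcing the agent to pick a different tool.
        if state.calls[action] >= budgets.get(action, 0):
            break
        state.calls[action] += 1
        for doc in tools[action](query):
            # Deduplication: keep only documents not retrieved before.
            if doc not in state.seen:
                state.seen.add(doc)
                state.evidence.append(doc)
    return state


def toy_policy(state: SearchState) -> Tuple[str, str]:
    """Toy stand-in for the model: one image search, one text search, then stop."""
    step = sum(state.calls.values())
    if step == 0:
        return IMAGE_SEARCH, "query image"
    if step == 1:
        return TEXT_SEARCH, "follow-up text query"
    return STOP, ""


toy_tools: Dict[str, Tool] = {
    IMAGE_SEARCH: lambda q: ["doc_a", "doc_b"],
    TEXT_SEARCH: lambda q: ["doc_b", "doc_c"],  # doc_b overlaps and is dropped
}

final = run_agent(toy_policy, toy_tools, budgets={IMAGE_SEARCH: 1, TEXT_SEARCH: 2})
```

In this toy run, `final.evidence` holds three unique documents even though four were retrieved, and `final.calls` records one call per tool. In a real system the policy would be the trained agent and the per-tool budgets would follow whatever limits the paper specifies.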
Similar Articles
Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning
Visual-Seeker proposes a visual-native multimodal deep search agent that actively reasons over fine-grained visual details and synthesizes multimodal evidence, achieving state-of-the-art performance on five challenging multimodal search benchmarks.
MMSkills: Towards Multimodal Skills for General Visual Agents
This paper introduces MMSkills, a framework for representing, generating, and using multimodal procedural knowledge for visual agents, combining textual procedures with visual state cards and keyframes, and demonstrates improvements in GUI and game-based visual agent benchmarks.
Self-Evolving Visual Questioner
This paper introduces a self-evolving framework for vision-language models to improve their question-generation capabilities without external supervision, enhancing both question quality and answerer performance.
SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory
SuperMemory-VQA is a new egocentric VQA benchmark featuring 52.9 hours of AI-glasses footage and 4,853 QA pairs designed to evaluate AI assistants on long-horizon memory tasks spanning object recall, intent, timelines, and conversations. Benchmarking reveals existing agentic frameworks and LLMs remain far from reliable on these real-world memory challenges.
Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
Introduces MultiSearch, an RL-based framework that generates multiple queries at each reasoning step and explicitly merges retrieved information to improve signal-to-noise ratio and reasoning accuracy in question-answering tasks.