ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering

Hugging Face Daily Papers

Summary

Proposes ProMSA, a progressive multimodal search agent for knowledge-based visual question answering that adaptively selects search strategies and optimizes through sequence-level reinforcement learning, achieving consistent gains on E-VQA and InfoSeek.

Knowledge-based Visual Question Answering (KB-VQA) requires models to combine image understanding with external knowledge. Most prior methods use a fixed retrieve-then-generate pipeline with a pre-selected retriever and a static top-k setting, which is not adaptive during reasoning. We propose ProMSA, a progressive multimodal search agent for KB-VQA. Given an image-question pair, the agent iteratively chooses image search, text search, or stop, under explicit tool-call budgets and with deduplication to avoid redundant retrieval. For training, we first use rejection-sampling SFT to learn valid tool-use formats, then optimize the agent with TN-GSPO, a sequence-level RL objective that normalizes updates by both generation length and tool-interaction depth. Experiments on E-VQA and InfoSeek show consistent gains over strong RAG and agent baselines, and improved retrieval and end-to-end accuracy. The code is available at https://github.com/DingWu1021/Promsa.
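The progressive search loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the action names, the `SearchState` container, the `run_agent` function, the toy policy, and the toy tools are all hypothetical, and stopping as soon as the chosen tool has exhausted its budget is an assumption, since the abstract does not say how ProMSA handles that case.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical action names; the paper's actual tool-call format is not shown here.
IMAGE_SEARCH, TEXT_SEARCH, STOP = "image_search", "text_search", "stop"


@dataclass
class SearchState:
    """Everything the policy can see when choosing its next action."""
    question: str
    evidence: List[str] = field(default_factory=list)  # deduplicated results, in order
    seen: Set[str] = field(default_factory=set)        # ids of results already kept
    calls: Dict[str, int] = field(
        default_factory=lambda: {IMAGE_SEARCH: 0, TEXT_SEARCH: 0}
    )


Policy = Callable[[SearchState], Tuple[str, str]]  # returns (action, query)
Tool = Callable[[object, str], List[str]]          # (image, query) -> result ids


def run_agent(question: str, image: object, policy: Policy,
              tools: Dict[str, Tool], budgets: Dict[str, int]) -> SearchState:
    """Iteratively pick image search, text search, or stop, under per-tool budgets."""
    state = SearchState(question)
    while True:
        action, query = policy(state)
        if action == STOP or action not in tools:
            break
        # Assumption: an exhausted budget ends the episode instead of re-prompting.
        if state.calls[action] >= budgets[action]:
            break
        state.calls[action] += 1
        for doc in tools[action](image, query):
            if doc not in state.seen:  # deduplication: skip already-retrieved results
                state.seen.add(doc)
                state.evidence.append(doc)
    return state


def toy_policy(state: SearchState) -> Tuple[str, str]:
    """Stand-in for the trained agent: one image search, two text searches, stop."""
    if state.calls[IMAGE_SEARCH] == 0:
        return IMAGE_SEARCH, ""
    if state.calls[TEXT_SEARCH] < 2:
        return TEXT_SEARCH, state.question
    return STOP, ""


# Toy retrievers that return fixed result ids; a real system would query an index.
toy_tools: Dict[str, Tool] = {
    IMAGE_SEARCH: lambda image, query: ["doc:eiffel_tower"],
    TEXT_SEARCH: lambda image, query: ["doc:eiffel_tower", "doc:gustave_eiffel"],
}

final = run_agent(
    "Who designed this tower?",
    image=None,
    policy=toy_policy,
    tools=toy_tools,
    budgets={IMAGE_SEARCH: 1, TEXT_SEARCH: 2},
)
print(final.evidence)  # ['doc:eiffel_tower', 'doc:gustave_eiffel']
print(final.calls)     # {'image_search': 1, 'text_search': 2}
```

In this toy run the second text search returns nothing new, so the evidence list stays at two entries while the call counter still advances. This shows that deduplication and budgeting are tracked separately.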

Cached at: 06/29/26, 06:01 AM


Source: https://huggingface.co/papers/2606.27974


