UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Hugging Face Daily Papers 04/16/26, 12:00 AM Papers

visual-rag reinforcement-learning lvlm hierarchical-actions multi-reward document-retrieval

Summary

UniDoc-RL is a reinforcement learning framework for Large Vision-Language Models that optimizes retrieval, reranking, and visual reasoning through hierarchical decision-making and dense multi-reward supervision, improving on prior RL-based methods by up to 17.7% on visual RAG tasks.

Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
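To make the training signal concrete, here is a minimal sketch of how a dense multi-reward could feed a GRPO-style update without a value network. It is an illustration under stated assumptions, not the authors' implementation: the reward names (`retrieval`, `selection`, `crop`, `answer`), the equal weights, and the sample rollouts are all hypothetical, and the abstract does not say how the per-action rewards are combined. The only part taken from GRPO itself is the idea of standardizing each rollout's reward against the other rollouts sampled for the same query.

```python
from statistics import mean, pstdev

# Hypothetical weights for the per-action rewards; the paper's actual
# reward terms and weights are not stated in the abstract.
REWARD_WEIGHTS = {"retrieval": 0.25, "selection": 0.25, "crop": 0.25, "answer": 0.25}


def total_reward(components, weights=REWARD_WEIGHTS):
    """Combine the per-action rewards of one rollout into a scalar (weighted sum)."""
    return sum(weights[name] * components[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its own group.

    The group mean acts as the baseline, so no learned value network is needed.
    Population standard deviation is used here; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts sampled for the same query, scored per action.
group = [
    {"retrieval": 1.0, "selection": 1.0, "crop": 0.8, "answer": 1.0},
    {"retrieval": 1.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 1.0, "selection": 1.0, "crop": 0.2, "answer": 0.0},
    {"retrieval": 0.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
]

rewards = [total_reward(c) for c in group]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

In this sketch a rollout that retrieves the right document but fails at selection and cropping still scores above one that fails at every step, which is the practical point of dense supervision: partial progress along the coarse-to-fine chain is distinguishable from total failure.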

Original Article

Cached at: 04/20/26, 08:28 AM

Source: https://huggingface.co/papers/2604.14967

arXiv: https://arxiv.org/abs/2604.14967
PDF: https://arxiv.org/pdf/2604.14967
GitHub: https://github.com/deepglint/UniDoc-RL

Similar Articles

EasyVideoR1: Easier RL for Video Understanding

Reinforcing Multimodal Reasoning Against Visual Degradation

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

Visual Reasoning through Tool-supervised Reinforcement Learning
