UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Hugging Face Daily Papers

Summary

UniDoc-RL presents a reinforcement learning framework for Large Vision-Language Models that optimizes retrieval, reranking, and visual reasoning through hierarchical decision-making and dense multi-reward supervision, achieving up to 17.7% improvements over prior RL-based methods on visual RAG tasks.

Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
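The coarse-to-fine action hierarchy described above (document retrieval, then image selection, then region cropping) can be illustrated with a small sketch. This is not the UniDoc-RL implementation: the `Page` type, the two score fields, and the `retrieve`, `select`, and `crop` helpers are hypothetical names introduced here, and the scores are hard-coded where the real agent would produce them with a retriever and an LVLM.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Page:
    """A candidate page image (hypothetical structure, for illustration only)."""
    page_id: str
    retrieval_score: float   # coarse score from a generic retriever (assumed)
    relevance_score: float   # finer relevance judged by the agent (assumed)
    pixels: List[List[int]]  # 2D grid standing in for image data


def retrieve(corpus: List[Page], k: int) -> List[Page]:
    """Coarse action: keep the top-k pages by retrieval score."""
    return sorted(corpus, key=lambda p: p.retrieval_score, reverse=True)[:k]


def select(candidates: List[Page], k: int) -> List[Page]:
    """Finer action: rerank the retrieved pages and keep the top-k."""
    return sorted(candidates, key=lambda p: p.relevance_score, reverse=True)[:k]


def crop(page: Page, box: Tuple[int, int, int, int]) -> List[List[int]]:
    """Finest action: cut out the region (left, top, right, bottom)."""
    left, top, right, bottom = box
    return [row[left:right] for row in page.pixels[top:bottom]]


def make_image(width: int, height: int) -> List[List[int]]:
    """Build a toy image whose pixel value encodes its position."""
    return [[y * width + x for x in range(width)] for y in range(height)]


corpus = [
    Page("a", retrieval_score=0.9, relevance_score=0.2, pixels=make_image(4, 4)),
    Page("b", retrieval_score=0.8, relevance_score=0.7, pixels=make_image(4, 4)),
    Page("c", retrieval_score=0.1, relevance_score=0.9, pixels=make_image(4, 4)),
]

candidates = retrieve(corpus, k=2)       # coarse: narrow the corpus
chosen = select(candidates, k=1)         # fine: pick the most relevant page
region = crop(chosen[0], (1, 1, 3, 3))   # finest: focus on one region
```

Each step only sees what the previous step kept, so an early retrieval miss (page "c" here) cannot be recovered later. That dependency is one reason per-action supervision is plausible as a training signal.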

Cached at: 04/20/26, 08:28 AM


Source: https://huggingface.co/papers/2604.14967

Abstract

UniDoc-RL introduces a reinforcement learning framework for LVLMs that jointly optimizes retrieval, reranking, visual perception, and reasoning through hierarchical decision-making and dense multi-reward supervision.

Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
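To make the "no separate value network" point concrete, here is a minimal sketch of the group-relative advantage commonly used in GRPO: several rollouts are sampled for the same query, and each rollout's reward is normalized by the mean and standard deviation of its group. The abstract does not specify how the dense rewards are combined, so the component names (`retrieval`, `selection`, `crop`, `answer`), the weights, and the weighted-sum rule below are assumptions for illustration, not the paper's reward design.

```python
import statistics
from typing import Dict, List

# Hypothetical per-action reward weights (not taken from the paper).
WEIGHTS: Dict[str, float] = {
    "retrieval": 0.2,
    "selection": 0.2,
    "crop": 0.2,
    "answer": 0.4,
}


def combine_rewards(components: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Collapse per-action rewards into one scalar via an assumed weighted sum."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own group: (r - mean) / (std + eps).

    The group statistics act as the baseline, so no learned value
    network is required. Population std is used here; implementations
    may differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One group: four rollouts sampled for the same query (toy values).
group = [
    {"retrieval": 1.0, "selection": 1.0, "crop": 1.0, "answer": 1.0},
    {"retrieval": 1.0, "selection": 1.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 1.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
    {"retrieval": 0.0, "selection": 0.0, "crop": 0.0, "answer": 0.0},
]

scalar_rewards = [combine_rewards(c, WEIGHTS) for c in group]
advantages = group_relative_advantages(scalar_rewards)
```

Rollouts that beat their group's average receive a positive advantage and the rest a negative one. When every rollout in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.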

View arXiv page (https://arxiv.org/abs/2604.14967)

View PDF (https://arxiv.org/pdf/2604.14967)

GitHub (https://github.com/deepglint/UniDoc-RL)

Get this paper in your agent:

hf papers read 2604.14967

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

EasyVideoR1: Easier RL for Video Understanding

Hugging Face Daily Papers

EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks, featuring offline preprocessing with tensor caching for 1.47x throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.

Reinforcing Multimodal Reasoning Against Visual Degradation

Hugging Face Daily Papers

This paper introduces ROMA, an RL fine-tuning framework that enhances the robustness of multimodal large language models against visual degradations like blur and compression artifacts. It achieves this through a dual-forward-pass strategy and specialized regularization techniques, improving performance on reasoning benchmarks without sacrificing accuracy on clean inputs.

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Hugging Face Daily Papers

OpenWebRL presents an open framework for training visual web agents using online multi-turn reinforcement learning on real websites, achieving state-of-the-art performance with minimal initial supervision. Their 4B-parameter model outperforms prior open agents and competes with proprietary systems like OpenAI CUA and Gemini CUA.

Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

Hugging Face Daily Papers

This paper proposes Hierarchical Advantage-Weighted Behavior Cloning (HABC) for fine-tuning Vision-Language-Action (VLA) policies using online reinforcement learning with sparse binary episode outcomes. HABC separates viability and efficiency objectives via adaptive critic heads and intervention-aware credit assignment, significantly improving success rates on contact-rich bimanual manipulation tasks.