EasyVideoR1: Easier RL for Video Understanding
Summary
EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks, featuring offline preprocessing with tensor caching for 1.47x throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.
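The offline preprocessing with tensor caching mentioned above can be illustrated with a minimal sketch: decode and preprocess each video once, persist the resulting frame tensors, and serve cache hits on every later epoch. The function names, cache layout, and `.npz` format here are illustrative assumptions, not EasyVideoR1's actual API.

```python
import hashlib
import os

import numpy as np

CACHE_DIR = "video_tensor_cache"  # hypothetical cache location


def cache_path(video_path: str) -> str:
    # One cache file per video, keyed by a hash of its path.
    key = hashlib.sha256(video_path.encode()).hexdigest()
    return os.path.join(CACHE_DIR, f"{key}.npz")


def load_frames(video_path: str, decode_fn) -> np.ndarray:
    """Decode a video once; serve cached tensors on subsequent calls."""
    path = cache_path(video_path)
    if os.path.exists(path):
        return np.load(path)["frames"]  # cache hit: no decoding
    frames = decode_fn(video_path)      # cache miss: decode + preprocess
    os.makedirs(CACHE_DIR, exist_ok=True)
    np.savez(path, frames=frames)
    return frames
```

Because RL training revisits the same videos across many rollouts and epochs, amortizing decoding this way is where a throughput gain of the reported magnitude would plausibly come from.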
Source: https://huggingface.co/papers/2604.16893
Abstract
EasyVideoR1 presents an efficient reinforcement learning framework for video understanding that improves training throughput, supports diverse video tasks, and enables joint image-video training with comprehensive evaluation across multiple benchmarks.
Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored for the video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47x throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
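A task-aware reward system with unified routing, as contribution (2) describes, can be sketched as a registry that dispatches each sample to its task-specific reward function. The two task types and reward functions below are illustrative assumptions; they are not drawn from EasyVideoR1's actual 11 problem types or its API.

```python
from typing import Callable, Dict

# Registry mapping task type -> reward function. "Modular extension"
# would mean new tasks simply register another entry. (Illustrative only.)
REWARD_REGISTRY: Dict[str, Callable[[str, str], float]] = {}


def register_reward(task_type: str):
    def decorator(fn):
        REWARD_REGISTRY[task_type] = fn
        return fn
    return decorator


@register_reward("multiple_choice")
def mc_reward(prediction: str, answer: str) -> float:
    # Verifiable exact-match reward for option letters like "A".
    return 1.0 if prediction.strip().upper() == answer.strip().upper() else 0.0


@register_reward("temporal_grounding")
def iou_reward(prediction: str, answer: str) -> float:
    # Temporal IoU between "start,end" second ranges.
    ps, pe = map(float, prediction.split(","))
    gs, ge = map(float, answer.split(","))
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return inter / union if union > 0 else 0.0


def compute_reward(task_type: str, prediction: str, answer: str) -> float:
    # Unified routing: each rollout is scored by its task's reward function.
    return REWARD_REGISTRY[task_type](prediction, answer)
```

The design point is that a single `compute_reward` entry point lets one RL training loop serve heterogeneous video and image tasks, with per-task verification logic kept separate and independently testable.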
Get this paper in your agent:
hf papers read 2604.16893
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards
UniDoc-RL presents a reinforcement learning framework for Large Vision-Language Models that optimizes retrieval, reranking, and visual reasoning through hierarchical decision-making and dense multi-reward supervision, achieving up to 17.7% improvements over prior RL-based methods on visual RAG tasks.
Visual Reasoning through Tool-supervised Reinforcement Learning
Introduces ToolsRL, a two-stage reinforcement learning framework that teaches multimodal LLMs to use simple visual tools for complex visual reasoning tasks.
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 introduces a reliability-perplexity aware reward distillation framework for streaming video generation that adaptively weights supervision to improve visual and motion quality without additional computational overhead.
RL²: Fast reinforcement learning via slow reinforcement learning
RL² proposes encoding a fast reinforcement learning algorithm as the weights of a recurrent neural network, learned through slow general-purpose RL, enabling agents to adapt to new tasks with few trials similar to biological learning. The method demonstrates strong performance on both small-scale bandit problems and large-scale vision-based navigation tasks.