EasyVideoR1: Easier RL for Video Understanding

Hugging Face Daily Papers

Summary

EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks, featuring offline preprocessing with tensor caching for 1.47x throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.

Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored for video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47 times throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
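The abstract's first contribution, offline preprocessing with tensor caching, amounts to decoding and preprocessing each video once and reusing the resulting tensors across rollouts and epochs. The sketch below illustrates that idea only; the class name, cache layout, and the dummy `decode_and_preprocess` stand-in are assumptions for illustration, not EasyVideoR1's actual API.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

def decode_and_preprocess(video_path: str) -> list:
    # Hypothetical stand-in for expensive video decoding + preprocessing.
    # A real pipeline would return frame tensors; here, dummy data.
    return [[0.0] * 4 for _ in range(8)]  # 8 "frames" of 4 features each

class TensorCache:
    """Decode each video once; later accesses load the cached tensors."""

    def __init__(self, cache_dir: str):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    def _key(self, video_path: str) -> Path:
        # Content-addressed file name derived from the video path.
        digest = hashlib.sha256(video_path.encode()).hexdigest()
        return self.cache_dir / f"{digest}.pkl"

    def load(self, video_path: str) -> list:
        path = self._key(video_path)
        if path.exists():
            self.hits += 1
            with path.open("rb") as f:
                return pickle.load(f)
        self.misses += 1
        tensors = decode_and_preprocess(video_path)
        with path.open("wb") as f:
            pickle.dump(tensors, f)
        return tensors

cache = TensorCache(tempfile.mkdtemp())
first = cache.load("clip_001.mp4")   # decodes and writes to the cache
second = cache.load("clip_001.mp4")  # served from the cache, no re-decode
```

Because RL training revisits the same videos many times across rollouts, amortizing the decode cost this way is what the reported 1.47x throughput gain is attributed to.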
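The "unified routing and modular extension" of the task-aware reward system can be read as a registry that dispatches each sample's problem type to its own verifiable reward function. The following is a minimal sketch under that assumption, showing two of the 11 problem types; the function names and matching rules are illustrative, not the framework's actual implementation.

```python
import re

def multiple_choice_reward(response: str, answer: str) -> float:
    # Reward 1.0 if the response contains the correct option letter.
    match = re.search(r"\b([A-D])\b", response.strip())
    return 1.0 if match and match.group(1) == answer else 0.0

def numeric_reward(response: str, answer: str, tol: float = 1e-2) -> float:
    # Reward 1.0 if the response parses to a number within tolerance.
    try:
        return 1.0 if abs(float(response) - float(answer)) <= tol else 0.0
    except ValueError:
        return 0.0

# Unified routing: the registry maps problem type to reward function,
# so new task types plug in without touching the dispatcher.
REWARD_REGISTRY = {
    "multiple_choice": multiple_choice_reward,
    "numeric": numeric_reward,
}

def route_reward(problem_type: str, response: str, answer: str) -> float:
    return REWARD_REGISTRY[problem_type](response, answer)
```

The registry pattern is what makes the system "modular": adding a 12th problem type is one new function plus one registry entry.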
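The mixed offline-online paradigm (contribution 3) blends curated high-quality trajectories with fresh on-policy rollouts in each training batch. A minimal sketch of that batch construction follows; the function name, trajectory format, and mixing ratio are assumptions for illustration only.

```python
import random

def build_batch(offline_pool, sample_online, batch_size, offline_ratio=0.5):
    """Mix curated offline trajectories with fresh on-policy rollouts."""
    n_offline = min(int(batch_size * offline_ratio), len(offline_pool))
    batch = random.sample(offline_pool, n_offline)
    # Fill the remainder with on-policy exploration.
    batch += [sample_online() for _ in range(batch_size - n_offline)]
    random.shuffle(batch)
    return batch

offline_pool = [{"source": "offline", "id": i} for i in range(10)]
batch = build_batch(offline_pool, lambda: {"source": "online"}, batch_size=8)
```

The intuition from the abstract is that the offline half anchors learning on harder tasks where pure on-policy exploration rarely stumbles onto a reward.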


Paper page - EasyVideoR1: Easier RL for Video Understanding

Source: https://huggingface.co/papers/2604.16893

Abstract

EasyVideoR1 presents an efficient reinforcement learning framework for video understanding that improves training throughput, supports diverse video tasks, and enables joint image-video training with comprehensive evaluation across multiple benchmarks.



Get this paper in your agent:

hf papers read 2604.16893

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Hugging Face Daily Papers

UniDoc-RL presents a reinforcement learning framework for Large Vision-Language Models that optimizes retrieval, reranking, and visual reasoning through hierarchical decision-making and dense multi-reward supervision, achieving up to 17.7% improvements over prior RL-based methods on visual RAG tasks.

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Hugging Face Daily Papers

OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.

RL²: Fast reinforcement learning via slow reinforcement learning

OpenAI Blog

RL² proposes encoding a fast reinforcement learning algorithm as the weights of a recurrent neural network, learned through slow general-purpose RL, enabling agents to adapt to new tasks with few trials similar to biological learning. The method demonstrates strong performance on both small-scale bandit problems and large-scale vision-based navigation tasks.