EasyVideoR1: Easier RL for Video Understanding
Summary
EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks, featuring offline preprocessing with tensor caching for 1.47x throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.
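The offline preprocessing with tensor caching mentioned above can be illustrated with a minimal sketch: decode and preprocess each video once, persist the resulting frame tensors, and serve cache hits on every later epoch. The function names, cache layout, and `.npz` format here are illustrative assumptions, not EasyVideoR1's actual API.

```python
import hashlib
import os

import numpy as np

CACHE_DIR = "video_tensor_cache"  # hypothetical cache location


def cache_path(video_path: str) -> str:
    # One cache file per video, keyed by a hash of its path.
    key = hashlib.sha256(video_path.encode()).hexdigest()
    return os.path.join(CACHE_DIR, f"{key}.npz")


def load_frames(video_path: str, decode_fn) -> np.ndarray:
    """Decode a video once; serve cached tensors on subsequent calls."""
    path = cache_path(video_path)
    if os.path.exists(path):
        return np.load(path)["frames"]  # cache hit: no decoding
    frames = decode_fn(video_path)      # cache miss: decode + preprocess
    os.makedirs(CACHE_DIR, exist_ok=True)
    np.savez(path, frames=frames)
    return frames
```

Because RL training revisits the same videos across many rollouts and epochs, amortizing decoding this way is where a throughput gain of the reported magnitude would plausibly come from.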
Source: https://huggingface.co/papers/2604.16893
Abstract
EasyVideoR1 presents an efficient reinforcement learning framework for video understanding that improves training throughput, supports diverse video tasks, and enables joint image-video training with comprehensive evaluation across multiple benchmarks.
Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored for the video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47x throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
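A task-aware reward system with unified routing, as contribution (2) describes, can be sketched as a registry that dispatches each sample to its task-specific reward function. The two task types and reward functions below are illustrative assumptions; they are not drawn from EasyVideoR1's actual 11 problem types or its API.

```python
from typing import Callable, Dict

# Registry mapping task type -> reward function. "Modular extension"
# would mean new tasks simply register another entry. (Illustrative only.)
REWARD_REGISTRY: Dict[str, Callable[[str, str], float]] = {}


def register_reward(task_type: str):
    def decorator(fn):
        REWARD_REGISTRY[task_type] = fn
        return fn
    return decorator


@register_reward("multiple_choice")
def mc_reward(prediction: str, answer: str) -> float:
    # Verifiable exact-match reward for option letters like "A".
    return 1.0 if prediction.strip().upper() == answer.strip().upper() else 0.0


@register_reward("temporal_grounding")
def iou_reward(prediction: str, answer: str) -> float:
    # Temporal IoU between "start,end" second ranges.
    ps, pe = map(float, prediction.split(","))
    gs, ge = map(float, answer.split(","))
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return inter / union if union > 0 else 0.0


def compute_reward(task_type: str, prediction: str, answer: str) -> float:
    # Unified routing: each rollout is scored by its task's reward function.
    return REWARD_REGISTRY[task_type](prediction, answer)
```

The design point is that a single `compute_reward` entry point lets one RL training loop serve heterogeneous video and image tasks, with per-task verification logic kept separate and independently testable.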
Get this paper in your agent:
hf papers read 2604.16893
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards
UniDoc-RL presents a reinforcement learning framework for Large Vision-Language Models that optimizes retrieval, reranking, and visual reasoning through hierarchical decision-making and dense multi-reward supervision, achieving up to 17.7% improvements over prior RL-based methods on visual RAG tasks.
Visual Reasoning through Tool-supervised Reinforcement Learning
Introduces ToolsRL, a two-stage reinforcement learning framework that teaches multimodal LLMs to use simple visual tools for complex visual reasoning tasks.
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 introduces a reliability-perplexity aware reward distillation framework for streaming video generation that adaptively weights supervision to improve visual and motion quality without additional computational overhead.
RL²: Fast reinforcement learning via slow reinforcement learning
RL² proposes encoding a fast reinforcement learning algorithm as the weights of a recurrent neural network, learned through slow general-purpose RL, enabling agents to adapt to new tasks with few trials similar to biological learning. The method demonstrates strong performance on both small-scale bandit problems and large-scale vision-based navigation tasks.