
EasyVideoR1: Easier RL for Video Understanding

April 18, 2026
Authors: Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang
cs.AI

Abstract

Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored to the video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47× throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online training paradigm that combines curated high-quality trajectories with on-policy exploration, which benefits learning on more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
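Contribution (1) amounts to decoding and preprocessing each video once, offline, and serving cached tensors during RL rollouts instead of re-decoding the raw file at every step. A minimal sketch of that idea, with hypothetical paths and helper names (the abstract does not specify the framework's actual API), using memory-mapped NumPy arrays as the cache format:

```python
# Sketch of offline preprocessing + tensor caching, assuming a hypothetical
# cache layout; the real EasyVideoR1 pipeline's interfaces are not shown
# in the abstract.
import os
import numpy as np

CACHE_DIR = "video_tensor_cache"

def cache_frames(video_id: str, frames: np.ndarray) -> str:
    """Offline step: decode/resize once, persist the frame tensor to disk."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, f"{video_id}.npy")
    np.save(path, frames)
    return path

def load_frames(video_id: str) -> np.ndarray:
    """Training step: memory-map the cached tensor instead of re-decoding.

    mmap_mode="r" lets each rollout read only the frames it touches,
    avoiding both repeated video decoding and full in-memory copies.
    """
    return np.load(os.path.join(CACHE_DIR, f"{video_id}.npy"), mmap_mode="r")
```

Every subsequent epoch then pays only a disk read rather than a full decode, which is where a throughput gain of the kind reported (1.47×) would come from.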
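Contribution (2), "unified routing with modular extension" over 11 problem types, suggests a registry that dispatches each sample to a task-specific verifiable reward function. A minimal sketch under that assumption; the task labels and function names below are illustrative, not from the paper:

```python
# Hypothetical sketch of task-aware reward routing; task names and
# verifier logic are assumptions, not EasyVideoR1's actual reward system.
from typing import Callable, Dict

def mc_accuracy_reward(pred: str, answer: str) -> float:
    # Multiple-choice tasks: exact match on the option letter.
    return 1.0 if pred.strip().upper() == answer.strip().upper() else 0.0

def numeric_reward(pred: str, answer: str, tol: float = 1e-3) -> float:
    # Numeric-answer tasks: match within a small tolerance.
    try:
        return 1.0 if abs(float(pred) - float(answer)) <= tol else 0.0
    except ValueError:
        return 0.0

# Unified router: new problem types extend the system by registering here.
REWARD_ROUTER: Dict[str, Callable[[str, str], float]] = {
    "video_multiple_choice": mc_accuracy_reward,
    "video_numeric_qa": numeric_reward,
}

def compute_reward(task_type: str, pred: str, answer: str) -> float:
    """Single entry point: dispatch to the task-specific verifier."""
    fn = REWARD_ROUTER.get(task_type)
    if fn is None:
        raise KeyError(f"no reward function registered for {task_type!r}")
    return fn(pred, answer)
```

A dict-based registry like this keeps each verifier self-contained, so covering additional video or image problem types is a one-line registration rather than a change to the training loop.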
April 22, 2026