EasyVideoR1: Easier RL for Video Understanding
April 18, 2026
Authors: Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang
cs.AI
Abstract
Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important, yet it remains largely unexplored due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored to the video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47× throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online training paradigm that combines curated high-quality trajectories with on-policy exploration, which benefits learning on more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
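The idea behind contribution (1) — decode each video once offline and reuse a cached tensor across all subsequent rollouts — can be illustrated with a minimal, stdlib-only sketch. The names (`CACHE_DIR`, `decode_video`, `cached_frames`) and the fabricated "frames" are illustrative assumptions, not EasyVideoR1's actual API; a real pipeline would decode with ffmpeg/torchvision and store frame tensors.

```python
import hashlib
import os
import pickle
import tempfile

# Hypothetical on-disk cache location for preprocessed frame data.
CACHE_DIR = tempfile.mkdtemp(prefix="video_cache_")

def decode_video(path: str) -> list[str]:
    """Stand-in for the expensive decode + preprocess step.

    A real implementation would run video decoding and resizing and
    return frame tensors; here we fabricate frame identifiers so the
    caching logic itself is runnable.
    """
    return [f"{path}#frame{i}" for i in range(4)]

def cached_frames(path: str) -> list[str]:
    """Return preprocessed frames, decoding at most once per video."""
    key = hashlib.sha256(path.encode()).hexdigest()
    cache_file = os.path.join(CACHE_DIR, key + ".pkl")
    if os.path.exists(cache_file):
        # Cache hit: every later epoch/rollout skips decoding entirely.
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    frames = decode_video(path)  # cache miss: decode exactly once
    with open(cache_file, "wb") as f:
        pickle.dump(frames, f)
    return frames
```

Because RL training revisits the same videos across many rollouts and epochs, amortizing the decode cost this way is where a throughput gain of the kind the paper reports (1.47×) would come from.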