VideoSSR: Video Self-Supervised Reinforcement Learning
November 9, 2025
Authors: Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, Yu Cheng
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has substantially
advanced the video understanding capabilities of Multimodal Large Language
Models (MLLMs). However, the rapid progress of MLLMs is outpacing the
complexity of existing video datasets, while the manual annotation of new,
high-quality data remains prohibitively expensive. This work investigates a
pivotal question: Can the rich, intrinsic information within videos be
harnessed to self-generate high-quality, verifiable training data? To
answer it, we introduce three self-supervised pretext tasks: Anomaly
Grounding, Object Counting, and Temporal Jigsaw. We construct the Video
Intrinsic Understanding Benchmark (VIUBench) to validate their difficulty,
revealing that current state-of-the-art MLLMs struggle significantly on these
tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset
and propose VideoSSR, a novel video self-supervised reinforcement learning
framework for RLVR. Extensive experiments across 17 benchmarks, spanning four
major video domains (General Video QA, Long Video QA, Temporal Grounding, and
Complex Reasoning), demonstrate that VideoSSR consistently enhances model
performance, yielding an average improvement of over 5%. These results
establish VideoSSR as a potent foundational framework for developing more
advanced video understanding in MLLMs. The code is available at
https://github.com/lcqysl/VideoSSR.
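
As a rough illustration of how a pretext task such as Temporal Jigsaw can yield training data whose answers are verifiable without human labels, the minimal Python sketch below shuffles segment indices and scores a predicted ordering with a binary reward. This is an assumption-laden sketch and not the authors' implementation: the function names `make_temporal_jigsaw` and `jigsaw_reward`, the exact-match reward, and the answer format are all hypothetical, and the abstract does not specify how the tasks or rewards are actually defined.

```python
import random


def make_temporal_jigsaw(num_segments, seed=None):
    """Hypothetical helper: shuffle segment indices for one video.

    Returns a list where element i is the ORIGINAL position of the
    segment shown in slot i. The permutation itself is the label,
    so no manual annotation is needed.
    """
    rng = random.Random(seed)
    order = list(range(num_segments))
    rng.shuffle(order)
    return order


def jigsaw_reward(predicted, shuffled_order):
    """Hypothetical exact-match reward (an assumption, not the paper's).

    `predicted` should list the shown slots in their original temporal
    order. Returns 1.0 for an exact match and 0.0 otherwise.
    """
    # Ground truth: shown slots sorted by their original position.
    answer = sorted(range(len(shuffled_order)), key=lambda i: shuffled_order[i])
    return 1.0 if list(predicted) == answer else 0.0
```

For example, if slots 0, 1, and 2 show original segments 2, 0, and 1, then `jigsaw_reward([1, 2, 0], [2, 0, 1])` gives 1.0, while any other ordering gives 0.0. A partial-credit reward would be an equally plausible design choice.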