VideoSSR: Video Self-Supervised Reinforcement Learning
November 9, 2025
Authors: Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, Yu Cheng
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has substantially
advanced the video understanding capabilities of Multimodal Large Language
Models (MLLMs). However, the rapid progress of MLLMs is outpacing the
complexity of existing video datasets, while the manual annotation of new,
high-quality data remains prohibitively expensive. This work investigates a
pivotal question: Can the rich, intrinsic information within videos be
harnessed to self-generate high-quality, verifiable training data? To answer
this question, we introduce three self-supervised pretext tasks: Anomaly
Grounding, Object Counting, and Temporal Jigsaw. We construct the Video
Intrinsic Understanding Benchmark (VIUBench) to validate their difficulty,
revealing that current state-of-the-art MLLMs struggle significantly on these
tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset
and propose VideoSSR, a novel video self-supervised reinforcement learning
framework for RLVR. Extensive experiments across 17 benchmarks, spanning four
major video domains (General Video QA, Long Video QA, Temporal Grounding, and
Complex Reasoning), demonstrate that VideoSSR consistently enhances model
performance, yielding an average improvement of over 5%. These results
establish VideoSSR as a potent foundational framework for developing more
advanced video understanding in MLLMs. The code is available at
https://github.com/lcqysl/VideoSSR.
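
The abstract does not spell out how a pretext task turns a raw video into a verifiable RLVR sample, so the following is a minimal illustrative sketch of one possibility for the Temporal Jigsaw task: clips are shuffled, the ground-truth ordering is recorded as the answer, and a rule-based exact-match reward is computed. The sample format, function names, and reward rule here are hypothetical and are not taken from the VideoSSR codebase.

import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class JigsawSample:
    presented_order: List[int]  # clip index shown at each presented position
    answer: List[int]           # presented positions listed in chronological order
    question: str

def make_temporal_jigsaw(num_clips: int, seed: Optional[int] = None) -> JigsawSample:
    # Shuffle the clip indices; the shuffle itself provides the verifiable label,
    # so no human annotation is required (hypothetical construction).
    rng = random.Random(seed)
    presented_order = list(range(num_clips))
    rng.shuffle(presented_order)
    # Ground truth: for each original clip 0..N-1, find the position where it was shown.
    answer = sorted(range(num_clips), key=lambda pos: presented_order[pos])
    question = (
        f"The video is split into {num_clips} clips and shown in shuffled order. "
        "List the presented clip positions in their original chronological order."
    )
    return JigsawSample(presented_order, answer, question)

def jigsaw_reward(predicted: List[int], sample: JigsawSample) -> float:
    # Rule-based, automatically verifiable reward for RLVR:
    # exact match on the recovered ordering (a stand-in for the paper's actual rule).
    return 1.0 if predicted == sample.answer else 0.0

# Example: build a 4-clip sample and verify that the correct ordering earns full reward.
sample = make_temporal_jigsaw(num_clips=4, seed=0)
assert jigsaw_reward(sample.answer, sample) == 1.0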