Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
October 31, 2025
Authors: Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
Spatial understanding remains a weakness of Large Vision-Language Models
(LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement
learning with verifiable rewards (RLVR) pipelines depend on costly supervision,
specialized tools, or constrained environments that limit scale. We introduce
Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals
directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically
formulates five pretext tasks that capture 2D and 3D spatial structure:
shuffled patch reordering, flipped patch recognition, cropped patch inpainting,
regional depth ordering, and relative 3D position prediction. These tasks
provide ground-truth answers that are easy to verify and require no human or
LVLM annotation. Training on our tasks substantially improves spatial reasoning
while preserving general visual capabilities. On seven spatial understanding
benchmarks in both image and video settings, Spatial-SSRL delivers average
accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our
results show that simple, intrinsic supervision enables RLVR at scale and
provides a practical route to stronger spatial intelligence in LVLMs.
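To make the self-supervised setup concrete, below is a minimal sketch (not the paper's released code) of how one of the five pretext tasks, shuffled patch reordering, can be built from an ordinary RGB image with a verifiable ground truth and a binary reward. The grid size, function names, and prompt wording are illustrative assumptions.

```python
# Sketch of a Spatial-SSRL-style pretext sample: cut an image into a grid of
# patches, shuffle them, and use the permutation itself as the verifiable
# answer, so no human or LVLM annotation is required.
import random
from PIL import Image


def make_shuffled_patch_sample(image_path: str, grid: int = 2, seed: int = 0):
    """Return (shuffled_image, question, ground_truth_permutation)."""
    rng = random.Random(seed)
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    pw, ph = w // grid, h // grid

    # Cut the image into grid x grid patches in row-major (reading) order.
    patches = [
        img.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
        for r in range(grid)
        for c in range(grid)
    ]

    # Sample a random permutation; this is the verifiable ground truth.
    perm = list(range(len(patches)))
    rng.shuffle(perm)

    # Paste patches back in shuffled order to build the query image.
    shuffled = Image.new("RGB", (pw * grid, ph * grid))
    for slot, src in enumerate(perm):
        r, c = divmod(slot, grid)
        shuffled.paste(patches[src], (c * pw, r * ph))

    question = (
        f"The image was cut into a {grid}x{grid} grid and the patches were "
        "shuffled. List the original index of each patch in reading order."
    )
    return shuffled, question, perm


def verify(prediction: list[int], ground_truth: list[int]) -> float:
    """Binary verifiable reward: 1.0 only if the full permutation matches."""
    return 1.0 if prediction == ground_truth else 0.0
```

The other four tasks (flipped patch recognition, cropped patch inpainting, regional depth ordering, relative 3D position prediction) follow the same pattern: the transformation applied to the RGB or RGB-D input defines the answer, and the reward reduces to an exact-match check suitable for RLVR.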