Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
October 31, 2025
Authors: Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
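To make the self-supervised setup concrete, below is a minimal sketch of how one of the five pretext tasks, shuffled patch reordering, could be built with a freely verifiable answer. This is an illustration under assumptions, not the authors' implementation: the grid size, the function names make_shuffled_patch_task and exact_match_reward, and the binary exact-match reward are hypothetical choices made for the example.

```python
import random
from PIL import Image


def make_shuffled_patch_task(image_path, grid=2, seed=None):
    """Build a shuffled-patch-reordering task from a plain RGB image.

    Returns the shuffled image and the ground-truth permutation; the
    permutation is a verifiable answer obtained as a by-product of the
    transformation, with no human or LVLM annotation.
    """
    rng = random.Random(seed)
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    pw, ph = w // grid, h // grid

    # Cut the image into grid x grid patches in row-major order.
    patches = [
        img.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
        for r in range(grid)
        for c in range(grid)
    ]

    # Sample a random permutation; this is the verifiable ground truth.
    perm = list(range(grid * grid))
    rng.shuffle(perm)

    # Paste the patches back in shuffled order.
    shuffled = Image.new("RGB", (pw * grid, ph * grid))
    for dst, src in enumerate(perm):
        r, c = divmod(dst, grid)
        shuffled.paste(patches[src], (c * pw, r * ph))

    return shuffled, perm


def exact_match_reward(predicted, perm):
    # Binary verifiable reward: 1 if the model recovers the permutation.
    return 1.0 if list(predicted) == list(perm) else 0.0
```

The same pattern plausibly extends to the other four tasks: each automatic transformation (flipping, cropping, depth ordering from RGB-D, relative 3D positions) yields its ground truth for free, so the RLVR reward reduces to a cheap comparison against that answer.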