

Learning Self-Correction in Vision-Language Models via Rollout Augmentation

February 9, 2026
Authors: Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang
cs.AI

Abstract

Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency through rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves state-of-the-art performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 points while requiring only 0.72× the training time per step.
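To make the rollout-recombination idea concrete, below is a minimal Python sketch of one plausible instantiation under stated assumptions: an incorrect rollout for a prompt is concatenated with a correct rollout for the same prompt to form a synthetic "attempt → correction" example, and a loss mask restricts supervision to the correction segment. The function name `build_correction_examples`, the correction cue string, and the character-level `loss_mask` are illustrative assumptions, not the paper's actual implementation; in practice the mask would be applied over tokens during the RL update.

```python
# Illustrative sketch only: the exact recombination and masking rules are not
# specified in this abstract, so the names and details below are assumptions.
from dataclasses import dataclass
from typing import List, Dict


@dataclass
class Rollout:
    prompt: str        # question (plus image reference) given to the VLM
    response: str      # sampled reasoning trace and final answer
    is_correct: bool   # verified against the reference answer


def build_correction_examples(rollouts: List[Rollout],
                              cue: str = "Wait, let me re-check my answer.") -> List[Dict]:
    """Recombine existing rollouts into dense self-correction examples:
    each incorrect attempt is paired with a correct rollout for the same
    prompt, yielding a synthetic 'first attempt -> correction' trajectory."""
    wrong = [r for r in rollouts if not r.is_correct]
    right = [r for r in rollouts if r.is_correct]
    examples = []
    for w in wrong:
        for r in right:
            if r.prompt != w.prompt:
                continue
            full = w.response + "\n" + cue + "\n" + r.response
            # Response mask: supervise only the cue and correction segment, so the
            # incorrect first attempt receives no learning signal that would
            # conflict with direct-reasoning updates (character-level here for
            # simplicity; a real implementation would mask tokens).
            mask = [0] * len(w.response) + [1] * (len(full) - len(w.response))
            examples.append({"prompt": w.prompt, "response": full, "loss_mask": mask})
    return examples
```

As a usage example, feeding a batch of verified rollouts through `build_correction_examples` turns the rare organically sampled self-corrections into a dense set of correction-labeled trajectories, which is the source of the sample-efficiency and supervision-balance benefits described above.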