

Learning Self-Correction in Vision-Language Models via Rollout Augmentation

February 9, 2026
Authors: Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang
cs.AI

Abstract

Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency through rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves state-of-the-art performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 points while requiring only 0.72× the training time per step.
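The core idea — recombining existing rollouts into dense self-correction examples — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Rollout` class, the `correction_cue` string, and the pairing scheme are all assumptions; the actual framework operates on tokenized rollouts inside an RLVR training loop.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    """One sampled response with its verifiable-reward outcome (assumed shape)."""
    response: str
    correct: bool

def synthesize_correction_examples(rollouts,
                                   correction_cue="Wait, let me re-check my answer."):
    """Recombine rollouts for the same prompt into self-correction examples.

    Each (incorrect, correct) pair yields one example: the erroneous attempt
    plus a correction cue forms the context, and the correct rollout is the
    target continuation. Under a response mask, only the target tokens would
    receive a learning signal, so the erroneous first attempt is conditioned
    on but never reinforced.
    """
    wrong = [r for r in rollouts if not r.correct]
    right = [r for r in rollouts if r.correct]
    examples = []
    for w in wrong:
        for c in right:
            context = w.response + "\n" + correction_cue + "\n"
            examples.append((context, c.response))
    return examples
```

Even a prompt with only one correct rollout among several failures yields multiple correction examples, which is why this recombination densifies an otherwise sparse self-correction signal.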