

Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

May 22, 2025
Authors: Jiaqi Wang, Kevin Qinghong Lin, James Cheng, Mike Zheng Shou
cs.AI

Abstract

Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, which increases token usage and computational cost. Inspired by the human thinking process, where people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts; this introduces a think-or-not format that serves as a cold start for selective reasoning; and (ii) a GRPO stage that enables the model to freely explore when to think and when to skip thinking, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce completion length by up to 90% compared to vanilla GRPO, without sacrificing performance, and in some cases even improving it. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties with both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches. Our code is available at https://github.com/kokolerk/TON.
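
To make the 'thought dropout' idea concrete, below is a minimal sketch of how the SFT-stage operation described in the abstract could be applied to a training response. It assumes responses follow a `<think>...</think><answer>...</answer>` format; the tag names, the empty-thought marker, and the dropout probability `p` are illustrative assumptions, not necessarily the paper's exact implementation (see the repository at https://github.com/kokolerk/TON for the authors' code).

```python
import random


def thought_dropout(response: str, p: float = 0.5) -> str:
    """Randomly replace the reasoning trace in an SFT response with an empty thought.

    Assumes the response is formatted as '<think>...</think><answer>...</answer>'.
    The tag format and dropout probability p are illustrative assumptions.
    """
    start = response.find("<think>")
    end = response.find("</think>")
    if start == -1 or end == -1 or random.random() >= p:
        # Keep the full reasoning trace for this example.
        return response
    end += len("</think>")
    # Replace the reasoning span with an empty thought, yielding the
    # "think-or-not" format used as a cold start for selective reasoning.
    return response[:start] + "<think>\n\n</think>" + response[end:]


# Example: roughly half of the SFT responses keep their reasoning trace,
# the rest are rewritten with an empty thought before fine-tuning.
sample = "<think>The object on the left is larger.</think><answer>left</answer>"
print(thought_dropout(sample, p=0.5))
```

Applying this to a fraction of the SFT data exposes the model to both "think" and "skip" formats, so the subsequent GRPO stage can freely explore when reasoning is worth the extra tokens.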

