
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

May 22, 2025
Authors: Jiaqi Wang, Kevin Qinghong Lin, James Cheng, Mike Zheng Shou
cs.AI

Abstract

Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, which increases token usage and computational cost. Inspired by human thinking, where people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts; this introduces a think-or-not format that serves as a cold start for selective reasoning; and (ii) a GRPO stage that lets the model freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce completion length by up to 90% compared to vanilla GRPO without sacrificing performance, and in some cases even improving it. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties with both 3B and 7B models, consistently show that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches. Our code is available at https://github.com/kokolerk/TON.
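To make the 'thought dropout' idea concrete, here is a minimal sketch of how SFT targets could be constructed so that the model sometimes sees a full reasoning trace and sometimes an empty thought. The tag format, the 50% dropout probability, and the helper names (thought_dropout, build_sft_target) are illustrative assumptions, not details taken from the paper; see the linked repository for the authors' actual implementation.

```python
import random

# Assumed formatting conventions; the paper's actual tags and dropout rate
# may differ (see https://github.com/kokolerk/TON for the authors' code).
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
EMPTY_THOUGHT = "\n\n"  # placeholder kept when the thought is dropped

def thought_dropout(reasoning_trace: str, dropout_prob: float = 0.5) -> str:
    """With probability dropout_prob, replace the reasoning trace with an
    empty thought, so the SFT target shows that skipping reasoning is valid."""
    thought = EMPTY_THOUGHT if random.random() < dropout_prob else reasoning_trace
    return f"{THINK_OPEN}{thought}{THINK_CLOSE}"

def build_sft_target(reasoning_trace: str, answer: str, dropout_prob: float = 0.5) -> str:
    """Assemble one SFT training target in the think-or-not format."""
    return f"{thought_dropout(reasoning_trace, dropout_prob)}\n<answer>{answer}</answer>"

if __name__ == "__main__":
    trace = "The mug on the left is red, so the answer is 'red'."
    print(build_sft_target(trace, "red"))
```

The point of mixing full traces and empty thoughts during SFT is to give the subsequent GRPO stage a cold start from which it can learn, via outcome rewards, when reasoning is actually worth the extra tokens.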
