GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
March 11, 2025
Authors: Tong Wei, Yijun Yang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye
cs.AI
Abstract
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively
scaled up chain-of-thought (CoT) reasoning in large language models (LLMs).
Yet, its efficacy in training vision-language model (VLM) agents for
goal-directed action reasoning in visual environments is less established. This
work investigates this problem through extensive experiments on complex card
games, such as 24 points, and embodied tasks from ALFWorld. We find that when
rewards are based solely on action outcomes, RL fails to incentivize CoT
reasoning in VLMs, instead leading to a phenomenon we term thought collapse,
characterized by a rapid loss of diversity in the agent's thoughts,
state-irrelevant and incomplete reasoning, and subsequent invalid actions,
resulting in negative rewards. To counteract thought collapse, we highlight the
necessity of process guidance and propose an automated corrector that evaluates
and refines the agent's reasoning at each RL step. This simple and scalable GTR
(Guided Thought Reinforcement) framework trains reasoning and action
simultaneously without the need for dense, per-step human labeling. Our
experiments demonstrate that GTR significantly enhances the performance and
generalization of the LLaVA-7b model across various visual environments,
achieving 3-5 times higher task success rates than SoTA models despite a
notably smaller model size.
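
To make the notion of a verifiable outcome reward concrete, the sketch below scores a proposed answer in the 24 points card game. It is an illustration only, not the reward used in the paper: the function name `outcome_reward` and the reward values (+1 for a correct solution, 0 for a legal but wrong expression, -1 for an invalid action) are assumptions. The reward looks only at the final expression, so it gives no signal about the quality of the reasoning that produced it, which is the setting the abstract describes.

```python
import ast
from collections import Counter
from fractions import Fraction

# Binary operators allowed in a 24 points expression.
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate a parsed expression exactly, recording each integer it uses."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def outcome_reward(cards, expression, target=24):
    """Hypothetical outcome-only reward; the values are illustrative assumptions."""
    try:
        tree = ast.parse(expression, mode="eval")
        used = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return -1.0  # invalid action: malformed or unsupported expression
    if Counter(used) != Counter(cards):
        return -1.0  # invalid action: must use each dealt card exactly once
    return 1.0 if value == target else 0.0


print(outcome_reward([4, 7, 8, 8], "(7-8/8)*4"))
print(outcome_reward([1, 2, 3, 4], "1+2+3+4"))
print(outcome_reward([1, 2, 3, 4], "4*6"))
```

Exact arithmetic with `Fraction` avoids floating-point error in solutions that pass through non-integer intermediate values, such as `8/(3-8/3)` for the cards 3, 3, 8, 8.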