Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities
May 21, 2025
Authors: Jinyang Wu, Chonghua Liao, Mingkuan Feng, Shuai Zhang, Zhengqi Wen, Pengpeng Shao, Huazhe Xu, Jianhua Tao
cs.AI
Abstract
Reinforcement learning (RL) has emerged as an effective method for training
reasoning models. However, existing RL approaches typically bias the model's
output distribution toward reward-maximizing paths without introducing external
knowledge. This limits their exploration capacity and results in a narrower
reasoning capability boundary compared to base models. To address this
limitation, we propose TAPO (Thought-Augmented Policy Optimization), a novel
framework that augments RL by incorporating external high-level guidance
("thought patterns"). By adaptively integrating structured thoughts during
training, TAPO effectively balances model-internal exploration and external
guidance exploitation. Extensive experiments show that our approach
significantly outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva
Math. Notably, these high-level thought patterns, abstracted from only 500
prior samples, generalize effectively across various tasks and models. This
highlights TAPO's potential for broader applications across multiple tasks and
domains. Our further analysis reveals that introducing external guidance
produces powerful reasoning models with superior explainability of inference
behavior and enhanced output readability.
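To make the abstract's core idea concrete, below is a minimal, hypothetical sketch of how external "thought patterns" might be mixed into GRPO-style rollout groups. It is not the authors' implementation: the names (`ThoughtLibrary`, `sample_completion`, `reward_fn`, `guided_fraction`) are placeholders, the retrieval is random rather than problem-matched, and the fixed guided fraction stands in for the paper's adaptive integration of structured thoughts during training.

```python
# Hypothetical sketch of thought-augmented GRPO rollouts (not the paper's code).
# A fraction of each rollout group is prompted with an external thought pattern
# (external guidance); the rest uses the plain prompt (internal exploration).
import random
from dataclasses import dataclass


@dataclass
class ThoughtPattern:
    name: str
    template: str  # high-level reasoning scaffold, e.g. "First decompose ..."


class ThoughtLibrary:
    """Tiny library of abstract thought patterns (the paper distills them from ~500 prior samples)."""

    def __init__(self, patterns):
        self.patterns = patterns

    def retrieve(self, problem: str) -> ThoughtPattern:
        # Placeholder retrieval: pick a random pattern. A real system would
        # select the pattern most relevant to the problem.
        return random.choice(self.patterns)


def sample_completion(prompt: str) -> str:
    """Stub for policy sampling; replace with an actual LLM call."""
    return f"<model output for: {prompt[:40]}...>"


def reward_fn(problem: str, completion: str) -> float:
    """Stub reward; a real setup would verify the final answer."""
    return random.random()


def grpo_group_advantages(rewards):
    """Group-relative advantages as in GRPO: (r - mean) / std within the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]


def thought_augmented_rollouts(problem, library, group_size=8, guided_fraction=0.5):
    """Sample a rollout group where some prompts are prefixed with an external
    thought pattern and the rest explore from the plain prompt."""
    completions, rewards = [], []
    n_guided = int(group_size * guided_fraction)
    for i in range(group_size):
        if i < n_guided:
            pattern = library.retrieve(problem)
            prompt = f"{pattern.template}\n\nProblem: {problem}"
        else:
            prompt = f"Problem: {problem}"
        out = sample_completion(prompt)
        completions.append(out)
        rewards.append(reward_fn(problem, out))
    return completions, grpo_group_advantages(rewards)


if __name__ == "__main__":
    lib = ThoughtLibrary([
        ThoughtPattern("decompose", "Break the problem into smaller subproblems and solve each."),
        ThoughtPattern("backward", "Work backward from what the final answer must satisfy."),
    ])
    outs, advs = thought_augmented_rollouts("Find the remainder of 7^100 mod 13.", lib)
    print(advs)
```

The point of the sketch is the balance the abstract describes: guided rollouts inject external high-level structure into the group, while unguided rollouts preserve the model's own exploration, and both are ranked together by the group-relative advantage.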