
Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities

May 21, 2025
作者: Jinyang Wu, Chonghua Liao, Mingkuan Feng, Shuai Zhang, Zhengqi Wen, Pengpeng Shao, Huazhe Xu, Jianhua Tao
cs.AI

Abstract

Reinforcement learning (RL) has emerged as an effective method for training reasoning models. However, existing RL approaches typically bias the model's output distribution toward reward-maximizing paths without introducing external knowledge. This limits their exploration capacity and results in a narrower reasoning capability boundary compared to base models. To address this limitation, we propose TAPO (Thought-Augmented Policy Optimization), a novel framework that augments RL by incorporating external high-level guidance ("thought patterns"). By adaptively integrating structured thoughts during training, TAPO effectively balances model-internal exploration and external guidance exploitation. Extensive experiments show that our approach significantly outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva Math. Notably, these high-level thought patterns, abstracted from only 500 prior samples, generalize effectively across various tasks and models. This highlights TAPO's potential for broader applications across multiple tasks and domains. Our further analysis reveals that introducing external guidance produces powerful reasoning models with superior explainability of inference behavior and enhanced output readability.
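
The abstract describes TAPO as augmenting a GRPO-style RL loop with externally abstracted "thought patterns" while balancing exploration and guidance. As a rough, hedged illustration of that idea only, the sketch below mixes guided and unguided rollouts within a single GRPO group and computes group-relative advantages. All names (`thought_patterns`, `build_prompt`, `fake_rollout`, `grpo_advantages`), the toy reward, and the 50/50 mixing rule are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch: injecting external "thought patterns" into some
# rollouts of a GRPO-style group, then computing group-relative advantages.
import random
import statistics

# High-level thought patterns abstracted from prior solved problems
# (the paper abstracts such patterns from ~500 prior samples).
thought_patterns = [
    "Decompose the problem into smaller subgoals before solving.",
    "Work backwards from the target quantity.",
    "Check the answer against a simpler special case.",
]

def build_prompt(question: str, pattern: str | None) -> str:
    """Optionally prepend an external thought pattern to guide the rollout."""
    if pattern is None:
        return question
    return f"Guidance: {pattern}\nQuestion: {question}"

def fake_rollout(prompt: str) -> tuple[str, float]:
    """Stand-in for sampling a completion and scoring it with a verifier."""
    reward = random.random() + (0.3 if "Guidance" in prompt else 0.0)
    return f"<completion for: {prompt[:30]}...>", reward

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: reward minus group mean, scaled by std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

question = "What is the sum of the first 100 positive integers?"
group = []
for i in range(8):  # one GRPO group of rollouts for a single question
    # Mix guided and unguided rollouts so external guidance is exploited
    # without suppressing the model's own exploration.
    pattern = random.choice(thought_patterns) if i % 2 == 0 else None
    completion, reward = fake_rollout(build_prompt(question, pattern))
    group.append((completion, reward))

advantages = grpo_advantages([reward for _, reward in group])
print(advantages)  # these would weight the policy-gradient update
```

In a real training loop the advantages would scale the token-level policy-gradient loss; the point of the sketch is only how abstracted guidance can be folded into a subset of rollouts rather than replacing on-policy sampling.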
