Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities
May 21, 2025
Authors: Jinyang Wu, Chonghua Liao, Mingkuan Feng, Shuai Zhang, Zhengqi Wen, Pengpeng Shao, Huazhe Xu, Jianhua Tao
cs.AI
Abstract
Reinforcement learning (RL) has emerged as an effective method for training
reasoning models. However, existing RL approaches typically bias the model's
output distribution toward reward-maximizing paths without introducing external
knowledge. This limits their exploration capacity and results in a narrower
reasoning capability boundary compared to base models. To address this
limitation, we propose TAPO (Thought-Augmented Policy Optimization), a novel
framework that augments RL by incorporating external high-level guidance
("thought patterns"). By adaptively integrating structured thoughts during
training, TAPO effectively balances model-internal exploration and external
guidance exploitation. Extensive experiments show that our approach
significantly outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva
Math. Notably, these high-level thought patterns, abstracted from only 500
prior samples, generalize effectively across various tasks and models. This
highlights TAPO's potential for broader applications across multiple tasks and
domains. Our further analysis reveals that introducing external guidance
produces powerful reasoning models with superior explainability of inference
behavior and enhanced output readability.
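To make the abstract's core idea concrete, below is a minimal, hypothetical sketch of how external "thought patterns" might be mixed into GRPO-style rollout groups. It is not the authors' implementation: the names (`ThoughtLibrary`, `sample_completion`, `reward_fn`, `guided_fraction`) are placeholders, the retrieval is random rather than problem-matched, and the fixed guided fraction stands in for the paper's adaptive integration of structured thoughts during training.

```python
# Hypothetical sketch of thought-augmented GRPO rollouts (not the paper's code).
# A fraction of each rollout group is prompted with an external thought pattern
# (external guidance); the rest uses the plain prompt (internal exploration).
import random
from dataclasses import dataclass


@dataclass
class ThoughtPattern:
    name: str
    template: str  # high-level reasoning scaffold, e.g. "First decompose ..."


class ThoughtLibrary:
    """Tiny library of abstract thought patterns (the paper distills them from ~500 prior samples)."""

    def __init__(self, patterns):
        self.patterns = patterns

    def retrieve(self, problem: str) -> ThoughtPattern:
        # Placeholder retrieval: pick a random pattern. A real system would
        # select the pattern most relevant to the problem.
        return random.choice(self.patterns)


def sample_completion(prompt: str) -> str:
    """Stub for policy sampling; replace with an actual LLM call."""
    return f"<model output for: {prompt[:40]}...>"


def reward_fn(problem: str, completion: str) -> float:
    """Stub reward; a real setup would verify the final answer."""
    return random.random()


def grpo_group_advantages(rewards):
    """Group-relative advantages as in GRPO: (r - mean) / std within the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]


def thought_augmented_rollouts(problem, library, group_size=8, guided_fraction=0.5):
    """Sample a rollout group where some prompts are prefixed with an external
    thought pattern and the rest explore from the plain prompt."""
    completions, rewards = [], []
    n_guided = int(group_size * guided_fraction)
    for i in range(group_size):
        if i < n_guided:
            pattern = library.retrieve(problem)
            prompt = f"{pattern.template}\n\nProblem: {problem}"
        else:
            prompt = f"Problem: {problem}"
        out = sample_completion(prompt)
        completions.append(out)
        rewards.append(reward_fn(problem, out))
    return completions, grpo_group_advantages(rewards)


if __name__ == "__main__":
    lib = ThoughtLibrary([
        ThoughtPattern("decompose", "Break the problem into smaller subproblems and solve each."),
        ThoughtPattern("backward", "Work backward from what the final answer must satisfy."),
    ])
    outs, advs = thought_augmented_rollouts("Find the remainder of 7^100 mod 13.", lib)
    print(advs)
```

The point of the sketch is the balance the abstract describes: guided rollouts inject external high-level structure into the group, while unguided rollouts preserve the model's own exploration, and both are ranked together by the group-relative advantage.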