Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities

Abstract

Reinforcement learning (RL) has emerged as an effective method for training reasoning models. However, existing RL approaches typically bias the model's output distribution toward reward-maximizing paths without introducing external knowledge. This limits their exploration capacity and yields a narrower reasoning capability boundary than that of base models. To address this limitation, we propose TAPO (Thought-Augmented Policy Optimization), a novel framework that augments RL with external high-level guidance ("thought patterns"). By adaptively integrating structured thoughts during training, TAPO balances model-internal exploration with the exploitation of external guidance. Extensive experiments show that our approach significantly outperforms GRPO, by 99% on AIME, 41% on AMC, and 17% on Minerva Math. Notably, these high-level thought patterns, abstracted from only 500 prior samples, generalize effectively across various tasks and models, which highlights TAPO's potential for broader application across multiple tasks and domains. Further analysis shows that introducing external guidance produces powerful reasoning models with more explainable inference behavior and more readable outputs.
