Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

Abstract

Reward-based alignment methods for large language models (LLMs) face two key limitations. The first is vulnerability to reward hacking, where models exploit flaws in the reward signal. The second is reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses both limitations by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes more stable policy optimization and greatly reduces the need for manual reward prompt design. MPO yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO remains effective across diverse tasks, such as question answering and mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and models will be publicly shared.
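
To make the interaction between the three components concrete, the following is a minimal toy sketch of the loop the abstract describes: a policy produces a response, a reward model scores it under its current prompt, and a meta-reward model periodically rewrites that prompt based on recent samples. Everything here is an assumption for illustration and not the authors' implementation: the class `MPOLoop`, the `refine_every` schedule, the callable signatures, and the toy stand-ins for the three models are all hypothetical, and the policy update itself is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical type aliases; the paper does not specify these interfaces.
Sample = Tuple[str, str, float]                       # (task, response, score)
Policy = Callable[[str], str]                         # task -> response
RewardModel = Callable[[str, str, str], float]        # (prompt, task, response) -> score
MetaRewardModel = Callable[[str, List[Sample]], str]  # (prompt, samples) -> refined prompt


@dataclass
class MPOLoop:
    policy: Policy
    reward_model: RewardModel
    meta_reward_model: MetaRewardModel
    reward_prompt: str
    refine_every: int = 2  # assumed schedule, chosen only for the demo
    history: List[Sample] = field(default_factory=list)
    prompt_log: List[str] = field(default_factory=list)

    def step(self, step_idx: int, task: str) -> float:
        response = self.policy(task)
        score = self.reward_model(self.reward_prompt, task, response)
        self.history.append((task, response, score))
        # A real system would update the policy with this score; omitted here.
        if (step_idx + 1) % self.refine_every == 0:
            # The meta-reward model inspects recent samples and rewrites the prompt.
            self.reward_prompt = self.meta_reward_model(self.reward_prompt, self.history)
            self.prompt_log.append(self.reward_prompt)
            self.history.clear()
        return score


# Toy stand-ins: the policy "hacks" a length-based reward by padding its answer.
def toy_policy(task: str) -> str:
    return task + " padding" * 5


def toy_reward_model(prompt: str, task: str, response: str) -> float:
    score = min(len(response) / 100.0, 1.0)  # flawed signal: longer is better
    if "penalize padding" in prompt and "padding" in response:
        score -= 0.5  # the refined prompt closes the loophole
    return score


def toy_meta_reward_model(prompt: str, samples: List[Sample]) -> str:
    exploited = any("padding" in resp and s > 0.3 for _, resp, s in samples)
    if exploited and "penalize padding" not in prompt:
        return prompt + " Also penalize padding."
    return prompt


loop = MPOLoop(toy_policy, toy_reward_model, toy_meta_reward_model,
               reward_prompt="Score helpfulness.")
scores = [loop.step(i, "Q") for i in range(4)]
```

In this toy run the padded answer scores well for the first two steps, the meta-reward model then amends the reward prompt, and the same answer scores lower afterwards. In an actual system each of the three callables would be an LLM call and the score would drive a policy-gradient update.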
