

Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

April 28, 2025
Authors: Zae Myung Kim, Chanwoo Park, Vipul Raheja, Dongyeop Kang
cs.AI

Abstract

Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes more stable policy optimization and greatly reduces the need for manual reward prompt design. It yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, such as question answering and mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and models will be publicly shared.
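
The abstract describes MPO's core mechanism only at a high level: an LLM reward model scores policy rollouts, and a meta-reward model periodically inspects the training context and rewrites the reward model's evaluation prompt. The sketch below illustrates that loop in Python. All interfaces (`policy`, `reward_llm`, `meta_reward_llm`, their methods, and the update schedule) are hypothetical placeholders, not the authors' released implementation.

```python
import random


def sample_batch(prompts, size=8):
    """Draw a batch of task prompts for one RL step."""
    return random.sample(prompts, min(size, len(prompts)))


def mpo_training_loop(policy, reward_llm, meta_reward_llm, prompts,
                      num_steps=1000, meta_interval=50):
    """Minimal sketch of an MPO-style loop (hypothetical interfaces)."""
    # Initial, hand-written rubric; MPO's point is that this prompt
    # does not need to stay fixed or be extensively engineered.
    reward_prompt = "Score the response from 1 to 10 for task quality."
    recent_rollouts = []

    for step in range(num_steps):
        # 1. Policy generates responses for a batch of task prompts.
        batch = sample_batch(prompts)
        responses = [policy.generate(p) for p in batch]

        # 2. The LLM reward model scores each response using the
        #    current evaluation prompt.
        rewards = [reward_llm.score(reward_prompt, p, r)
                   for p, r in zip(batch, responses)]

        # 3. Standard RL update (e.g., PPO) on the scored rollouts.
        policy.rl_update(batch, responses, rewards)
        recent_rollouts.extend(zip(batch, responses, rewards))

        # 4. The meta-reward model monitors the evolving training context
        #    and refines the reward prompt so the signal keeps resisting
        #    exploitation by the policy.
        if (step + 1) % meta_interval == 0:
            reward_prompt = meta_reward_llm.refine_prompt(
                current_prompt=reward_prompt,
                rollouts=recent_rollouts[-256:],  # recent context window
            )
            recent_rollouts.clear()

    return policy, reward_prompt
```

The outer (meta) update operates on the reward model's prompt rather than on model weights, which is what makes the reward signal adaptive without retraining the reward model itself.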
