
Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

April 28, 2025
Authors: Zae Myung Kim, Chanwoo Park, Vipul Raheja, Dongyeop Kang
cs.AI

Abstract

Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes a more stable policy optimization, and greatly reduces the need for manual reward prompt design. It yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, such as question answering and mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and models will be publicly shared.
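To make the mechanism concrete, the following is a minimal Python sketch of the training loop the abstract describes. It is illustrative only and not the authors' implementation: the `Policy`, `RewardModel`, and `MetaRewardModel` interfaces, the `ppo_update` callback, and the `refine_every` schedule are assumed placeholders standing in for a policy LLM, an LLM-as-judge reward model, an LLM meta-reward model, and an RL optimizer such as PPO.

```python
# Illustrative sketch of the MPO loop described above -- not the authors' code.
# Policy, RewardModel, MetaRewardModel, ppo_update, and refine_every are all
# hypothetical placeholders.
from typing import Callable, List, Protocol, Tuple


class Policy(Protocol):
    def generate(self, prompt: str) -> str: ...


class RewardModel(Protocol):
    # Scores a (task prompt, response) pair under a given evaluation prompt.
    def score(self, reward_prompt: str, task_prompt: str, response: str) -> float: ...


class MetaRewardModel(Protocol):
    # Rewrites the reward prompt after inspecting recent rollouts.
    def refine_prompt(self, current_prompt: str,
                      rollouts: List[Tuple[str, str, float]]) -> str: ...


def mpo_train(policy: Policy,
              reward_model: RewardModel,
              meta_reward_model: MetaRewardModel,
              ppo_update: Callable[[Policy, List[str], List[str], List[float]], None],
              task_batches: List[List[str]],
              initial_reward_prompt: str,
              refine_every: int = 10) -> None:
    """Run RL training while the meta-reward model evolves the reward prompt."""
    reward_prompt = initial_reward_prompt
    for step, batch in enumerate(task_batches, start=1):
        # 1. The policy produces responses for the current batch of task prompts.
        responses = [policy.generate(p) for p in batch]

        # 2. The reward model scores each response under the *current* reward prompt.
        rewards = [reward_model.score(reward_prompt, p, r)
                   for p, r in zip(batch, responses)]

        # 3. Standard RL update (e.g., PPO) of the policy against those rewards.
        ppo_update(policy, batch, responses, rewards)

        # 4. Periodically, the meta-reward model inspects recent rollouts and
        #    rewrites the reward prompt to patch loopholes the policy is exploiting.
        if step % refine_every == 0:
            reward_prompt = meta_reward_model.refine_prompt(
                reward_prompt, list(zip(batch, responses, rewards)))
```

The point mirrored from the abstract is that the reward prompt is mutable training state rather than a fixed, hand-crafted artifact: the meta-reward model keeps revising it from the evolving training context, which is what makes the reward signal harder for the policy to exploit.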
