Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

Abstract

Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal, and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes more stable policy optimization and greatly reduces the need for manual reward prompt design, yielding performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, such as question answering and mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and models will be publicly shared.
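To make the mechanism described in the abstract concrete, the following is a toy sketch of the control flow only: a policy is optimized against a reward model, and a meta-level step periodically inspects recent training samples and rewrites the reward model's prompt. It is not the authors' implementation. Every name here (`RewardModel`, `policy_step`, `meta_refine`, `train`, `CANDIDATES`) is hypothetical, both LLMs are replaced by hand-written rules, and the policy update is replaced by greedy selection over two fixed candidates.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Two fixed candidate responses (illustrative): an honest answer and a
# degenerate one that games a naive "longer is better" reward.
CANDIDATES = [
    "Paris is the capital of France.",
    "Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris",
]


@dataclass
class RewardModel:
    """Stand-in for an LLM judge; its 'prompt' is a list of rubric rules."""
    rubric: List[str] = field(default_factory=lambda: ["prefer longer answers"])

    def score(self, response: str) -> float:
        words = response.split()
        reward = float(len(words))  # naive criterion that can be exploited
        if "penalize repetition" in self.rubric:
            reward -= 2.0 * (len(words) - len(set(words)))
        return reward


def policy_step(rm: RewardModel) -> str:
    """Stand-in for a policy update: pick the highest-reward candidate."""
    return max(CANDIDATES, key=rm.score)


def meta_refine(rm: RewardModel, history: List[Tuple[str, float]]) -> None:
    """Stand-in for the meta-reward model: inspect recent samples and
    extend the rubric if a rewarded response is mostly repeated words."""
    for response, _ in history:
        words = response.split()
        mostly_repeated = len(set(words)) < len(words) / 2
        if mostly_repeated and "penalize repetition" not in rm.rubric:
            rm.rubric.append("penalize repetition")


def train(steps: int = 4, meta_every: int = 2) -> Tuple[List[str], List[str]]:
    rm = RewardModel()
    history: List[Tuple[str, float]] = []
    chosen: List[str] = []
    for step in range(1, steps + 1):
        response = policy_step(rm)
        history.append((response, rm.score(response)))
        chosen.append(response)
        if step % meta_every == 0:  # periodic meta-level refinement
            meta_refine(rm, history)
            history.clear()
    return chosen, rm.rubric


if __name__ == "__main__":
    chosen, rubric = train()
    for i, response in enumerate(chosen, start=1):
        print(i, response)
    print("final rubric:", rubric)
```

In this toy run the policy first settles on the repetitive response, the meta step then adds a repetition penalty to the rubric, and subsequent steps select the honest answer. A real system would replace the rule-based stubs with LLM calls and the greedy selection with an RL update.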
