Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models

Abstract

Large language models (LLMs) have demonstrated remarkable performance in reasoning tasks, where reinforcement learning (RL) serves as a key algorithm for enhancing their reasoning capabilities. Currently, there are two mainstream reward paradigms: model-based rewards and rule-based rewards. However, both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. To address these issues, we propose Cooper (Co-optimizing Policy Model and Reward Model), an RL framework that jointly optimizes both the policy model and the reward model. Cooper leverages the high precision of rule-based rewards when identifying correct responses, and dynamically constructs and selects positive-negative sample pairs for continued training of the reward model. This design enhances robustness and mitigates the risk of reward hacking. To further support Cooper, we introduce a hybrid annotation strategy that efficiently and accurately generates training data for the reward model. We also propose a reference-based reward modeling paradigm, where the reward model takes a reference answer as input. Based on this design, we train a reward model named VerifyRM, which achieves higher accuracy on VerifyBench compared to other models of the same size. We conduct reinforcement learning using both VerifyRM and Cooper. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance. For instance, it achieves a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct. Our findings demonstrate that dynamically updating the reward model is an effective way to combat reward hacking, providing a reference for better integrating reward models into RL.
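To make the pair-construction idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a positive sample is any policy response accepted by a rule-based check, which is the high-precision signal the abstract relies on. The abstract does not say how negative samples are obtained, so the sketch takes a caller-supplied `make_negative` function. All names here (`Rollout`, `rule_says_correct`, `format_rm_input`, `build_pair`) and the input template are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rollout:
    question: str
    reference: str  # reference answer supplied with the question
    response: str   # text sampled from the policy model


def rule_says_correct(rollout: Rollout) -> bool:
    """Hypothetical rule: the last non-empty line must equal the reference."""
    lines = [ln.strip() for ln in rollout.response.splitlines() if ln.strip()]
    return bool(lines) and lines[-1] == rollout.reference.strip()


def format_rm_input(rollout: Rollout) -> str:
    """Reference-based input: the reward model sees the reference answer too.

    The template is an assumption made for illustration only.
    """
    return (
        f"Question: {rollout.question}\n"
        f"Reference: {rollout.reference}\n"
        f"Response: {rollout.response}"
    )


def build_pair(
    rollouts: List[Rollout],
    is_correct: Callable[[Rollout], bool],
    make_negative: Callable[[Rollout], Rollout],
) -> Optional[Tuple[Rollout, Rollout]]:
    """Return one (positive, negative) pair, or None if no rollout passes."""
    positives = [r for r in rollouts if is_correct(r)]
    if not positives:
        return None  # nothing trustworthy to learn from in this batch
    positive = positives[0]
    # How negatives are produced is an assumption left to the caller.
    negative = make_negative(positive)
    return positive, negative
```

In this sketch a response that fails the rule is deliberately not treated as a negative: a rule with high precision but limited robustness can reject answers that are in fact correct, so rejection alone is weak evidence of an error.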
