The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason

Abstract

Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios in which LLMs are post-trained using reward models. We found that LLMs are strongly robust to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to converge rapidly, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, when we rewarded only the appearance of key reasoning phrases such as "first, I need to" (a reasoning pattern reward, RPR), without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM's performance on open-ended tasks. These findings suggest the importance of improving models' foundational abilities during the pre-training phase while providing insights for advancing post-training techniques. Our code and scripts are available at https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.
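The two reward manipulations described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that). It assumes binary 0/1 correctness rewards, and the names `flip_reward`, `reasoning_pattern_reward`, `REASONING_PHRASES`, `per_hit`, and `cap` are hypothetical. The only phrase taken from the abstract is "first, I need to"; the scoring scale is an assumption made for illustration.

```python
import random

# Hypothetical phrase list: only "first, I need to" is quoted in the abstract.
REASONING_PHRASES = ("first, i need to",)


def flip_reward(reward: int, flip_prob: float, rng: random.Random) -> int:
    """Simulate reward noise: flip a binary reward with probability flip_prob."""
    if reward not in (0, 1):
        raise ValueError("this sketch assumes binary 0/1 rewards")
    return 1 - reward if rng.random() < flip_prob else reward


def reasoning_pattern_reward(
    response: str,
    phrases=REASONING_PHRASES,
    per_hit: float = 0.5,  # assumed reward per phrase occurrence
    cap: float = 1.0,      # assumed upper bound on the total reward
) -> float:
    """Reward the appearance of reasoning phrases, ignoring answer correctness."""
    text = response.lower()
    hits = sum(text.count(phrase) for phrase in phrases)
    return min(cap, per_hit * hits)


if __name__ == "__main__":
    rng = random.Random(0)
    # With flip_prob=0.4, roughly 40% of correct answers receive reward 0.
    noisy = [flip_reward(1, 0.4, rng) for _ in range(10_000)]
    print("observed flip rate:", noisy.count(0) / len(noisy))
    print(reasoning_pattern_reward("First, I need to add 2 and 3."))
```

In this sketch the pattern reward never inspects the final answer, which mirrors the abstract's point that the signal comes from the reasoning process rather than from verified correctness.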

