Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models

Abstract

Recent developments in Large Language Models (LLMs) have shifted from pre-training scaling to post-training and test-time scaling. Across these developments, a key unified paradigm has arisen: Learning from Rewards, where reward signals act as the guiding stars to steer LLM behavior. It has underpinned a wide range of prevalent techniques, such as reinforcement learning (in RLHF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Crucially, this paradigm enables the transition from passive learning from static data to active learning from dynamic feedback, which endows LLMs with aligned preferences and deep reasoning capabilities. In this survey, we present a comprehensive overview of the paradigm of learning from rewards. We categorize and analyze the strategies under this paradigm across the training, inference, and post-inference stages. We further discuss the benchmarks for reward models and the primary applications. Finally, we highlight the challenges and future directions. We maintain a paper collection at https://github.com/bobxwu/learning-from-rewards-llm-papers.
