Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
May 5, 2025
Author: Xiaobao Wu
cs.AI
Abstract
Recent developments in Large Language Models (LLMs) have shifted from
pre-training scaling to post-training and test-time scaling. Across these
developments, a key unified paradigm has arisen: Learning from Rewards, where
reward signals act as the guiding stars to steer LLM behavior. It has
underpinned a wide range of prevalent techniques, such as reinforcement
learning (in RLHF, DPO, and GRPO), reward-guided decoding, and post-hoc
correction. Crucially, this paradigm enables the transition from passive
learning from static data to active learning from dynamic feedback. This endows
LLMs with aligned preferences and deep reasoning capabilities. In this survey,
we present a comprehensive overview of the paradigm of learning from rewards.
We categorize and analyze the strategies under this paradigm across training,
inference, and post-inference stages. We further discuss the benchmarks for
reward models and the primary applications. Finally, we highlight the challenges
and future directions. We maintain a paper collection at
https://github.com/bobxwu/learning-from-rewards-llm-papers.
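
As a quick illustration of one technique the abstract names, reward-guided decoding, the following is a minimal best-of-N reranking sketch in Python. The `generate` and `reward_model` callables are hypothetical placeholders for an LLM sampler and a scalar reward model; this is an assumed toy interface, not one defined by the survey.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],             # samples one candidate response (hypothetical)
    reward_model: Callable[[str, str], float],  # scores a (prompt, response) pair (hypothetical)
    n: int = 8,
) -> str:
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_model(prompt, response))
```

This best-of-N form is only one instance of reward-guided decoding; the survey also covers finer-grained variants that apply reward signals during generation rather than only after sampling complete responses.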