Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
May 5, 2025
Author: Xiaobao Wu
cs.AI
Abstract
Recent developments in Large Language Models (LLMs) have shifted from
pre-training scaling to post-training and test-time scaling. Across these
developments, a key unified paradigm has arisen: Learning from Rewards, where
reward signals act as the guiding stars to steer LLM behavior. It has
underpinned a wide range of prevalent techniques, such as reinforcement
learning (in RLHF, DPO, and GRPO), reward-guided decoding, and post-hoc
correction. Crucially, this paradigm enables the transition from passive
learning from static data to active learning from dynamic feedback. This endows
LLMs with aligned preferences and deep reasoning capabilities. In this survey,
we present a comprehensive overview of the paradigm of learning from rewards.
We categorize and analyze the strategies under this paradigm across training,
inference, and post-inference stages. We further discuss the benchmarks for
reward models and the primary applications. Finally, we highlight the challenges
and future directions. We maintain a paper collection at
https://github.com/bobxwu/learning-from-rewards-llm-papers.
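
As a quick illustration of one technique the abstract names, reward-guided decoding, the following is a minimal best-of-N reranking sketch in Python. The `generate` and `reward_model` callables are hypothetical placeholders for an LLM sampler and a scalar reward model; this is an assumed toy interface, not one defined by the survey.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],             # samples one candidate response (hypothetical)
    reward_model: Callable[[str, str], float],  # scores a (prompt, response) pair (hypothetical)
    n: int = 8,
) -> str:
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_model(prompt, response))
```

This best-of-N form is only one instance of reward-guided decoding; the survey also covers finer-grained variants that apply reward signals during generation rather than only after sampling complete responses.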