

Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models

May 5, 2025
Authors: Xiaobao Wu
cs.AI

Abstract

Recent developments in Large Language Models (LLMs) have shifted from pre-training scaling to post-training and test-time scaling. Across these developments, a key unified paradigm has arisen: Learning from Rewards, where reward signals act as the guiding stars to steer LLM behavior. This paradigm underpins a wide range of prevalent techniques, such as reinforcement learning (in RLHF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Crucially, it enables the transition from passive learning on static data to active learning from dynamic feedback, endowing LLMs with aligned preferences and deep reasoning capabilities. In this survey, we present a comprehensive overview of the paradigm of learning from rewards. We categorize and analyze the strategies under this paradigm across the training, inference, and post-inference stages. We further discuss benchmarks for reward models and the primary applications. Finally, we highlight the challenges and future directions. We maintain a paper collection at https://github.com/bobxwu/learning-from-rewards-llm-papers.
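As an illustrative sketch (not taken from the paper) of one technique the abstract names, reward-guided decoding, the snippet below shows simple best-of-N sampling at test time: sample several candidate responses and keep the one a reward model scores highest. The `generate` and `reward_model` callables are hypothetical placeholders for a language-model sampler and a reward model.

```python
from typing import Callable, List

def best_of_n_decode(
    generate: Callable[[str], str],             # hypothetical: prompt -> one sampled response
    reward_model: Callable[[str, str], float],  # hypothetical: (prompt, response) -> scalar reward
    prompt: str,
    n: int = 8,
) -> str:
    """Reward-guided decoding via best-of-N: return the highest-reward candidate."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]
```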

