Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models

Abstract

Recent developments in Large Language Models (LLMs) have shifted from pre-training scaling to post-training and test-time scaling. Across these developments, a key unifying paradigm has emerged: Learning from Rewards, in which reward signals act as guiding stars that steer LLM behavior. This paradigm underpins a wide range of prevalent techniques, such as reinforcement learning (in RLHF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Crucially, it enables the transition from passive learning from static data to active learning from dynamic feedback, which endows LLMs with aligned preferences and deep reasoning capabilities. In this survey, we present a comprehensive overview of the learning-from-rewards paradigm. We categorize and analyze the strategies under this paradigm across the training, inference, and post-inference stages. We further discuss benchmarks for reward models and the primary applications. Finally, we highlight the challenges and future directions. We maintain a paper collection at https://github.com/bobxwu/learning-from-rewards-llm-papers.
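As a rough illustration of how a reward signal can steer model output at inference time, the sketch below uses best-of-N selection: several candidate responses are scored by a reward function and the highest-scoring one is returned. Best-of-N is chosen here only for simplicity and is not named in the abstract. The names `best_of_n` and `toy_reward`, the hard-coded candidates, and the rule-based reward are all hypothetical stand-ins for a real sampler and a learned reward model.

```python
from typing import Callable, List


def best_of_n(candidates: List[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward (the first one wins ties)."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=reward_fn)


def toy_reward(response: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the response ends with "4", else 0.0.

    A real system would typically call a trained reward model or a verifier here.
    """
    return 1.0 if response.strip().endswith("4") else 0.0


# Hard-coded stand-ins for N responses sampled from an LLM (assumption).
candidates = ["2 + 2 = 5", "2 + 2 = 4", "I am not sure"]
best = best_of_n(candidates, toy_reward)
print(best)
```

The same scoring function could in principle be reused as a training signal or as a trigger for post-hoc correction. This sketch covers only the selection step.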
