Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities

Abstract

In the era of Large Language Models (LLMs), alignment has emerged as a fundamental yet challenging problem in the pursuit of more reliable, controllable, and capable machine intelligence. The recent success of reasoning models and conversational AI systems has underscored the critical role of reinforcement learning (RL) in enhancing these systems, driving increased research interest at the intersection of RL and LLM alignment. This paper provides a comprehensive review of recent advances in LLM alignment through the lens of inverse reinforcement learning (IRL), emphasizing the distinctions between RL techniques employed in LLM alignment and those in conventional RL tasks. In particular, we highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift. We begin by introducing fundamental concepts in RL to provide a foundation for readers unfamiliar with the field. We then examine recent advances in this research agenda, discussing key challenges and opportunities in conducting IRL for LLM alignment. Beyond methodological considerations, we explore practical aspects, including datasets, benchmarks, evaluation metrics, infrastructure, and computationally efficient training and inference techniques. Finally, we draw insights from the literature on sparse-reward RL to identify open questions and potential research directions. By synthesizing findings from diverse studies, we aim to provide a structured and critical overview of the field, highlight unresolved challenges, and outline promising future directions for improving LLM alignment through RL and IRL techniques.

