

Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities

July 17, 2025
Authors: Hao Sun, Mihaela van der Schaar
cs.AI

Abstract

In the era of Large Language Models (LLMs), alignment has emerged as a fundamental yet challenging problem in the pursuit of more reliable, controllable, and capable machine intelligence. The recent success of reasoning models and conversational AI systems has underscored the critical role of reinforcement learning (RL) in enhancing these systems, driving increased research interest at the intersection of RL and LLM alignment. This paper provides a comprehensive review of recent advances in LLM alignment through the lens of inverse reinforcement learning (IRL), emphasizing the distinctions between RL techniques employed in LLM alignment and those in conventional RL tasks. In particular, we highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift. We begin by introducing fundamental concepts in RL to provide a foundation for readers unfamiliar with the field. We then examine recent advances in this research agenda, discussing key challenges and opportunities in conducting IRL for LLM alignment. Beyond methodological considerations, we explore practical aspects, including datasets, benchmarks, evaluation metrics, infrastructure, and computationally efficient training and inference techniques. Finally, we draw insights from the literature on sparse-reward RL to identify open questions and potential research directions. By synthesizing findings from diverse studies, we aim to provide a structured and critical overview of the field, highlight unresolved challenges, and outline promising future directions for improving LLM alignment through RL and IRL techniques.
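
As a point of reference for the reward-modeling paradigm the abstract highlights (constructing neural reward models from human data), the sketch below fits a small reward network to pairwise human preference data with a Bradley-Terry style objective, the common starting point for IRL-flavored reward learning in LLM alignment. This is an illustration only, not code from the paper; all names (RewardModel, bradley_terry_loss, the toy embeddings) are hypothetical.

```python
# Minimal sketch (assumption, not from the paper): learn a scalar reward model
# from (chosen, rejected) response pairs via the Bradley-Terry preference loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Maps a fixed-size response embedding to a scalar reward."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # shape: (batch,) scalar rewards


def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): the preferred response should score higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = RewardModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Toy stand-ins for embeddings of human-preferred vs. rejected responses.
    chosen = torch.randn(256, 128) + 0.5
    rejected = torch.randn(256, 128)

    for step in range(200):
        loss = bradley_terry_loss(model(chosen), model(rejected))
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(f"final preference loss: {loss.item():.4f}")
```

In practice the embeddings would come from the LLM being aligned, and the learned reward would then drive RL fine-tuning; the sketch only isolates the preference-fitting step the abstract refers to.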