Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities
July 17, 2025
Authors: Hao Sun, Mihaela van der Schaar
cs.AI
Abstract
In the era of Large Language Models (LLMs), alignment has emerged as a
fundamental yet challenging problem in the pursuit of more reliable,
controllable, and capable machine intelligence. The recent success of reasoning
models and conversational AI systems has underscored the critical role of
reinforcement learning (RL) in enhancing these systems, driving increased
research interest at the intersection of RL and LLM alignment. This paper
provides a comprehensive review of recent advances in LLM alignment through the
lens of inverse reinforcement learning (IRL), emphasizing the distinctions
between RL techniques employed in LLM alignment and those in conventional RL
tasks. In particular, we highlight the necessity of constructing neural reward
models from human data and discuss the formal and practical implications of
this paradigm shift. We begin by introducing fundamental concepts in RL to
provide a foundation for readers unfamiliar with the field. We then examine
recent advances in this research agenda, discussing key challenges and
opportunities in conducting IRL for LLM alignment. Beyond methodological
considerations, we explore practical aspects, including datasets, benchmarks,
evaluation metrics, infrastructure, and computationally efficient training and
inference techniques. Finally, we draw insights from the literature on
sparse-reward RL to identify open questions and potential research directions.
By synthesizing findings from diverse studies, we aim to provide a structured
and critical overview of the field, highlight unresolved challenges, and
outline promising future directions for improving LLM alignment through RL and
IRL techniques.
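
To make the reward-modeling paradigm mentioned in the abstract concrete, below is a minimal, illustrative sketch (not taken from the paper) of learning a neural reward model from pairwise human preference data with a Bradley-Terry objective, the standard recipe behind RLHF-style reward modeling. All class and function names here are hypothetical, and random tensors stand in for LLM response embeddings.

```python
# Hedged sketch: train a scalar reward head on pairwise preference data
# using the Bradley-Terry log-likelihood. Names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Maps a pooled response embedding to a scalar reward."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Return one scalar reward per example.
        return self.head(x).squeeze(-1)


def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # P(chosen preferred) = sigmoid(r_chosen - r_rejected); minimize its negative log-likelihood.
    return -F.logsigmoid(r_chosen - r_rejected).mean()


if __name__ == "__main__":
    model = RewardModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    for step in range(100):
        # Random tensors stand in for embeddings of preferred / dispreferred responses.
        chosen_emb = torch.randn(32, 768)
        rejected_emb = torch.randn(32, 768)

        loss = bradley_terry_loss(model(chosen_emb), model(rejected_emb))
        opt.zero_grad()
        loss.backward()
        opt.step()
```

In practice the embeddings would come from the LLM itself (e.g., the final hidden state of the response), and the learned reward model would then supply the training signal for downstream RL fine-tuning of the policy.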