Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities

Abstract

In the era of Large Language Models (LLMs), alignment has emerged as a fundamental yet challenging problem in the pursuit of more reliable, controllable, and capable machine intelligence. The recent success of reasoning models and conversational AI systems has underscored the critical role of reinforcement learning (RL) in enhancing these systems, driving increased research interest at the intersection of RL and LLM alignment. This paper provides a comprehensive review of recent advances in LLM alignment through the lens of inverse reinforcement learning (IRL), emphasizing the distinctions between RL techniques employed in LLM alignment and those in conventional RL tasks. In particular, we highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift. We begin by introducing fundamental concepts in RL to provide a foundation for readers unfamiliar with the field. We then examine recent advances in this research agenda, discussing key challenges and opportunities in conducting IRL for LLM alignment. Beyond methodological considerations, we explore practical aspects, including datasets, benchmarks, evaluation metrics, infrastructure, and computationally efficient training and inference techniques. Finally, we draw insights from the literature on sparse-reward RL to identify open questions and potential research directions. By synthesizing findings from diverse studies, we aim to provide a structured and critical overview of the field, highlight unresolved challenges, and outline promising future directions for improving LLM alignment through RL and IRL techniques.
