Towards a Unified View of Large Language Model Post-Training

Abstract

Two major sources of training data exist for post-training modern language models: online (model-generated rollouts) data, and offline (human or other-model demonstrations) data. These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstration and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.
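
As a rough illustration of the four-part decomposition named above, the estimator can be read as a product of its parts. This is a sketch, not a quotation: the abstract gives no formula, so the multiplicative form and every symbol below are assumptions chosen for illustration.

```latex
% Assumed notation (not defined in the abstract):
%   \pi_\theta                     policy being trained
%   \pi_{\mathrm{ref}}             reference policy (the denominator)
%   \hat{A}                        advantage estimate
%   \mathbb{1}_{\mathrm{stable}}   stabilization mask (0 or 1)
\[
\mathrm{grad}_{\mathrm{uni}}
  = \mathbb{1}_{\mathrm{stable}}
    \cdot \frac{1}{\pi_{\mathrm{ref}}}
    \cdot \hat{A}
    \cdot \nabla_\theta \pi_\theta
\]

% Special case under these assumptions: choose \pi_{\mathrm{ref}} = \pi_\theta,
% \hat{A} = 1, and \mathbb{1}_{\mathrm{stable}} = 1. The product reduces to the
% log-likelihood gradient by the identity
\[
\frac{1}{\pi_\theta} \, \nabla_\theta \pi_\theta = \nabla_\theta \log \pi_\theta
\]
```

Under these assumptions, taking the current policy as the reference with unit advantage and no masking recovers the log-likelihood gradient that SFT follows on demonstration data. This is one way to see how SFT and RL could arise as instances of a single estimator.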
