GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Abstract

Large language models are typically post-trained with supervised fine-tuning (SFT) and reinforcement learning (RL), yet unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting. Together, these two factors lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms. The first, Group Advantage Learning, constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity. The second, Dynamic Coefficient Rectification, adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
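
The claim that SFT is a special case of policy gradient optimization can be checked numerically on a toy problem. The sketch below is not taken from the paper. It assumes a single-step softmax policy over three candidate responses, and every function name in it is hypothetical. It compares the gradient of the SFT loss, -log pi(y*), with a policy gradient whose reward is the indicator 1[y = y*] (nonzero for exactly one response, hence sparse) and whose weight is 1/pi(y), the inverse-probability term.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, y):
    # Gradient of log softmax(logits)[y] w.r.t. the logits: onehot(y) - probs.
    return [(1.0 if k == y else 0.0) - p for k, p in enumerate(probs)]

def sft_gradient(logits, target):
    # Gradient of the SFT loss -log pi(target) w.r.t. the logits.
    probs = softmax(logits)
    return [-g for g in grad_log_prob(probs, target)]

def policy_gradient_form(logits, target):
    # Exact expectation over y ~ pi of
    #   -(1 / pi(y)) * 1[y == target] * grad log pi(y).
    # Assumption: toy single-step policy, so the expectation is a plain sum.
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, p in enumerate(probs):
        reward = 1.0 if y == target else 0.0  # sparse implicit reward
        weight = 1.0 / p                      # inverse-probability weight
        score = grad_log_prob(probs, y)
        for k in range(len(grad)):
            grad[k] -= p * weight * reward * score[k]
    return grad

logits = [2.0, 0.5, -3.0]  # hypothetical logits; response 2 is unlikely
target = 2
g_sft = sft_gradient(logits, target)
g_pg = policy_gradient_form(logits, target)
inverse_weight = 1.0 / softmax(logits)[target]
```

Under these assumptions the two gradients agree up to floating-point error. The inverse-probability weight grows without bound as the policy assigns less probability to the reference response, which is the instability the abstract says Dynamic Coefficient Rectification is meant to bound.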
