On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models

Abstract

Entropy serves as a critical metric for measuring the diversity of outputs generated by large language models (LLMs), providing valuable insights into their exploration capabilities. While recent studies increasingly focus on monitoring and adjusting entropy to better balance exploration and exploitation in reinforcement fine-tuning (RFT), a principled understanding of entropy dynamics during this process has yet to be thoroughly investigated. In this paper, we establish a theoretical framework for analyzing entropy dynamics during the RFT process. The framework begins with a discriminant expression that quantifies the entropy change under a single logit update. This foundation enables the derivation of a first-order expression for the entropy change, which can be further extended to the update formula of Group Relative Policy Optimization (GRPO). The corollaries and insights drawn from the theoretical analysis inspire the design of entropy control methods and offer a unified lens for interpreting various entropy-based methods in existing studies. We provide empirical evidence to support the main conclusions of our analysis and demonstrate the effectiveness of the derived entropy-discriminator clipping methods. This study yields novel insights into RFT training dynamics, providing theoretical support and practical strategies for optimizing the exploration-exploitation balance during LLM fine-tuning.
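To make the idea of an entropy change under a single logit update concrete, here is a minimal sketch in Python. It relies only on the standard softmax identity dH/dz_k = -p_k (log p_k + H) and is not taken from the paper. The paper's discriminant may be defined or normalized differently, and the function names and example logits are illustrative assumptions.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def first_order_entropy_change(logits, k, delta):
    # First-order estimate of the entropy change when logit k moves by delta,
    # using dH/dz_k = -p_k * (log p_k + H).
    probs = softmax(logits)
    h = entropy(probs)
    return -delta * probs[k] * (math.log(probs[k]) + h)

def exact_entropy_change(logits, k, delta):
    # Exact entropy change, obtained by recomputing the softmax.
    bumped = list(logits)
    bumped[k] += delta
    return entropy(softmax(bumped)) - entropy(softmax(logits))

if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical next-token logits
    delta = 1e-3                    # hypothetical small logit update
    for k in range(len(logits)):
        approx = first_order_entropy_change(logits, k, delta)
        exact = exact_entropy_change(logits, k, delta)
        print(k, approx, exact)
```

Under this identity, the sign of log p_k + H decides the direction of the change. Raising the logit of a token whose surprisal -log p_k is below the current entropy lowers the entropy, while raising the logit of a less likely token increases it.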
