Angles Don't Lie: Unlocking Training-Efficient RL Through the Model's Own Signals

Abstract

Current reinforcement fine-tuning (RFT) paradigms for large language models (LLMs) are sample-inefficient because uniform data sampling repeatedly exposes the model to identical queries. Previous work has explored curriculum learning based on heuristic difficulty metrics, but these strategies neglect the intrinsic learning signals generated by the model itself and therefore lead to suboptimal training regimes. In this paper, we identify a model-inherent signal, termed angle concentration, that effectively reflects an LLM's capacity to learn from specific data. We theoretically and empirically demonstrate a correlation between the angular distribution of token hidden state vectors and the resulting gradient, revealing a learning preference for data exhibiting higher angle concentration. Inspired by this finding, we propose GAIN-RL, a Gradient-driven Angle-Informed Navigated RL framework. By leveraging the model's intrinsic angle concentration signal, GAIN-RL dynamically selects training data in each epoch, ensuring consistently impactful gradient updates and thus significantly enhancing overall training efficiency. Empirical evaluations show that GAIN-RL (GRPO) achieves over a 2.5x acceleration in training efficiency across diverse mathematical and coding tasks and varying model scales. Furthermore, the efficient sampling of GAIN-RL (GRPO) yields data-efficient training: with half of the original data, it achieves better performance than vanilla GRPO trained on the full dataset. Code is released at https://github.com/wangqinsi1/GAINRL/tree/main.
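
To make the idea of an angle-based signal concrete, the following is a minimal sketch and not the paper's implementation. It assumes, purely for illustration, that angle concentration can be approximated by the mean pairwise cosine similarity between the token hidden state vectors of one sample, and that samples are then ordered from most to least concentrated. The function names `angle_concentration` and `rank_by_concentration` are hypothetical; the abstract does not specify the exact formula, the layer used, or the sampling schedule.

```python
import numpy as np


def angle_concentration(hidden_states: np.ndarray) -> float:
    """Illustrative proxy (an assumption, not the paper's definition):
    mean pairwise cosine similarity between token hidden-state vectors.

    hidden_states has shape (num_tokens, hidden_dim).
    """
    num_tokens = hidden_states.shape[0]
    if num_tokens < 2:
        # A single token has no pairs; treat it as fully concentrated.
        return 1.0
    # Normalize each token vector to unit length, guarding against zeros.
    norms = np.linalg.norm(hidden_states, axis=1, keepdims=True)
    unit = hidden_states / np.clip(norms, 1e-12, None)
    # Cosine similarity between every pair of tokens.
    cos = unit @ unit.T
    # Average over off-diagonal entries only (exclude self-similarity).
    off_diag_sum = cos.sum() - np.trace(cos)
    return float(off_diag_sum / (num_tokens * (num_tokens - 1)))


def rank_by_concentration(samples: list[np.ndarray]) -> list[int]:
    """Return sample indices sorted from highest to lowest proxy score."""
    scores = [angle_concentration(h) for h in samples]
    return sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
```

Under this assumed proxy, a sample whose token vectors all point in the same direction scores close to 1, while a sample with mutually orthogonal token vectors scores close to 0. The released repository linked above is the authoritative reference for how GAIN-RL actually computes and uses the signal.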
