LSPO: Length-aware Dynamic Sampling for Policy Optimization in LLM Reasoning

Abstract

Since the release of DeepSeek-R1, reinforcement learning with verifiable rewards (RLVR) has become a central approach for training large language models (LLMs) on reasoning tasks. Recent work has largely focused on modifying loss functions to make RLVR more efficient and effective. In this paper, motivated by studies of overthinking in LLMs, we propose Length-aware Sampling for Policy Optimization (LSPO), a novel meta-RLVR algorithm that dynamically selects training data at each step based on the average response length. We evaluate LSPO across multiple base models and datasets, demonstrating that it consistently improves learning effectiveness. In addition, we conduct a detailed ablation study to examine alternative ways of incorporating length signals into dynamic sampling, offering further insights and highlighting promising directions for future research.