Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
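The routing idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not give the loss formulas, so several pieces here are assumptions: the helper names (`group_relative_advantages`, `entropy_weights`, `routed_loss`) are hypothetical, the GRPO branch is reduced to an unclipped advantage-weighted log-likelihood term, the SDPO branch is modeled as a per-token KL divergence from teacher to student, the entropy weight is taken as one minus the normalized teacher entropy, and the self-teacher logits are simply passed in as inputs.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style advantage)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def entropy_weights(teacher_logp):
    """ASSUMED weighting: 1 - H(teacher) / log(V), so values lie in [0, 1].

    Confident (low-entropy) teacher tokens get weights near 1,
    near-uniform (high-entropy) ones get weights near 0.
    """
    p = np.exp(teacher_logp)
    entropy = -(p * teacher_logp).sum(axis=-1)
    return 1.0 - entropy / np.log(teacher_logp.shape[-1])


def routed_loss(rewards, token_logp, student_logits, teacher_logits):
    """Route each rollout in a group to exactly one of two loss terms.

    rewards:        (G,)      binary verifiable rewards for G rollouts
    token_logp:     (G, T)    policy log-probs of the sampled tokens
    student_logits: (G, T, V) policy logits at each position
    teacher_logits: (G, T, V) self-teacher logits (treated as given)
    Returns (scalar loss, per-sample losses of shape (G,)).
    """
    rewards = np.asarray(rewards, dtype=float)
    correct = rewards > 0.5

    # Branch 1 (correct samples): simplified policy-gradient surrogate.
    adv = group_relative_advantages(rewards)
    pg = -(adv[:, None] * token_logp).mean(axis=1)

    # Branch 2 (failed samples): entropy-weighted token-level KL(teacher || student).
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    distill = (entropy_weights(t_logp) * kl).mean(axis=1)

    # Sample routing: each rollout contributes only its own branch.
    per_sample = np.where(correct, pg, distill)
    return per_sample.mean(), per_sample
```

Calling `routed_loss` with a group such as `rewards = [1, 0, 1, 0]` applies the reinforcement term to rollouts 0 and 2 and the distillation term to rollouts 1 and 3. A production version would operate on differentiable tensors and would likely include the importance ratio and clipping that GRPO normally uses, both omitted here for brevity.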
