N-GRPO: Enhanced Policy Optimization via Embedding-Level Neighbor Mixing

Abstract

The success of large language models (LLMs) in mathematical reasoning depends heavily on generating diverse yet valid solution paths during the rollout phase. Current rollout techniques, however, face a fundamental trade-off: token-level sampling often yields redundant trajectories that differ only in phrasing, while embedding-level methods that inject random noise frequently disrupt semantic consistency. To resolve this, we introduce N-GRPO, a novel exploration strategy integrated into the Group Relative Policy Optimization (GRPO) framework. Rather than relying on token-level sampling or random embedding-level noise, our approach uses Semantic Neighbor Mixing. This mechanism dynamically constructs input representations by mixing the embedding of an anchor token with those of its nearest semantic neighbors, thereby injecting diversity while strictly adhering to the local semantic manifold. Experiments on DeepSeek-R1-Distill-Qwen models of different sizes show that N-GRPO achieves consistent improvements over strong baselines on mathematical reasoning benchmarks and generalizes robustly to out-of-distribution tasks.
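To make the idea of mixing an anchor token's embedding with its nearest semantic neighbors concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the similarity measure, the number of neighbors, or the mixing weights, so cosine similarity, the parameters `k` and `alpha`, the uniform neighbor weighting, and all function names here are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(table, anchor_id, k):
    """Ids of the k tokens most similar to the anchor, excluding the anchor itself."""
    anchor = table[anchor_id]
    others = [i for i in range(len(table)) if i != anchor_id]
    # Assumption: "semantic neighbor" means highest cosine similarity.
    others.sort(key=lambda i: cosine(anchor, table[i]), reverse=True)
    return others[:k]

def mix_with_neighbors(table, anchor_id, k=2, alpha=0.8):
    """Convex mix: weight alpha on the anchor, (1 - alpha) split evenly over k neighbors.

    Both alpha and the even split are hypothetical choices, not taken from the paper.
    """
    neighbors = nearest_neighbors(table, anchor_id, k)
    mixed = [alpha * x for x in table[anchor_id]]
    for n in neighbors:
        w = (1.0 - alpha) / len(neighbors)
        for d, x in enumerate(table[n]):
            mixed[d] += w * x
    return mixed

# Toy 2-D embedding table with made-up values: tokens 0-2 cluster together, token 3 is far away.
toy_table = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [-1.0, 0.2]]
mixed = mix_with_neighbors(toy_table, anchor_id=0, k=2, alpha=0.8)
```

Because the result is a convex combination of the anchor and vectors that are already close to it, the mixed embedding stays inside the anchor's neighborhood, which is the property the abstract contrasts with unconstrained random noise. Setting `alpha=1.0` recovers the anchor embedding unchanged.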