N-GRPO: Augmented Policy Optimization via Embedding-Level Neighbor Mixing

Abstract

The success of large language models (LLMs) in mathematical reasoning relies heavily on generating diverse and valid solution paths during the rollout phase. However, current rollout techniques face a fundamental trade-off: token-level sampling often yields redundant trajectories that differ only in phrasing, while embedding-level methods that inject random noise frequently disrupt semantic consistency. To resolve this, we introduce N-GRPO, a novel exploration strategy integrated into the Group Relative Policy Optimization (GRPO) framework. Rather than relying on token-level sampling or naive embedding-level noise, our approach leverages Semantic Neighbor Mixing. This mechanism dynamically constructs input representations by mixing the embedding of an anchor token with the embeddings of its nearest semantic neighbors, thereby injecting diversity while strictly adhering to the local semantic manifold. Experimental evaluations on DeepSeek-R1-Distill-Qwen models of different sizes show that N-GRPO not only consistently outperforms strong baselines on mathematical reasoning benchmarks but also generalizes robustly to out-of-distribution tasks.
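
The idea behind Semantic Neighbor Mixing can be illustrated with a minimal sketch. The abstract does not state the similarity metric, the number of neighbors, or how the mixing weights are chosen, so the cosine similarity, the parameters `k` and `alpha`, the random convex weights, the toy embedding table, and all function names below are assumptions made for illustration only, not the authors' implementation.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(table, anchor_id, k):
    """Ids of the k tokens most similar to the anchor, excluding the anchor itself."""
    anchor = table[anchor_id]
    scored = [(cosine(anchor, vec), tok)
              for tok, vec in enumerate(table) if tok != anchor_id]
    scored.sort(reverse=True)  # highest similarity first
    return [tok for _, tok in scored[:k]]


def mix_with_neighbors(table, anchor_id, k=2, alpha=0.8, rng=None):
    """Convex mix of the anchor embedding with its k nearest neighbors.

    Assumption: the anchor keeps weight `alpha`, and the remaining
    (1 - alpha) is split randomly across the neighbors, so the result
    stays inside the convex hull of the anchor and its neighbors.
    """
    rng = rng or random.Random()
    neighbors = nearest_neighbors(table, anchor_id, k)
    raw = [rng.random() + 1e-6 for _ in neighbors]  # strictly positive draws
    total = sum(raw)
    weights = [(1.0 - alpha) * r / total for r in raw]

    mixed = [alpha * x for x in table[anchor_id]]
    for w, tok in zip(weights, neighbors):
        for d, x in enumerate(table[tok]):
            mixed[d] += w * x
    return mixed, neighbors


# Hypothetical 3-dimensional embedding table with four tokens.
toy_table = [
    [1.0, 0.0, 0.0],  # token 0: the anchor
    [0.9, 0.1, 0.0],  # token 1: close to token 0
    [0.8, 0.0, 0.2],  # token 2: close to token 0
    [0.0, 1.0, 0.0],  # token 3: unrelated direction
]

mixed, neighbors = mix_with_neighbors(
    toy_table, anchor_id=0, k=2, alpha=0.8, rng=random.Random(0)
)
```

Because the mixed vector is a convex combination of the anchor and its closest neighbors, it differs from the anchor across rollouts while remaining near it in embedding space, which is the contrast with unconstrained random noise that the abstract draws.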