Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward

Abstract

Group Relative Policy Optimization (GRPO) enhances policy learning by computing gradients from relative comparisons among candidate outputs that share a common input prefix. Despite its effectiveness, GRPO incurs substantial computational overhead when the shared prefix is long, because the prefix must be redundantly encoded for each group member. This inefficiency becomes a major scalability bottleneck in long-context learning scenarios. We propose Prefix Grouper, an efficient GRPO training algorithm that eliminates redundant prefix computation via a Shared-Prefix Forward strategy. In particular, by restructuring self-attention into two parts, our method encodes the shared prefix only once while preserving full differentiability and compatibility with end-to-end training. We provide both theoretical and empirical evidence that Prefix Grouper is training-equivalent to standard GRPO: it yields identical forward outputs and backward gradients, ensuring that the optimization dynamics and final policy performance remain unchanged. Our experiments confirm that Prefix Grouper achieves consistent results while significantly reducing the computational cost of training, particularly in long-prefix scenarios. The proposed method is fully plug-and-play: it is compatible with existing GRPO-based architectures and can be integrated into current training pipelines as a drop-in replacement, requiring no structural modifications and only minimal changes to input construction and attention computation. Prefix Grouper enables the use of larger group sizes under the same computational budget, thereby improving the scalability of GRPO to more complex tasks and larger models. Code is available at https://github.com/johncaged/PrefixGrouper
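The abstract does not spell out how self-attention is restructured into two parts. The sketch below illustrates one plausible reading, stated here as an assumption: the prefix runs causal self-attention once, and each group member's suffix queries then attend over the shared prefix keys/values concatenated with their own. It is a single-head, single-layer toy in pure Python, not the authors' implementation. All function names are hypothetical, and the random vectors stand in for already-projected queries, keys, and values. It checks only that the forward outputs match the baseline that re-encodes the prefix for every group member; it does not demonstrate the gradient equivalence.

```python
import math
import random


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(a * b for a, b in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def causal_attention(q, k, v):
    """Token i attends to tokens 0..i of the same sequence."""
    return [attend(q[i], k[: i + 1], v[: i + 1]) for i in range(len(q))]


def baseline_forward(prefix, suffixes):
    """Standard layout: the prefix is re-encoded for every group member."""
    outputs = []
    for suffix in suffixes:
        q, k, v = (prefix[name] + suffix[name] for name in ("q", "k", "v"))
        outputs.append(causal_attention(q, k, v))
    return outputs


def shared_prefix_forward(prefix, suffixes):
    """Assumed two-part split: prefix self-attention once, then suffix attention."""
    prefix_out = causal_attention(prefix["q"], prefix["k"], prefix["v"])
    suffix_outs = []
    for suffix in suffixes:
        out = []
        for j, query in enumerate(suffix["q"]):
            # Suffix token j sees the whole prefix plus suffix tokens 0..j.
            keys = prefix["k"] + suffix["k"][: j + 1]
            values = prefix["v"] + suffix["v"][: j + 1]
            out.append(attend(query, keys, values))
        suffix_outs.append(out)
    return prefix_out, suffix_outs


def random_tokens(rng, length, dim):
    """Hypothetical stand-in for already-projected q/k/v vectors."""
    def make():
        return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(length)]
    return {"q": make(), "k": make(), "v": make()}


def max_abs_diff(a, b):
    """Largest element-wise gap between two lists of vectors."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def demo(prefix_len=6, suffix_len=3, group_size=4, dim=8, seed=0):
    """Return the worst mismatch between the baseline and shared-prefix outputs."""
    rng = random.Random(seed)
    prefix = random_tokens(rng, prefix_len, dim)
    suffixes = [random_tokens(rng, suffix_len, dim) for _ in range(group_size)]
    full = baseline_forward(prefix, suffixes)
    prefix_out, suffix_outs = shared_prefix_forward(prefix, suffixes)
    worst = 0.0
    for full_out, suffix_out in zip(full, suffix_outs):
        worst = max(worst, max_abs_diff(full_out[:prefix_len], prefix_out))
        worst = max(worst, max_abs_diff(full_out[prefix_len:], suffix_out))
    return worst


if __name__ == "__main__":
    print(demo() < 1e-12)
```

In this toy, the baseline computes the prefix rows once per group member, while the shared-prefix path computes them once in total. That gap is where the saving described in the abstract would come from as the prefix length or group size grows.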

