Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward
June 5, 2025
Authors: Zikang Liu, Tongtian Yue, Yepeng Tang, Longteng Guo, Junxian Cai, Qingbin Liu, Xi Chen, Jing Liu
cs.AI
Abstract
Group Relative Policy Optimization (GRPO) enhances policy learning by
computing gradients from relative comparisons among candidate outputs that
share a common input prefix. Despite its effectiveness, GRPO introduces
substantial computational overhead when processing long shared prefixes, which
must be redundantly encoded for each group member. This inefficiency becomes a
major scalability bottleneck in long-context learning scenarios. We propose
Prefix Grouper, an efficient GRPO training algorithm that eliminates redundant
prefix computation via a Shared-Prefix Forward strategy. In particular, by
restructuring self-attention into two parts, our method enables the shared
prefix to be encoded only once, while preserving full differentiability and
compatibility with end-to-end training. We provide both theoretical and
empirical evidence that Prefix Grouper is training-equivalent to standard GRPO:
it yields identical forward outputs and backward gradients, ensuring that the
optimization dynamics and final policy performance remain unchanged.
Empirically, our experiments confirm that Prefix Grouper achieves consistent
results while significantly reducing the computational cost of training,
particularly in long-prefix scenarios. The proposed method is fully
plug-and-play: it is compatible with existing GRPO-based architectures and can
be seamlessly integrated into current training pipelines as a drop-in
replacement, requiring no structural modifications and only minimal changes to
input construction and attention computation. Prefix Grouper enables the use of
larger group sizes under the same computational budget, thereby improving the
scalability of GRPO to more complex tasks and larger models. Code is now
available at https://github.com/johncaged/PrefixGrouper
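The key idea of the Shared-Prefix Forward, splitting self-attention into a prefix part and a suffix part while producing bitwise-equivalent results, can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal single-head example with causal masking omitted, showing that attention scores over the shared prefix and over the per-member suffix can be computed separately and merged under one joint softmax, so the prefix K/V only need to be produced once per group:

```python
import numpy as np

rng = np.random.default_rng(0)
d, prefix_len, suffix_len = 8, 16, 4

# Shared-prefix keys/values: in the shared-prefix forward these are
# encoded once and reused by every group member (names illustrative).
k_p = rng.standard_normal((prefix_len, d))
v_p = rng.standard_normal((prefix_len, d))
# Per-member suffix queries/keys/values (one group member shown).
q_s = rng.standard_normal((suffix_len, d))
k_s = rng.standard_normal((suffix_len, d))
v_s = rng.standard_normal((suffix_len, d))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# --- Baseline: one attention over the concatenated [prefix; suffix] K/V,
# i.e. what a standard forward computes after re-encoding the prefix. ---
k_full = np.concatenate([k_p, k_s])
v_full = np.concatenate([v_p, v_s])
out_full = softmax(q_s @ k_full.T / np.sqrt(d)) @ v_full

# --- Two-part attention: prefix and suffix scores computed separately,
# then merged under a single joint softmax (log-sum-exp combination). ---
s_p = q_s @ k_p.T / np.sqrt(d)   # suffix queries vs. shared-prefix keys
s_s = q_s @ k_s.T / np.sqrt(d)   # suffix queries vs. suffix keys
m = np.maximum(s_p.max(-1, keepdims=True), s_s.max(-1, keepdims=True))
a_p, a_s = np.exp(s_p - m), np.exp(s_s - m)
denom = a_p.sum(-1, keepdims=True) + a_s.sum(-1, keepdims=True)
out_split = (a_p @ v_p + a_s @ v_s) / denom

# The split computation matches the full attention exactly, which is why
# the method can be training-equivalent to standard GRPO.
assert np.allclose(out_full, out_split)
```

Because the merge is exact (not an approximation), the forward outputs, and hence the backward gradients through them, are identical to the redundant-prefix baseline, consistent with the training-equivalence claim above.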