Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward
June 5, 2025
Authors: Zikang Liu, Tongtian Yue, Yepeng Tang, Longteng Guo, Junxian Cai, Qingbin Liu, Xi Chen, Jing Liu
cs.AI
Abstract
Group Relative Policy Optimization (GRPO) enhances policy learning by
computing gradients from relative comparisons among candidate outputs that
share a common input prefix. Despite its effectiveness, GRPO introduces
substantial computational overhead when processing long shared prefixes, which
must be redundantly encoded for each group member. This inefficiency becomes a
major scalability bottleneck in long-context learning scenarios. We propose
Prefix Grouper, an efficient GRPO training algorithm that eliminates redundant
prefix computation via a Shared-Prefix Forward strategy. In particular, by
restructuring self-attention into two parts, our method enables the shared
prefix to be encoded only once, while preserving full differentiability and
compatibility with end-to-end training. We provide both theoretical and
empirical evidence that Prefix Grouper is training-equivalent to standard GRPO:
it yields identical forward outputs and backward gradients, ensuring that the
optimization dynamics and final policy performance remain unchanged.
Our experiments confirm that Prefix Grouper achieves results consistent
with standard GRPO while significantly reducing the computational cost of training,
particularly in long-prefix scenarios. The proposed method is fully
plug-and-play: it is compatible with existing GRPO-based architectures and can
be seamlessly integrated into current training pipelines as a drop-in
replacement, requiring no structural modifications and only minimal changes to
input construction and attention computation. Prefix Grouper enables the use of
larger group sizes under the same computational budget, thereby improving the
scalability of GRPO to more complex tasks and larger models. Code is now
available at https://github.com/johncaged/PrefixGrouper.
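To make the two-part attention restructuring concrete, here is a minimal single-layer PyTorch sketch of the shared-prefix idea. It is an illustration under stated assumptions, not the actual PrefixGrouper implementation: the function name shared_prefix_attention, the explicit projection weights, and the tensor layout are invented for exposition.

```python
import torch
import torch.nn.functional as F

def shared_prefix_attention(prefix_h, suffix_h, w_q, w_k, w_v, num_heads):
    """Single-layer causal attention split into two parts (illustrative).

    prefix_h: (1, P, D) hidden states of the shared prefix, encoded once.
    suffix_h: (G, S, D) hidden states of the G group members' suffixes.
    """
    G, S, D = suffix_h.shape
    P = prefix_h.shape[1]
    d_head = D // num_heads

    def heads(x):  # (B, T, D) -> (B, num_heads, T, d_head)
        B, T, _ = x.shape
        return x.view(B, T, num_heads, d_head).transpose(1, 2)

    # Part 1: causal self-attention over the prefix, computed once per group.
    q_p, k_p, v_p = heads(prefix_h @ w_q), heads(prefix_h @ w_k), heads(prefix_h @ w_v)
    prefix_out = F.scaled_dot_product_attention(q_p, k_p, v_p, is_causal=True)

    # Part 2: each suffix attends to the shared prefix K/V plus its own
    # causal suffix K/V. expand() broadcasts a view without copying, so the
    # backward pass sums all G suffix gradients into the single prefix encoding.
    q_s, k_s, v_s = heads(suffix_h @ w_q), heads(suffix_h @ w_k), heads(suffix_h @ w_v)
    k = torch.cat([k_p.expand(G, -1, -1, -1), k_s], dim=2)  # (G, H, P+S, d)
    v = torch.cat([v_p.expand(G, -1, -1, -1), v_s], dim=2)

    # Mask: suffix position i sees all P prefix tokens and suffix tokens 0..i.
    mask = torch.zeros(S, P + S, dtype=torch.bool, device=suffix_h.device)
    mask[:, :P] = True
    mask[:, P:] = torch.tril(torch.ones(S, S, dtype=torch.bool, device=suffix_h.device))
    suffix_out = F.scaled_dot_product_attention(q_s, k, v, attn_mask=mask)
    return prefix_out, suffix_out  # (1, H, P, d), (G, H, S, d)
```

Because the prefix keys and values are broadcast rather than recomputed, the prefix portion of the attention cost is paid once instead of G times: as a back-of-the-envelope count (not a figure from the paper), standard GRPO performs roughly G(P+S)^2 attention operations per layer, while the shared-prefix form performs P^2 + GS(P+S), a saving that grows with both the prefix length P and the group size G.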