Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward

Abstract

Group Relative Policy Optimization (GRPO) enhances policy learning by computing gradients from relative comparisons among candidate outputs that share a common input prefix. Despite its effectiveness, GRPO introduces substantial computational overhead when processing long shared prefixes, which must be redundantly encoded for each group member. This inefficiency becomes a major scalability bottleneck in long-context learning scenarios. We propose Prefix Grouper, an efficient GRPO training algorithm that eliminates redundant prefix computation via a Shared-Prefix Forward strategy. In particular, by restructuring self-attention into two parts, our method enables the shared prefix to be encoded only once, while preserving full differentiability and compatibility with end-to-end training. We provide both theoretical and empirical evidence that Prefix Grouper is training-equivalent to standard GRPO: it yields identical forward outputs and backward gradients, ensuring that the optimization dynamics and final policy performance remain unchanged. Empirically, our experiments confirm that Prefix Grouper achieves consistent results while significantly reducing the computational cost of training, particularly in long-prefix scenarios. The proposed method is fully plug-and-play: it is compatible with existing GRPO-based architectures and can be seamlessly integrated into current training pipelines as a drop-in replacement, requiring no structural modifications and only minimal changes to input construction and attention computation. Prefix Grouper enables the use of larger group sizes under the same computational budget, thereby improving the scalability of GRPO to more complex tasks and larger models. Code is now available at https://github.com/johncaged/PrefixGrouper
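
The two-part attention described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It is a toy, pure-Python, single-head causal attention that uses the token vectors directly as queries, keys, and values. All function names (`attend`, `causal_mask`, `standard_forward`, `shared_prefix_forward`, `max_abs_diff`) are hypothetical and chosen for illustration. Under these assumptions, it shows that encoding the prefix once and letting each suffix attend to the prefix plus itself reproduces the outputs of re-encoding the full sequence for every group member.

```python
import math
import random


def attend(queries, keys, values, mask):
    """Single-head scaled dot-product attention (toy version, no projections).

    mask[i][j] is True where query i is allowed to see key j.
    """
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if mask[i][j] else float("-inf")
            for j, k in enumerate(keys)
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]  # masked keys get weight 0.0
        total = sum(weights)
        outputs.append(
            [sum(w / total * v[c] for w, v in zip(weights, values)) for c in range(dim)]
        )
    return outputs


def causal_mask(n_queries, n_keys, offset=0):
    """Query i sits at absolute position offset + i and sees keys at positions <= its own."""
    return [[j <= offset + i for j in range(n_keys)] for i in range(n_queries)]


def standard_forward(prefix, suffixes):
    """Baseline: re-encode the full prefix + suffix sequence for every group member."""
    results = []
    for suffix in suffixes:
        seq = prefix + suffix
        results.append(attend(seq, seq, seq, causal_mask(len(seq), len(seq))))
    return results


def shared_prefix_forward(prefix, suffixes):
    """Two-part attention: the prefix once, then each suffix over prefix + itself."""
    p = len(prefix)
    prefix_out = attend(prefix, prefix, prefix, causal_mask(p, p))  # computed once
    results = []
    for suffix in suffixes:
        keys = prefix + suffix
        mask = causal_mask(len(suffix), len(keys), offset=p)
        suffix_out = attend(suffix, keys, keys, mask)
        results.append(prefix_out + suffix_out)
    return results


def max_abs_diff(a, b):
    """Largest element-wise difference between two nested group outputs."""
    return max(
        abs(x - y)
        for seq_a, seq_b in zip(a, b)
        for row_a, row_b in zip(seq_a, seq_b)
        for x, y in zip(row_a, row_b)
    )


if __name__ == "__main__":
    random.seed(0)

    def toy_tokens(n, dim=4):
        return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]

    prefix = toy_tokens(8)                       # long shared prompt
    suffixes = [toy_tokens(3) for _ in range(4)]  # group of 4 candidate outputs
    diff = max_abs_diff(
        standard_forward(prefix, suffixes),
        shared_prefix_forward(prefix, suffixes),
    )
    print("max abs difference:", diff)
```

In this toy setting, the baseline runs prefix queries G times for a group of size G, while the shared variant runs them once. The sketch only checks forward outputs for a single attention layer; the gradient equivalence claimed in the abstract is the paper's result and is not verified here. A real model would also need learned projections, multiple heads, and a framework with automatic differentiation.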
