TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models
December 9, 2025
Authors: Zheng Ding, Weirui Ye
cs.AI
Abstract
Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) high sample efficiency, achieving better performance with the same number of training samples; (2) fine-grained credit assignment via reward backpropagation, which computes step-specific advantages and overcomes the uniform credit assignment of trajectory-based methods; and (3) amortized computation, where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves 2.4× faster training while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based alignment of visual generative models. The project website is available at treegrpo.github.io.
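To make the tree-structured credit assignment concrete, the following is a minimal, self-contained Python sketch of the general idea: denoising trajectories grow as a tree from one shared initial noise sample, leaf rewards are backed up to internal nodes, and each child receives a group-relative advantage against its siblings. The branching schedule, the mean-based value backup, and the sibling-mean advantage are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of tree-based advantage computation; details (branching
# schedule, value backup, normalization) are assumptions, not TreeGRPO's spec.
from dataclasses import dataclass, field
from typing import List, Optional
import random


@dataclass
class Node:
    step: int                           # denoising step index of this node
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None      # leaf reward from a reward model
    value: Optional[float] = None       # backed-up value (mean of descendants)
    advantage: Optional[float] = None   # step-specific, group-relative advantage


def build_tree(depth: int, branch: int) -> Node:
    """Grow a tree from one shared initial-noise root; every internal node
    branches into `branch` children, and leaves get a simulated reward."""
    root = Node(step=0)
    frontier = [root]
    for step in range(1, depth + 1):
        next_frontier = []
        for node in frontier:
            node.children = [Node(step=step) for _ in range(branch)]
            next_frontier.extend(node.children)
        frontier = next_frontier
    for leaf in frontier:
        leaf.reward = random.random()   # stand-in for an image reward model score
    return root


def backup(node: Node) -> float:
    """Backward pass: a node's value is its reward at a leaf, otherwise the
    mean of its children's backed-up values."""
    if not node.children:
        node.value = node.reward
        return node.value
    child_values = [backup(child) for child in node.children]
    node.value = sum(child_values) / len(child_values)
    return node.value


def assign_advantages(node: Node) -> None:
    """Each branch point forms a GRPO-style group: a child's advantage is its
    value minus the mean value of its sibling group."""
    if not node.children:
        return
    group_mean = sum(child.value for child in node.children) / len(node.children)
    for child in node.children:
        child.advantage = child.value - group_mean
        assign_advantages(child)


if __name__ == "__main__":
    root = build_tree(depth=3, branch=2)   # 3 branch points, 2 children each
    backup(root)
    assign_advantages(root)
    # Each parent->child edge now carries a step-specific advantage that could
    # weight the policy-gradient update for that denoising step.
    print(f"root value: {root.value:.3f}")
```

Because siblings share their prefix up to the branch point, the same forward computation contributes to several advantage-weighted updates, which is one plausible reading of the amortized-computation claim above.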