LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
April 16, 2026
Authors: Zhanhao Liang, Tao Yang, Jie Wu, Chengjian Feng, Liang Zheng
cs.AI
Abstract
This paper focuses on aligning flow matching models with human preferences. A promising approach is to fine-tune by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories incurs prohibitive memory costs and gradient explosion. As a result, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from the reward to early generation steps. Specifically, we shorten the long trajectory to only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign enables efficient and stable model updates at any generation step. To make better use of such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitudes, instead of removing them entirely as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.
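The two core ideas in the abstract can be illustrated with a toy sketch. Below, a one-dimensional rectified flow with a closed-form velocity field stands in for the generation ODE; a "leap" is a single Euler step spanning many sampling steps, and `soft_downweight` shows scaling down (rather than dropping) large-magnitude gradient terms. All function names and the scalar setup are illustrative assumptions, not the authors' implementation.

```python
def velocity(x, t, x1):
    # Rectified-flow velocity for the straight path x_t = (1-t)*x0 + t*x1:
    # v(x_t, t) = (x1 - x_t) / (1 - t), pointing toward the data point x1.
    return (x1 - x) / (1.0 - t)

def euler_traj(x0, x1, steps):
    # Long trajectory: many small Euler steps from t=0 toward t=1.
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t, x1)
    return x

def leap(x, t_start, t_end, x1):
    # One "leap": a single Euler step spanning the interval [t_start, t_end],
    # i.e. skipping all the intermediate ODE sampling steps at once.
    return x + (t_end - t_start) * velocity(x, t_start, x1)

def soft_downweight(g, tau=1.0):
    # Scale large-magnitude gradient terms down to magnitude tau
    # instead of removing them outright.
    mag = abs(g)
    return g if mag <= tau else g * (tau / mag)

x0, x1 = 0.0, 2.0
# Two-leap shortened trajectory 0 -> t_mid -> 1 (LeapAlign randomizes t_mid).
t_mid = 0.5
x_leap = leap(leap(x0, 0.0, t_mid, x1), t_mid, 1.0, x1)
x_long = euler_traj(x0, x1, steps=50)
print(x_leap, x_long)  # for this linear toy flow, both land on x1
```

Because the toy velocity field is linear in `x`, both the 50-step trajectory and the two-leap trajectory reach the same endpoint; on a learned, nonlinear velocity field the two differ, which is why LeapAlign upweights shortened trajectories that stay consistent with the long generation path. Backpropagating a reward through only two leaps keeps the computation graph two steps deep regardless of the original sampler's step count.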