ChatPaper.ai


LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

April 16, 2026
作者: Zhanhao Liang, Tao Yang, Jie Wu, Chengjian Feng, Liang Zheng
cs.AI

Abstract

This paper focuses on the alignment of flow matching models with human preferences. A promising way is fine-tuning by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early generation steps. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitude, instead of completely removing them as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.
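The two-leap idea described above can be illustrated with a minimal, self-contained sketch. Everything here is a toy stand-in: the `velocity` field, the dynamics, and the `threshold` value are hypothetical, whereas the actual method backpropagates reward gradients through a learned flow-matching model such as Flux. The sketch shows (1) a "leap" that crosses many ODE sampling steps with a single Euler prediction, (2) a two-leap trajectory with a randomized intermediate timestep, and (3) soft downweighting of large-magnitude gradient terms instead of removing them:

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity(x, t):
    # Hypothetical toy dynamics standing in for the learned
    # flow-matching velocity field.
    return -x + np.sin(t)

def leap(x, t_start, t_end):
    # One "leap": skip multiple ODE sampling steps and predict
    # the future latent in a single Euler step.
    return x + (t_end - t_start) * velocity(x, t_start)

def two_leap_trajectory(x0, t0=1.0, t_final=0.0):
    # Randomize the intermediate timestep so that, over training,
    # updates can target any generation step.
    t_mid = rng.uniform(t_final, t0)
    x_mid = leap(x0, t0, t_mid)
    x_out = leap(x_mid, t_mid, t_final)
    return x_out, t_mid

def soft_downweight(grad, threshold=1.0):
    # Reduce the weight of large-magnitude gradient terms
    # (rather than dropping them entirely, as prior work does).
    mag = np.abs(grad)
    scale = np.where(mag > threshold, threshold / mag, 1.0)
    return grad * scale

x0 = rng.standard_normal(4)
x_out, t_mid = two_leap_trajectory(x0)
```

Note that `soft_downweight` rescales each oversized term to the threshold magnitude while leaving small terms untouched, so no gradient information is discarded outright.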
PDF | April 18, 2026