Exploring the Design Space of Reward Backpropagation in Flow Matching

Abstract

Aligning text-to-image flow matching models with human preferences via direct reward backpropagation is sample-efficient but hampered by two well-known pathologies: activations cannot be stored across the full sampling trajectory at modern model scale, and chained Jacobian products across steps inflate the reward gradient as it travels back to early indices. Connector-based methods, such as LeapAlign, address these issues by replacing the full backward trajectory with a short pinned path, highlighting a useful decoupling between sampling and optimization. However, the quality of the resulting gradient depends on how accurately this short path approximates the full rollout, especially over long intervals. We propose FlowBP, a unified surrogate-trajectory framework that treats the backward trajectory itself as the design object. FlowBP keeps a no-gradient cached rollout for sampling, then builds a lightweight backward surrogate from cached and selectively re-forwarded velocities. This view separates four choices: the reward-model input, active set, integration weights, and bridge coupling, and recovers prior direct-gradient methods as particular settings. Within this framework, we instantiate three variants: FlowBP-Sparse uses sparse Euler reconstruction, FlowBP-Bridge adds controlled bridge coupling, and FlowBP-Lagrange raises the order of leap quadrature. All three bound memory by the active-set size and limit gradient chaining to at most one Jacobian factor. Across SD3.5-M, FLUX.1-dev, and FLUX.2-Klein-base on preference, quality, and compositional metrics, the three variants improve over direct-gradient baselines on most metrics.
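
To make the surrogate-trajectory idea concrete, the following is a minimal NumPy sketch of one possible reading of the sparse Euler reconstruction. It is an illustration under stated assumptions, not the authors' implementation: the toy velocity field `tanh(W @ x)`, the quadratic reward, the uniform Euler weights `dt`, and all function names are hypothetical stand-ins. The rollout is run without gradients and cached; the surrogate endpoint is then rebuilt from cached velocities, with only the steps in `active` re-forwarded. Because the cached states are treated as constants, each active step contributes a single Jacobian factor and no products are chained across steps.

```python
import numpy as np

def velocity(W, x):
    # Toy velocity field standing in for the flow model (hypothetical).
    return np.tanh(W @ x)

def cached_rollout(W, x0, num_steps):
    # Gradient-free Euler sampling; only states and velocities are stored.
    dt = 1.0 / num_steps
    states, velocities = [], []
    x = x0.copy()
    for _ in range(num_steps):
        v = velocity(W, x)
        states.append(x.copy())
        velocities.append(v)
        x = x + dt * v
    return x, states, velocities, dt

def surrogate_reward_grad(W, x0, states, velocities, dt, active, target):
    # Surrogate endpoint: x0 + dt * sum_k v_k. For k in `active`, v_k is
    # re-forwarded at the cached state; otherwise it is read from the cache.
    x_hat = x0.copy()
    for k, x_k in enumerate(states):
        v_k = velocity(W, x_k) if k in active else velocities[k]
        x_hat = x_hat + dt * v_k
    # Assumed toy reward R(x) = -0.5 * ||x - target||^2, so dR/dx = target - x.
    g = target - x_hat
    # Each active step adds one Jacobian factor dv_k/dW. Cached states are
    # constants, so no Jacobian products are chained across steps.
    grad = np.zeros_like(W)
    for k in active:
        v_k = velocity(W, states[k])
        grad += dt * np.outer(g * (1.0 - v_k ** 2), states[k])
    return x_hat, grad

rng = np.random.default_rng(0)
W = 0.5 * rng.standard_normal((3, 3))
x0 = rng.standard_normal(3)
target = np.ones(3)

x_T, states, velocities, dt = cached_rollout(W, x0, num_steps=8)
x_hat, grad = surrogate_reward_grad(
    W, x0, states, velocities, dt, active={1, 6}, target=target
)
```

In this sketch the surrogate endpoint `x_hat` coincides with the sampled endpoint `x_T`, since the re-forwarded velocities are evaluated at the cached states. With an autodiff framework, only the re-forwarded steps would need stored activations, which is the sense in which memory scales with the active-set size. The bridge-coupling and higher-order quadrature variants are not covered here.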