Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
September 8, 2025
Authors: Xiangwei Shen, Zhimin Li, Zhantao Yang, Shiyi Zhang, Yingfang Zhang, Donghao Li, Chunyu Wang, Qinglin Lu, Yansong Tang
cs.AI
Abstract
Recent studies have demonstrated the effectiveness of directly aligning
diffusion models with human preferences using differentiable rewards. However,
these methods face two primary challenges: (1) they rely on multistep denoising
with gradient computation for reward scoring, which is computationally
expensive and restricts optimization to only a few diffusion steps; (2) they
often need continuous offline adaptation of reward models to achieve the
desired aesthetic quality, such as photorealism or precise lighting effects. To
address the limitation of multistep denoising, we propose Direct-Align, a
method that predefines a noise prior so that original images can be recovered
from any timestep via interpolation. It leverages the fact that diffusion
states are interpolations between noise and target images, which effectively
avoids over-optimization in late timesteps. Furthermore, we introduce Semantic
Relative Preference Optimization (SRPO), in which rewards are formulated as
text-conditioned signals. This approach enables online adjustment of rewards in
response to positive and negative prompt augmentation, thereby reducing the
reliance on offline reward fine-tuning. By fine-tuning the FLUX.1.dev model
with optimized denoising and online reward adjustment, we improve its
human-evaluated realism and aesthetic quality by over 3x.
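
The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a linear interpolation schedule x_t = (1 - sigma) * x0 + sigma * eps, and it stands in for a reward model with plain cosine similarity between embeddings. The function names add_noise, recover_image, and relative_reward are hypothetical and chosen only for illustration.

```python
import numpy as np


def add_noise(x0, eps, sigma):
    """Build a noisy state by interpolating between image x0 and known noise eps.

    Assumption: a linear schedule, x_t = (1 - sigma) * x0 + sigma * eps.
    """
    return (1.0 - sigma) * x0 + sigma * eps


def recover_image(xt, eps, sigma):
    """Invert the interpolation in closed form, reusing the predefined noise eps."""
    if not 0.0 <= sigma < 1.0:
        raise ValueError("sigma must be in [0, 1); at sigma = 1 no image signal remains")
    return (xt - sigma * eps) / (1.0 - sigma)


def relative_reward(image_emb, pos_text_emb, neg_text_emb):
    """Text-conditioned relative reward (illustrative stand-in for a reward model).

    Scores the image against a positively augmented prompt and a negatively
    augmented prompt, and returns the difference of the two scores.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    return cos(image_emb, pos_text_emb) - cos(image_emb, neg_text_emb)


# Round trip: because eps is fixed in advance, x0 is recoverable from x_t
# in a single step, even at a high noise level.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 8, 8))
eps = rng.standard_normal((3, 8, 8))
xt = add_noise(x0, eps, sigma=0.9)
x0_hat = recover_image(xt, eps, sigma=0.9)
print(np.allclose(x0, x0_hat))
```

In this toy setting the recovery is exact because the full noise sample is known. The sketch only shows why a predefined noise prior makes single-step recovery possible, and why a reward defined as a difference between prompt-conditioned scores can be steered online by changing the prompts rather than retraining the reward model.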