Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
September 8, 2025
Authors: Xiangwei Shen, Zhimin Li, Zhantao Yang, Shiyi Zhang, Yingfang Zhang, Donghao Li, Chunyu Wang, Qinglin Lu, Yansong Tang
cs.AI
Abstract
Recent studies have demonstrated the effectiveness of directly aligning
diffusion models with human preferences using differentiable rewards. However,
these methods face two primary challenges: (1) they rely on multistep denoising
with gradient computation for reward scoring, which is computationally
expensive and restricts optimization to only a few diffusion steps; (2) they
often need continuous offline adaptation of reward models to achieve the
desired aesthetic quality, such as photorealism or precise lighting effects. To
address the limitation of multistep denoising, we propose Direct-Align, a
method that predefines a noise prior so that the original image can be
recovered from any timestep via interpolation. It leverages the fact that
diffusion states are interpolations between noise and target images, and it
effectively avoids over-optimization in late timesteps. Furthermore, we
introduce Semantic Relative Preference Optimization (SRPO), in which rewards
are formulated as text-conditioned signals. This approach enables online
adjustment of rewards in response to positive and negative prompt augmentation,
thereby reducing the reliance on offline reward fine-tuning. By fine-tuning the
FLUX.1.dev model with optimized denoising and online reward adjustment, we
improve its human-evaluated realism and aesthetic quality by over 3x.
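
The two ideas in the abstract can be illustrated with a minimal sketch. It is
not the authors' implementation, and it rests on two assumptions the abstract
does not spell out: that a noisy state is a linear blend
(1 - sigma) * image + sigma * noise, and that a text-conditioned reward can be
compared across a positively and a negatively augmented prompt by taking a
difference. The names `interpolate`, `recover_clean`, `relative_reward`, and
`sigma`, as well as the cosine-similarity reward over toy embeddings, are
hypothetical and chosen only for illustration.

```python
import math
import random


def interpolate(x0, noise, sigma):
    """Noisy state as a linear blend of clean image and noise (assumed schedule)."""
    return [(1.0 - sigma) * a + sigma * n for a, n in zip(x0, noise)]


def recover_clean(xt, noise, sigma):
    """Invert the blend in closed form, given the predefined noise sample."""
    if not 0.0 <= sigma < 1.0:
        raise ValueError("sigma must be in [0, 1); at sigma = 1 the image is lost")
    return [(x - sigma * n) / (1.0 - sigma) for x, n in zip(xt, noise)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_reward(image_emb, pos_text_emb, neg_text_emb):
    """Toy relative reward: score under the positive prompt minus the negative one."""
    return cosine(image_emb, pos_text_emb) - cosine(image_emb, neg_text_emb)


# Recovery check: because the noise is fixed in advance, the clean image can be
# reconstructed from the noisy state at any noise level below 1.
rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(16)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]
max_err = 0.0
for sigma in (0.1, 0.5, 0.9):
    xt = interpolate(x0, noise, sigma)
    x0_hat = recover_clean(xt, noise, sigma)
    max_err = max(max_err, max(abs(a - b) for a, b in zip(x0, x0_hat)))

# Relative reward on toy 2-D embeddings: the image matches the positive prompt.
toy_reward = relative_reward([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

In this sketch the recovery is exact only because the true noise is known. In
training, the state would also pass through the model, so the recovered image
would depend on the model's prediction and could be scored by a differentiable
reward. Changing the positive or negative prompt changes the toy reward without
retraining anything, which is the sense in which such a signal could be
adjusted online.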