InstructVideo: Instructing Video Diffusion Models with Human Feedback
December 19, 2023
作者: Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni
cs.AI
Abstract
Diffusion models have emerged as the de facto paradigm for video generation.
However, their reliance on web-scale data of varied quality often yields
results that are visually unappealing and misaligned with the textual prompts.
To tackle this problem, we propose InstructVideo to instruct text-to-video
diffusion models with human feedback by reward fine-tuning. InstructVideo has
two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by
generating through the full DDIM sampling chain, we recast reward fine-tuning
as editing. By leveraging the diffusion process to corrupt a sampled video,
InstructVideo requires only partial inference of the DDIM sampling chain,
reducing fine-tuning cost while improving fine-tuning efficiency. 2) To
mitigate the absence of a dedicated video reward model for human preferences,
we repurpose established image reward models, e.g., HPSv2. To this end, we
propose Segmental Video Reward, a mechanism to provide reward signals based on
segmental sparse sampling, and Temporally Attenuated Reward, a method that
mitigates temporal modeling degradation during fine-tuning. Extensive
experiments, both qualitative and quantitative, validate the practicality and
efficacy of using image reward models in InstructVideo, significantly enhancing
the visual quality of generated videos without compromising generalization
capabilities. Code and models will be made publicly available.
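As a rough illustration of the recipe the abstract describes, below is a minimal PyTorch-style sketch, not the authors' released implementation: a sampled video is corrupted by the forward diffusion process, only a short tail of the DDIM chain is re-run with gradients, one frame per temporal segment is scored by an image reward model (a toy stand-in for HPSv2 here), and the per-segment rewards are attenuated before forming the loss. All class names, hyperparameters, and the specific attenuation schedule are illustrative assumptions.

```python
# Hedged sketch of "reward fine-tuning as editing" as summarized in the abstract.
# Models and schedules below are toy placeholders, not the paper's actual code.
import torch
import torch.nn as nn

class ToyVideoDenoiser(nn.Module):
    """Stand-in for a text-to-video diffusion UNet (predicts noise)."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
    def forward(self, x, t):
        return self.net(x)

class ToyImageReward(nn.Module):
    """Stand-in for an image reward model such as HPSv2 (higher = preferred)."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(3, 1)
    def forward(self, frames):                       # frames: (B*K, C, H, W)
        return self.head(frames.mean(dim=(2, 3))).squeeze(-1)

def ddim_edit_finetune_step(denoiser, reward_model, video, alphas_cumprod,
                            t_start=200, ddim_steps=5, num_segments=4, gamma=0.9):
    """One reward fine-tuning step, framed as editing an already-sampled video."""
    B, C, T, H, W = video.shape
    # 1) Corrupt the sampled video with the forward diffusion process.
    a_t = alphas_cumprod[t_start]
    noise = torch.randn_like(video)
    x = a_t.sqrt() * video + (1 - a_t).sqrt() * noise
    # 2) Partial DDIM inference: re-denoise only from t_start, with gradients.
    timesteps = torch.linspace(t_start, 0, ddim_steps + 1).long()
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = denoiser(x, t_cur)
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # deterministic DDIM update
    # 3) Segmental sparse sampling: score one frame per temporal segment.
    seg_len = T // num_segments
    frame_ids = [i * seg_len + seg_len // 2 for i in range(num_segments)]
    frames = x[:, :, frame_ids].permute(0, 2, 1, 3, 4).reshape(-1, C, H, W)
    rewards = reward_model(frames).view(B, num_segments)
    # 4) Temporal attenuation: one simple choice (later segments weighted less);
    #    the paper's exact schedule may differ.
    weights = gamma ** torch.arange(num_segments, dtype=rewards.dtype)
    loss = -(rewards * weights).sum(dim=1).mean()
    return loss

# Usage on toy data: maximize the (toy) reward of partially re-denoised videos.
denoiser, reward_model = ToyVideoDenoiser(), ToyImageReward()
alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, 1000), dim=0)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
video = torch.rand(1, 3, 16, 32, 32)                 # a video sampled from the base model
opt.zero_grad()
loss = ddim_edit_finetune_step(denoiser, reward_model, video, alphas_cumprod)
loss.backward()
opt.step()
```

The point of the sketch is the cost argument from the abstract: because optimization starts from a corrupted sample rather than pure noise, gradients only need to flow through a handful of DDIM steps instead of the full sampling chain.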