
Improving Video Generation with Human Feedback

January 23, 2025
Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang
cs.AI

Abstract

Video generation has achieved significant advances through rectified flow techniques, but issues such as unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multiple dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices affect its effectiveness as a reward model. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies, direct preference optimization for flow (Flow-DPO) and reward-weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and that Flow-DPO outperforms both Flow-RWR and standard supervised fine-tuning. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.
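For reference, the KL-regularized reward-maximization objective the abstract alludes to usually takes the standard RLHF form below (a sketch in common notation, not necessarily the paper's exact formulation: r is the learned reward model, beta the regularization strength, and p_ref the pretrained reference model):

```latex
% Standard KL-regularized reward maximization (sketch):
% push the generator toward high-reward samples while
% penalizing divergence from the pretrained reference model.
\max_{\theta}\; \mathbb{E}_{x \sim p_{\theta}}\big[ r(x) \big]
  \;-\; \beta\, D_{\mathrm{KL}}\!\big( p_{\theta} \,\|\, p_{\mathrm{ref}} \big)
```

In the abstract's framing, Flow-DPO and Flow-RWR pursue this objective at training time, while Flow-NRG approximates it at inference time by applying reward guidance during sampling.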
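The inference-time idea behind Flow-NRG, steering a flow model's sampling step with gradients of a reward evaluated on the noisy video latent, might look like the following minimal PyTorch sketch. All names here (`flow_nrg_step`, `velocity_model`, `reward_heads`) and the simple Euler update are hypothetical illustrations of the technique, not the paper's implementation:

```python
import torch

def flow_nrg_step(x_t, t, velocity_model, reward_heads, weights,
                  dt, guidance_scale=1.0):
    """One Euler step of rectified-flow sampling with weighted reward guidance.

    Hypothetical sketch of the Flow-NRG idea from the abstract: reward
    heads score the *noisy* latent directly, and the gradient of their
    user-weighted sum steers the velocity field toward preferred videos.
    """
    with torch.enable_grad():
        x = x_t.detach().requires_grad_(True)
        # User-weighted mix of per-dimension rewards
        # (e.g. motion smoothness, text-video alignment).
        total_reward = sum(w * head(x, t)
                           for head, w in zip(reward_heads, weights))
        grad = torch.autograd.grad(total_reward.sum(), x)[0]
    v = velocity_model(x_t, t)                     # base flow velocity
    return x_t + dt * (v + guidance_scale * grad)  # reward-guided Euler update
```

Adjusting `weights` at sampling time is what would give users the per-objective control the abstract describes, without retraining the generator.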
