InstructVideo: Instructing Video Diffusion Models with Human Feedback

Abstract

Diffusion models have emerged as the de facto paradigm for video generation. However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2. To this end, we propose Segmental Video Reward, a mechanism to provide reward signals based on segmental sparse sampling, and Temporally Attenuated Reward, a method that mitigates temporal modeling degradation during fine-tuning. Extensive experiments, both qualitative and quantitative, validate the practicality and efficacy of using image reward models in InstructVideo, significantly enhancing the visual quality of generated videos without compromising generalization capabilities. Code and models will be made publicly available.
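As a rough illustration of the second ingredient, the sketch below shows one way per-frame scores from an image reward model could be turned into a single video-level reward: one frame is drawn from each contiguous segment (segmental sparse sampling), and the sampled frames are combined with weights that attenuate away from the central sample. The abstract gives no formulas, so the center-peaked geometric weighting, the `decay` parameter, and every function name here are assumptions made for illustration, not the authors' implementation. `frame_reward` is a hypothetical stand-in for an image reward model such as HPSv2.

```python
import random
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def segmental_sparse_indices(num_frames: int, num_segments: int,
                             rng: random.Random) -> List[int]:
    """Split [0, num_frames) into contiguous segments and draw one index from each."""
    if not 1 <= num_segments <= num_frames:
        raise ValueError("need 1 <= num_segments <= num_frames")
    # Integer segment boundaries; each segment holds at least one frame.
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def attenuated_weights(num_samples: int, decay: float) -> List[float]:
    """Assumed weighting: peaks at the central sample, decays geometrically
    with distance from it, and is normalized to sum to 1."""
    center = (num_samples - 1) / 2
    raw = [decay ** abs(i - center) for i in range(num_samples)]
    total = sum(raw)
    return [w / total for w in raw]


def video_reward(frames: Sequence[Frame],
                 frame_reward: Callable[[Frame], float],
                 num_segments: int = 4,
                 decay: float = 0.5,
                 seed: int = 0) -> float:
    """Weighted sum of image-level rewards over sparsely sampled frames."""
    rng = random.Random(seed)
    indices = segmental_sparse_indices(len(frames), num_segments, rng)
    weights = attenuated_weights(len(indices), decay)
    return sum(w * frame_reward(frames[i]) for w, i in zip(weights, indices))


if __name__ == "__main__":
    # Toy "video": 16 frames represented by their brightness; the reward is the brightness itself.
    toy_frames = [i / 15 for i in range(16)]
    print(video_reward(toy_frames, frame_reward=lambda f: f))
```

Scoring only a few frames keeps the number of reward-model calls small, which is the practical appeal of sparse sampling; how the weights should actually be shaped is a design choice the abstract leaves open.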
