Free^2Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models

Abstract

Diffusion models have achieved impressive results in generative tasks such as text-to-image (T2I) and text-to-video (T2V) synthesis. However, achieving accurate text alignment in T2V generation remains challenging because of the complex temporal dependencies across frames. Existing reinforcement learning (RL)-based approaches to enhancing text alignment often require differentiable reward functions or are constrained to a limited set of prompts, which hinders their scalability and applicability. In this paper, we propose Free^2Guide, a novel gradient-free framework for aligning generated videos with text prompts without requiring additional model training. Leveraging principles from path integral control, Free^2Guide approximates guidance for diffusion models using non-differentiable reward functions, thereby enabling the integration of powerful black-box Large Vision-Language Models (LVLMs) as reward models. Additionally, our framework supports the flexible ensembling of multiple reward models, including large-scale image-based models, to synergistically enhance alignment without incurring substantial computational overhead. We demonstrate that Free^2Guide significantly improves text alignment across various dimensions and enhances the overall quality of generated videos.
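
The idea of guiding a sampler with non-differentiable rewards can be pictured with a small toy sketch. It is not the paper's algorithm: it assumes a scalar "latent", a random-walk stand-in for the denoiser, and two made-up step-like rewards standing in for black-box reward models. All function names and parameters (`guided_step`, `toy_propose`, `temperature`, and so on) are hypothetical. At each step, candidates are drawn from the unguided sampler, scored by a weighted ensemble of rewards, and one is resampled with probability proportional to exp(reward / temperature), which is the kind of reweighting that path integral control suggests. No reward gradient is ever computed.

```python
import math
import random


def ensemble_reward(sample, reward_fns, weights):
    """Weighted sum of black-box rewards; only function values are used."""
    return sum(w * fn(sample) for fn, w in zip(reward_fns, weights))


def softmax_weights(rewards, temperature):
    """Importance weights proportional to exp(reward / temperature)."""
    peak = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - peak) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def guided_step(x, propose, reward_fns, weights, n_candidates, temperature, rng):
    """Draw candidates from the unguided sampler, then resample one by reward."""
    candidates = [propose(x, rng) for _ in range(n_candidates)]
    rewards = [ensemble_reward(c, reward_fns, weights) for c in candidates]
    probs = softmax_weights(rewards, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Toy stand-ins (assumptions, not from the paper).
def toy_propose(x, rng):
    """Random-walk placeholder for one unguided denoising step."""
    return 0.9 * x + rng.gauss(0.0, 0.5)


def reward_positive(x):
    """Step function: non-differentiable at zero, flat elsewhere."""
    return 1.0 if x > 0 else 0.0


def reward_near_two(x):
    """Rounded distance to 2: piecewise constant, so gradients are useless."""
    return -abs(round(x) - 2)


def run(n_steps=20, n_candidates=8, temperature=0.1, seed=0):
    """Run the toy guided sampler and return the final scalar state."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)
    for _ in range(n_steps):
        x = guided_step(
            x, toy_propose, [reward_positive, reward_near_two],
            [1.0, 1.0], n_candidates, temperature, rng,
        )
    return x
```

Calling `run()` returns the final toy state; a lower `temperature` makes the resampling greedier, while a higher one keeps it closer to the unguided sampler. In a real setting the candidates would be video latents and each reward call would query a model, so the number of candidates per step would be the main cost knob.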