FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models

Abstract

Diffusion models have demonstrated remarkable capability in video generation, which has sparked further interest in introducing trajectory control into the generation process. While existing works mainly focus on training-based methods (e.g., conditional adapters), we argue that the diffusion model itself allows decent control over the generated content without requiring any training. In this study, we introduce a tuning-free framework to achieve trajectory-controllable video generation by imposing guidance on both noise construction and attention computation. Specifically, 1) we first show several instructive phenomena and analyze how initial noises influence the motion trajectory of generated content. 2) Subsequently, we propose FreeTraj, a tuning-free approach that enables trajectory control by modifying noise sampling and attention mechanisms. 3) Furthermore, we extend FreeTraj to facilitate longer and larger video generation with controllable trajectories. Equipped with these designs, users have the flexibility to provide trajectories manually or opt for trajectories automatically generated by an LLM trajectory planner. Extensive experiments validate the efficacy of our approach in enhancing the trajectory controllability of video diffusion models.
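
The idea that initial noise can steer where content moves can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `interpolate_boxes` and `trajectory_noise`, the normalized `(x0, y0, x1, y1)` box format, the fixed patch size, and the choice to copy one shared noise patch along the trajectory are all assumptions made for illustration. The sketch only shows one plausible way to correlate per-frame initial noise along a user-provided box trajectory.

```python
import numpy as np

def interpolate_boxes(keyframes, num_frames):
    """Linearly interpolate (x0, y0, x1, y1) boxes between keyframes.

    keyframes: dict mapping frame index -> box in normalized [0, 1] coordinates
    (a hypothetical input format, not taken from the paper).
    """
    idx = sorted(keyframes)
    coords = np.array([keyframes[i] for i in idx], dtype=np.float64)
    frames = np.arange(num_frames)
    # np.interp handles one coordinate at a time; frames outside the
    # keyframe range take the value of the nearest keyframe.
    return np.stack(
        [np.interp(frames, idx, coords[:, k]) for k in range(4)], axis=1
    )

def trajectory_noise(boxes, channels, height, width, seed=0):
    """Build per-frame Gaussian noise that shares one patch along the boxes."""
    rng = np.random.default_rng(seed)
    num_frames = len(boxes)
    noise = rng.standard_normal((num_frames, channels, height, width))

    # Assumption: the patch size is fixed from the first box so that the
    # same patch fits into every frame.
    x0, y0, x1, y1 = boxes[0]
    bh = min(height, max(1, int(round((y1 - y0) * height))))
    bw = min(width, max(1, int(round((x1 - x0) * width))))
    patch = rng.standard_normal((channels, bh, bw))

    for f, (bx0, by0, _, _) in enumerate(boxes):
        # Clamp the top-left corner so the patch stays inside the frame.
        top = min(max(int(round(by0 * height)), 0), height - bh)
        left = min(max(int(round(bx0 * width)), 0), width - bw)
        # Each frame stays i.i.d. standard normal on its own; only the
        # correlation across frames changes.
        noise[f, :, top:top + bh, left:left + bw] = patch
    return noise

# Usage: a box moving from the top-left to the bottom-right over 16 frames.
boxes = interpolate_boxes({0: (0.0, 0.0, 0.5, 0.5), 15: (0.5, 0.5, 1.0, 1.0)}, 16)
noise = trajectory_noise(boxes, channels=4, height=32, width=32)
```

In this sketch every frame carries the same noise patch at the position given by the trajectory, while the rest of each frame remains independent noise. Whether and how strongly a given video diffusion model follows such correlated noise is an empirical question that the sketch does not answer.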
