Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling

Abstract

Recent advances in text-to-video diffusion models enable high-quality short video generation from a single prompt, but generating real-world long videos in a single pass remains challenging because of limited data and high computational cost. To address this, several works propose tuning-free approaches that extend existing models to long video generation, specifically using multiple prompts to allow dynamic and controlled content changes. However, these methods focus primarily on smooth transitions between adjacent frames, which often leads to content drift and a gradual loss of semantic coherence over longer sequences. To address this issue, we propose Synchronized Coupled Sampling (SynCoS), a novel inference framework that synchronizes denoising paths across the entire video, ensuring long-range consistency across both adjacent and distant frames. Our approach combines two complementary sampling strategies: reverse sampling, which ensures seamless local transitions, and optimization-based sampling, which enforces global coherence. However, directly alternating between the two misaligns their denoising trajectories because they operate independently, which disrupts prompt guidance and introduces unintended content changes. To resolve this, SynCoS synchronizes them through a grounded timestep and a fixed baseline noise, ensuring fully coupled sampling with aligned denoising paths. Extensive experiments show that SynCoS significantly improves multi-event long video generation, achieving smoother transitions and superior long-range coherence and outperforming previous approaches both quantitatively and qualitatively.
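The abstract does not spell out how the fixed baseline noise is applied, so the following is only a toy sketch of one plausible reading: when a long video is processed as overlapping chunks, drawing the noise for every chunk from a single per-frame baseline keeps the chunks in agreement on the frames they share, whereas independent noise per chunk does not. Everything here is an assumption made for illustration: the "video" is one scalar per frame, no diffusion model is involved, and the names `chunk_starts`, `renoise_chunks`, and `max_overlap_disagreement` are hypothetical, not part of SynCoS.

```python
import random


def chunk_starts(num_frames, chunk_len, stride):
    """Start indices of overlapping chunks that together cover all frames."""
    if num_frames < chunk_len:
        raise ValueError("num_frames must be at least chunk_len")
    starts = list(range(0, num_frames - chunk_len + 1, stride))
    if starts[-1] + chunk_len < num_frames:
        starts.append(num_frames - chunk_len)  # extra chunk for the tail
    return starts


def renoise_chunks(clean, starts, chunk_len, sigma, baseline=None, seed=0):
    """Add noise to each chunk of a toy 1-D video (one scalar per frame).

    baseline: per-frame noise shared by all chunks (the "fixed baseline
    noise" assumption). If None, each chunk draws independent noise.
    """
    rng = random.Random(seed)
    chunks = []
    for s in starts:
        if baseline is None:
            noise = [rng.gauss(0.0, 1.0) for _ in range(chunk_len)]
        else:
            noise = baseline[s:s + chunk_len]
        chunks.append([clean[s + i] + sigma * noise[i] for i in range(chunk_len)])
    return chunks


def max_overlap_disagreement(chunks, starts, chunk_len):
    """Largest gap between any two chunks' values for the same frame."""
    per_frame = {}
    for s, chunk in zip(starts, chunks):
        for i, value in enumerate(chunk):
            per_frame.setdefault(s + i, []).append(value)
    return max(max(vs) - min(vs) for vs in per_frame.values())


num_frames, chunk_len, stride = 12, 6, 3
clean = [0.1 * t for t in range(num_frames)]
starts = chunk_starts(num_frames, chunk_len, stride)

base_rng = random.Random(42)
baseline = [base_rng.gauss(0.0, 1.0) for _ in range(num_frames)]

independent = renoise_chunks(clean, starts, chunk_len, sigma=0.5)
shared = renoise_chunks(clean, starts, chunk_len, sigma=0.5, baseline=baseline)

gap_independent = max_overlap_disagreement(independent, starts, chunk_len)
gap_shared = max_overlap_disagreement(shared, starts, chunk_len)
print(gap_independent, gap_shared)
```

With the shared baseline, overlapping frames receive identical values in every chunk, so the disagreement is zero; with independent noise it is not. This covers only the noise-sharing idea and says nothing about the grounded timestep or the optimization-based stage.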
