WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance

Abstract

Recent video diffusion models demonstrate strong potential in spatial intelligence tasks due to their rich latent world priors. However, this potential is hindered by their limited controllability and geometric inconsistency, creating a gap between their strong priors and their practical use in 3D/4D tasks. As a result, current approaches often rely on retraining or fine-tuning, which risks degrading pretrained knowledge and incurs high computational costs. To address this, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. Intra-Step Recursive Refinement introduces a recursive refinement mechanism during inference, which repeatedly optimizes network predictions within each denoising step to enable precise trajectory injection. Flow-Gated Latent Fusion leverages optical flow similarity to decouple motion from appearance in the latent space and selectively inject trajectory guidance into motion-related channels. Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths to adaptively correct trajectory drift caused by noisy or misaligned structural signals. Together, these components inject fine-grained, trajectory-aligned guidance without training, achieving both accurate motion control and photorealistic content generation. Extensive experiments across diverse benchmarks validate our method's superiority in realism, trajectory consistency, and visual fidelity. This work introduces a novel plug-and-play paradigm for controllable video synthesis, offering a new perspective on leveraging generative priors for spatial intelligence.
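
The abstract describes Intra-Step Recursive Refinement only at a high level, so the following is a toy sketch of one possible reading, not the authors' implementation. It assumes that "trajectory injection" means overwriting a masked region of the predicted clean latent with a trajectory-derived target, and that each inner iteration re-noises the fused latent back to the current noise level before predicting again. The names `toy_denoiser`, `recursive_refine_step`, `target`, `mask`, and `n_inner` are hypothetical and do not come from the paper.

```python
import numpy as np


def toy_denoiser(x_t, sigma):
    """Hypothetical stand-in for a video diffusion network.

    Returns a crude clean-latent estimate by shrinking the noisy input.
    A real model would be a learned network; this is an assumption
    made only so the sketch runs.
    """
    return x_t / (1.0 + sigma ** 2)


def recursive_refine_step(x_t, sigma, target, mask, denoiser, n_inner=3, rng=None):
    """One denoising step with an inner refinement loop (illustrative only).

    x_t    : noisy latent at the current noise level
    sigma  : noise standard deviation at this step
    target : trajectory-derived latent (assumed given, e.g. from warping)
    mask   : 1 where the target is trusted, 0 elsewhere
    """
    rng = np.random.default_rng(0) if rng is None else rng
    for _ in range(n_inner):
        # Predict the clean latent from the current noisy latent.
        x0_pred = denoiser(x_t, sigma)
        # Inject guidance: keep the target where the mask is 1.
        x0_fused = mask * target + (1.0 - mask) * x0_pred
        # Re-noise to the same noise level so the network can re-predict.
        x_t = x0_fused + sigma * rng.standard_normal(x_t.shape)
    # Final prediction and fusion after the inner loop.
    x0_pred = denoiser(x_t, sigma)
    return mask * target + (1.0 - mask) * x0_pred


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x_t = rng.standard_normal((4, 8))
    target = np.ones((4, 8))
    mask = np.zeros((4, 8))
    mask[:, :4] = 1.0  # trust the target in the left half only
    refined = recursive_refine_step(x_t, 1.0, target, mask, toy_denoiser, rng=rng)
    print(refined.shape)
```

In this sketch the masked region always matches the target exactly, while the unmasked region is left to the (toy) model prediction. The paper's other two modules, Flow-Gated Latent Fusion and Dual-Path Self-Corrective Guidance, are not modeled here.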

