LooseControlVideo: Directorial Video Control Using Spatial Blocking

Abstract

Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. Existing depth-conditioned models achieve good structural fidelity, but they require dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a "blocking" proxy. Users author the high-level layout and trajectories, while a video generative model produces realistic occlusions, dynamics, and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation, and depth-ordered occlusions. Our method also supports localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks show that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Compared with current state-of-the-art layout-conditioned models, we observe a 1.2x to 3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5x to 2x improvement in Occlusion Accuracy, demonstrating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.