Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion

Abstract

Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation and offer limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. Without a unified architecture, it is difficult to generate versatile video in which all three are controlled jointly. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and by downsampled RGB cues for foreground subjects. To balance controllability and visual realism, we further propose an inference-time ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into arbitrary scenes and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.
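The abstract does not specify the shape of the inference ControlNet scale schedule, so the following is only a minimal illustrative sketch of what such a schedule could look like. It assumes, purely for illustration, that the conditioning scale is held constant for the early denoising steps and then decays linearly; the function name `controlnet_scale_schedule` and the parameters `start_scale`, `end_scale`, and `hold_fraction` are hypothetical and do not come from the paper.

```python
from typing import List


def controlnet_scale_schedule(
    num_steps: int,
    start_scale: float = 1.0,
    end_scale: float = 0.0,
    hold_fraction: float = 0.5,
) -> List[float]:
    """Return one ControlNet conditioning scale per denoising step.

    ASSUMPTION (not from the paper): the scale stays at `start_scale`
    for the first `hold_fraction` of the steps, then moves linearly to
    `end_scale`, which it reaches on the final step.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= hold_fraction <= 1.0:
        raise ValueError("hold_fraction must lie in [0, 1]")

    hold_steps = int(round(num_steps * hold_fraction))
    decay_steps = num_steps - hold_steps

    # Early steps: full-strength control while coarse layout and motion form.
    scales = [start_scale] * hold_steps

    # Later steps: linearly relax the control signal.
    for i in range(decay_steps):
        t = (i + 1) / decay_steps
        scales.append(start_scale + t * (end_scale - start_scale))
    return scales


if __name__ == "__main__":
    for step, scale in enumerate(controlnet_scale_schedule(4)):
        print(step, scale)
```

In a sampling loop, the value for step `i` would multiply the ControlNet branch's contribution at that step. A larger `hold_fraction` favors adherence to the motion condition, while a smaller one gives the base model more freedom in the later steps; where the right trade-off lies is an empirical question this sketch does not answer.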
