EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

Abstract

Recent approaches to 3D camera control in video diffusion models (VDMs) often create anchor videos that guide the diffusion model as a structured prior, rendering them from estimated point clouds along annotated camera trajectories. However, errors inherent in point cloud estimation often lead to inaccurate anchor videos. Moreover, the need for extensive camera trajectory annotations further increases resource demands. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that automatically constructs high-quality anchor videos without expensive camera trajectory annotations. Concretely, we create highly precise anchor videos for training by masking source videos based on first-frame visibility. This approach ensures high alignment and eliminates the need for camera trajectory annotations, so it can be readily applied to any in-the-wild video to generate image-to-video (I2V) training pairs. Furthermore, we introduce Anchor-ControlNet, a lightweight conditioning module that integrates anchor video guidance in visible regions into pretrained VDMs, using less than 1% of the backbone model's parameters. By combining the proposed anchor video data and ControlNet module, EPiC achieves efficient training with substantially fewer parameters, fewer training steps, and less data, without the modifications to the diffusion model backbone that are typically needed to mitigate rendering misalignments. Although trained on masking-based anchor videos, our method generalizes robustly at inference time to anchor videos made with point clouds, enabling precise 3D-informed camera control. EPiC achieves state-of-the-art (SOTA) performance on RealEstate10K and MiraData for the I2V camera control task, demonstrating precise and robust camera control ability both quantitatively and qualitatively. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video scenarios.
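The masking step described in the abstract can be illustrated with a minimal sketch. It assumes a per-frame boolean visibility map is already available, marking pixels whose content is also visible in the first frame; how that map is estimated is not specified in the abstract, so it is treated here as a given input. The names `build_anchor_video`, `make_training_pair`, `visibility`, and `fill_value` are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np


def build_anchor_video(source_video, visibility, fill_value=0.0):
    """Keep only the pixels marked as visible from the first frame.

    source_video: float array of shape (T, H, W, C).
    visibility:   bool array of shape (T, H, W); True where the pixel shows
                  content that is also visible in frame 0 (assumed input).
    fill_value:   value written into masked-out regions (an assumption).
    """
    if source_video.shape[:3] != visibility.shape:
        raise ValueError("visibility must match the video's (T, H, W) shape")
    # Add a channel axis so the mask broadcasts over the colour channels.
    mask = visibility[..., None]
    return np.where(mask, source_video, fill_value)


def make_training_pair(source_video, visibility):
    """Assemble an I2V training example from a single source video."""
    return {
        "first_frame": source_video[0],
        "anchor_video": build_anchor_video(source_video, visibility),
        "target_video": source_video,
    }


# Toy example: 3 frames of 2x2 RGB, frame t filled with the value t + 1.
video = np.stack([np.full((2, 2, 3), t + 1, dtype=np.float32) for t in range(3)])
vis = np.zeros((3, 2, 2), dtype=bool)
vis[0] = True        # the first frame is fully visible by definition
vis[1, :, 0] = True  # in frame 1 only the left column is still visible
pair = make_training_pair(video, vis)
```

Because the anchor is cut directly from the target video, every visible pixel agrees with the target exactly, which is the alignment property the abstract emphasizes, and no camera trajectory enters the construction.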
