Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

Abstract

Recent years have seen remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of the generated frames remain unsolved. Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduce Cavia, a novel framework for camera-controllable, multi-view video generation that converts an input image into multiple spatiotemporally consistent videos. Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency. This flexible design allows joint training on diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos. To the best of our knowledge, Cavia is the first framework of its kind that lets the user precisely specify camera motion while also obtaining object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in geometric consistency and perceptual quality. Project Page: https://ir1d.github.io/Cavia/

