Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention
October 14, 2024
Authors: Dejia Xu, Yifan Jiang, Chen Huang, Liangchen Song, Thorsten Gernoth, Liangliang Cao, Zhangyang Wang, Hao Tang
cs.AI
Abstract
In recent years, there have been remarkable breakthroughs in image-to-video
generation. However, the 3D consistency and camera controllability of generated
frames have remained unsolved. Recent studies have attempted to incorporate
camera control into the generation process, but their results are often limited
to simple trajectories or lack the ability to generate consistent videos from
multiple distinct camera paths for the same scene. To address these
limitations, we introduce Cavia, a novel framework for camera-controllable,
multi-view video generation, capable of converting an input image into multiple
spatiotemporally consistent videos. Our framework extends the spatial and
temporal attention modules into view-integrated attention modules, improving
both viewpoint and temporal consistency. This flexible design allows for joint
training with diverse curated data sources, including scene-level static
videos, object-level synthetic multi-view dynamic videos, and real-world
monocular dynamic videos. To the best of our knowledge, Cavia is the first
framework that allows users to precisely specify camera motion while obtaining
object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art
methods in terms of geometric consistency and perceptual quality. Project page:
https://ir1d.github.io/Cavia/
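The abstract describes extending spatial attention into "view-integrated" attention so that tokens from multiple camera views attend to one another. The paper's actual architecture is not given here, so the following is only a minimal numpy sketch of one plausible reading of that idea: for each frame, the view axis is folded into the token axis before self-attention, so spatial attention spans all views of that frame rather than a single view. All function names and the tensor layout `(views, frames, tokens, dim)` are illustrative assumptions, not Cavia's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention; q, k, v: (tokens, dim).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def view_integrated_spatial_attention(x):
    """x: (views, frames, tokens, dim).

    Plain spatial attention would attend within each (view, frame)
    slice independently. Here the view axis is merged into the token
    axis, so tokens from all views of the same frame attend jointly --
    one hypothetical way to couple the views for cross-view consistency.
    """
    V, F, N, D = x.shape
    out = np.empty_like(x)
    for f in range(F):
        tokens = x[:, f].reshape(V * N, D)      # merge views into one sequence
        attended = attention(tokens, tokens, tokens)
        out[:, f] = attended.reshape(V, N, D)   # split back into per-view tokens
    return out

x = np.random.default_rng(0).normal(size=(4, 2, 8, 16))
y = view_integrated_spatial_attention(x)
print(y.shape)  # (4, 2, 8, 16)
```

The analogous temporal variant would merge the view axis into the frame axis instead, letting each spatial location attend across both time and views; the abstract does not specify which factorization Cavia uses.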