
Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

October 14, 2024
作者: Dejia Xu, Yifan Jiang, Chen Huang, Liangchen Song, Thorsten Gernoth, Liangliang Cao, Zhangyang Wang, Hao Tang
cs.AI

Abstract

In recent years, there have been remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of generated frames have remained unsolved. Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduce Cavia, a novel framework for camera-controllable, multi-view video generation, capable of converting an input image into multiple spatiotemporally consistent videos. Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency. This flexible design allows for joint training with diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos. To the best of our knowledge, Cavia is the first of its kind that allows the user to precisely specify camera motion while obtaining object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in terms of geometric consistency and perceptual quality. Project Page: https://ir1d.github.io/Cavia/
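
The abstract's key architectural idea is extending spatial and temporal attention so that tokens also attend across camera views. The snippet below is a minimal sketch of what such a view-integrated attention block could look like; the module name, tensor layout, and dimensions are illustrative assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch only: Cavia extends spatial/temporal attention to attend
# across camera views; the exact layout and module below are assumptions.
import torch
import torch.nn as nn


class ViewIntegratedAttention(nn.Module):
    """Self-attention over tokens pooled from all camera views of one frame.

    Standard spatial attention in a video diffusion backbone attends within a
    single frame; here the view axis is folded into the token axis so that
    corresponding frames from different camera paths share one attention pass,
    encouraging cross-view consistency.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, frames, tokens, dim)
        b, v, f, n, d = x.shape
        # Merge views into the sequence so attention spans all views of a frame.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, v * n, d)
        out, _ = self.attn(x, x, x)
        # Restore the original (batch, views, frames, tokens, dim) layout.
        return out.reshape(b, f, v, n, d).permute(0, 2, 1, 3, 4)


if __name__ == "__main__":
    # Toy shapes: 1 sample, 3 camera views, 4 frames, 16 tokens, 64 channels.
    x = torch.randn(1, 3, 4, 16, 64)
    print(ViewIntegratedAttention(dim=64)(x).shape)  # (1, 3, 4, 16, 64)
```

An analogous reshaping over the frame axis would give the temporal counterpart, so both attention types operate jointly over views, which is the property the abstract attributes to improved viewpoint and temporal consistency.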
