EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance
May 28, 2025
Authors: Zun Wang, Jaemin Cho, Jialu Li, Han Lin, Jaehong Yoon, Yue Zhang, Mohit Bansal
cs.AI
Abstract
Recent approaches to 3D camera control in video diffusion models (VDMs) often
create anchor videos to guide diffusion models as a structured prior by
rendering from estimated point clouds following annotated camera trajectories.
However, errors inherent in point cloud estimation often lead to inaccurate
anchor videos. Moreover, the requirement for extensive camera trajectory
annotations further increases resource demands. To address these limitations,
we introduce EPiC, an efficient and precise camera control learning framework
that automatically constructs high-quality anchor videos without expensive
camera trajectory annotations. Concretely, we create highly precise anchor
videos for training by masking source videos based on first-frame visibility.
This approach ensures high alignment, eliminates the need for camera trajectory
annotations, and thus can be readily applied to any in-the-wild video to
generate image-to-video (I2V) training pairs. Furthermore, we introduce
Anchor-ControlNet, a lightweight conditioning module that integrates anchor
video guidance in visible regions to pretrained VDMs, with less than 1% of
backbone model parameters. By combining the proposed anchor video data and
ControlNet module, EPiC achieves efficient training with substantially fewer
parameters, training steps, and less data, without requiring modifications to
the diffusion model backbone typically needed to mitigate rendering
misalignments. Although trained on masking-based anchor videos, our
method generalizes robustly to anchor videos made with point clouds during
inference, enabling precise 3D-informed camera control. EPiC achieves SOTA
performance on RealEstate10K and MiraData for the I2V camera control task,
demonstrating precise and robust camera control ability both quantitatively and
qualitatively. Notably, EPiC also exhibits strong zero-shot generalization to
video-to-video scenarios.
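The core data-construction idea, masking a source video down to the regions visible in the first frame, can be illustrated with a minimal sketch. This assumes per-frame boolean visibility masks are already available (e.g., from a point tracker or flow-based occlusion reasoning); the function name and inputs are illustrative, not taken from the paper.

```python
import numpy as np

def make_anchor_video(video: np.ndarray, visibility_masks: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of masking-based anchor-video construction.

    video: (T, H, W, 3) uint8 source frames.
    visibility_masks: (T, H, W) bool, True where content from the first
        frame remains visible in frame t (how these masks are computed is
        the paper's first-frame-visibility step, not shown here).

    Returns a copy of the video with all non-visible pixels zeroed out,
    yielding an anchor video that is exactly aligned with the source in
    visible regions and empty elsewhere.
    """
    anchor = video.copy()
    anchor[~visibility_masks] = 0  # blank out occluded / out-of-view regions
    return anchor
```

Because the anchor is derived from the source video itself rather than from a rendered point cloud, the visible regions match the target pixels exactly, which is what removes the need for trajectory annotations during training.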