
EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

May 28, 2025
Authors: Zun Wang, Jaemin Cho, Jialu Li, Han Lin, Jaehong Yoon, Yue Zhang, Mohit Bansal
cs.AI

Abstract
Recent approaches to 3D camera control in video diffusion models (VDMs) often create anchor videos to guide diffusion models as a structured prior by rendering from estimated point clouds along annotated camera trajectories. However, errors inherent in point cloud estimation often lead to inaccurate anchor videos, and the requirement for extensive camera trajectory annotations further increases resource demands. To address these limitations, we introduce EPiC, an efficient and precise camera-control learning framework that automatically constructs high-quality anchor videos without expensive camera trajectory annotations. Concretely, we create highly precise anchor videos for training by masking source videos based on first-frame visibility. This approach ensures high alignment, eliminates the need for camera trajectory annotations, and can therefore be readily applied to any in-the-wild video to generate image-to-video (I2V) training pairs. Furthermore, we introduce Anchor-ControlNet, a lightweight conditioning module that integrates anchor-video guidance in visible regions into pretrained VDMs, using fewer than 1% of the backbone model's parameters. By combining the proposed anchor-video data and ControlNet module, EPiC achieves efficient training with substantially fewer parameters, training steps, and data, without requiring the modifications to the diffusion model backbone typically needed to mitigate rendering misalignments. Although trained on masking-based anchor videos, our method generalizes robustly at inference time to anchor videos made with point clouds, enabling precise 3D-informed camera control. EPiC achieves state-of-the-art performance on RealEstate10K and MiraData for the I2V camera control task, demonstrating precise and robust camera control both quantitatively and qualitatively. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video scenarios.
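The core data-construction idea above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract does not specify how the per-frame first-frame-visibility map is computed (e.g. via point tracking or depth warping), so the `visibility` argument here is a hypothetical placeholder, and `make_anchor_video` is an assumed helper name.

```python
import numpy as np

def make_anchor_video(video: np.ndarray, visibility: np.ndarray) -> np.ndarray:
    """Build an anchor video by masking a source video with first-frame visibility.

    video:      float array of shape (T, H, W, 3), the source clip.
    visibility: bool array of shape (T, H, W), where visibility[t, y, x] is True
                when the pixel at (y, x) in frame t corresponds to content that
                was visible in frame 0 (how this map is estimated is left open).
    Returns the clip with regions unseen in the first frame zeroed out, so a
    conditioning module only receives guidance in reliably visible areas.
    """
    # Broadcast the (T, H, W) mask over the channel axis and zero masked pixels.
    return video * visibility[..., None].astype(video.dtype)

# Toy example: a 2-frame, 2x2 all-ones clip where frame 1 has partial visibility.
video = np.ones((2, 2, 2, 3), dtype=np.float32)
visibility = np.array([[[True, True], [True, True]],
                       [[True, False], [False, True]]])
anchor = make_anchor_video(video, visibility)
```

In this toy case, frame 0 is kept intact while the two pixels of frame 1 that were not visible in frame 0 are zeroed, mirroring how the masked anchor video supplies structure only where the source footage can vouch for it.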
