CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers

May 21, 2024
Authors: Andrew Marmon, Grant Schindler, José Lezama, Dan Kondratyuk, Bryan Seybold, Irfan Essa
cs.AI

Abstract

We extend multimodal transformers to include 3D camera motion as a conditioning signal for the task of video generation. Generative video models are becoming increasingly powerful, thus focusing research efforts on methods of controlling the output of such models. We propose to add virtual 3D camera controls to generative video methods by conditioning generated video on an encoding of three-dimensional camera movement over the course of the generated video. Results demonstrate that (1) we are able to successfully control the camera during video generation, starting from a single frame and a camera signal, and (2) the generated 3D camera paths are accurate, as verified using traditional computer vision methods.