CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers

Abstract

We extend multimodal transformers to include 3D camera motion as a conditioning signal for the task of video generation. Generative video models are becoming increasingly powerful, which has focused research effort on methods for controlling their output. We propose adding virtual 3D camera controls to generative video methods by conditioning the generated video on an encoding of the three-dimensional camera movement over the course of the video. Our results show (1) that the camera can be successfully controlled during video generation, starting from a single frame and a camera signal, and (2) that the generated 3D camera paths are accurate, as measured with traditional computer vision methods.

