ActAnywhere: Subject-Aware Video Background Generation

Abstract

Generating video backgrounds tailored to foreground subject motion is an important problem for the movie industry and the visual effects community. The task involves synthesizing a background that aligns with the motion and appearance of the foreground subject while also complying with the artist's creative intent. We introduce ActAnywhere, a generative model that automates this process, which traditionally requires tedious manual effort. Our model leverages the power of large-scale video diffusion models and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentations as input and an image describing the desired scene as the condition, and produces a coherent video with realistic foreground-background interactions that adheres to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations show that our model significantly outperforms baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects. Please visit our project webpage at https://actanywhere.github.io.

