Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

Abstract

We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos must be aligned with the input audio both globally and temporally: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with the corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it enables video generation conditioned on text, on audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method extensively on three datasets that exhibit significant semantic diversity of audio-video samples, and we further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, with respect to both content and the temporal axis. We also show that videos produced by our method have higher visual quality and are more diverse.
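The abstract says only that AV-Align detects and compares energy peaks in the two modalities. As a rough illustration of that idea, and not the authors' implementation, the sketch below assumes both signals have already been reduced to one energy value per video frame. The function names, the threshold rule, the one-frame tolerance, and the symmetric matching score are all hypothetical choices made for this example.

```python
from typing import List, Sequence


def find_peaks(energy: Sequence[float], threshold: float) -> List[int]:
    """Return indices of local maxima whose value is at least `threshold`.

    Assumption: a peak is a sample strictly greater than its left neighbour
    and greater than or equal to its right neighbour.
    """
    peaks = []
    for i in range(1, len(energy) - 1):
        is_local_max = energy[i] > energy[i - 1] and energy[i] >= energy[i + 1]
        if is_local_max and energy[i] >= threshold:
            peaks.append(i)
    return peaks


def count_matched(src: Sequence[int], dst: Sequence[int], tolerance: int) -> int:
    """Count peaks in `src` that have a peak in `dst` within `tolerance` frames."""
    return sum(1 for p in src if any(abs(p - q) <= tolerance for q in dst))


def peak_alignment_score(
    audio_peaks: Sequence[int], video_peaks: Sequence[int], tolerance: int = 1
) -> float:
    """Hypothetical symmetric score in [0, 1].

    It is the fraction of all peaks (audio and video) that have a counterpart
    in the other modality within `tolerance` frames. Returns 0.0 when neither
    signal has any peak.
    """
    total = len(audio_peaks) + len(video_peaks)
    if total == 0:
        return 0.0
    hits = count_matched(audio_peaks, video_peaks, tolerance)
    hits += count_matched(video_peaks, audio_peaks, tolerance)
    return hits / total


# Toy per-frame energies (made-up numbers, for illustration only).
audio_energy = [0.0, 0.9, 0.1, 0.0, 0.8, 0.1, 0.0, 0.0]
video_energy = [0.0, 0.2, 0.7, 0.1, 0.0, 0.0, 0.6, 0.0]

audio_peaks = find_peaks(audio_energy, threshold=0.5)
video_peaks = find_peaks(video_energy, threshold=0.5)
score = peak_alignment_score(audio_peaks, video_peaks, tolerance=1)
```

In this toy input, the first audio peak has a video peak one frame later, while the second audio peak and the second video peak are two frames apart and so stay unmatched under a one-frame tolerance. A real pipeline would first have to derive the per-frame energies, for example from audio onsets and from frame-to-frame motion; that step is outside this sketch.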
