SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation

Abstract

Audio-driven video generation aims to synthesize realistic videos that align with input audio recordings, akin to the human ability to visualize scenes from auditory input. However, existing approaches focus predominantly on semantic information, such as the classes of sounding sources present in the audio, which limits their ability to generate videos with accurate content and spatial composition. In contrast, humans can not only identify the semantic categories of sounding sources but also determine their deeply encoded spatial attributes, including locations and movement directions. This information can be recovered by considering specific spatial indicators derived from the inherent physical properties of sound, such as loudness or frequency. As prior methods largely ignore this factor, we present SpA2V, the first framework that explicitly exploits these spatial auditory cues from audio to generate videos with high semantic and spatial correspondence. SpA2V decomposes the generation process into two stages: 1) Audio-guided Video Planning: We meticulously adapt a state-of-the-art MLLM to the novel task of harnessing spatial and semantic cues from input audio to construct Video Scene Layouts (VSLs), which serve as an intermediate representation that bridges the gap between the audio and video modalities. 2) Layout-grounded Video Generation: We develop an efficient and effective approach that integrates VSLs as conditional guidance into pre-trained diffusion models, enabling VSL-grounded video generation in a training-free manner. Extensive experiments demonstrate that SpA2V excels at generating realistic videos that are semantically and spatially aligned with the input audio.
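To make the idea of a loudness-based spatial cue concrete, the following is a minimal sketch and not the SpA2V method itself, which relies on an MLLM for planning. It assumes stereo audio given as two lists of float samples, estimates a per-window left/right pan from the inter-channel level difference, and maps each pan value to one bounding box per keyframe as a toy stand-in for a layout. All function names, the frame size, and the box size are hypothetical choices made for illustration.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def pan_track(left, right, window):
    """Per-window pan in [-1, 1]: -1 is fully left, +1 is fully right."""
    pans = []
    usable = min(len(left), len(right))
    for start in range(0, usable - window + 1, window):
        l = rms(left[start:start + window])
        r = rms(right[start:start + window])
        total = l + r
        pans.append(0.0 if total == 0 else (r - l) / total)
    return pans


def toy_layout(pans, frame_width=512, frame_height=320, box=(128, 128)):
    """Map each pan value to one box (x0, y0, x1, y1), vertically centered.

    Hypothetical stand-in for a layout: one box per keyframe, one source.
    """
    bw, bh = box
    y0 = round(frame_height / 2 - bh / 2)
    layout = []
    for pan in pans:
        # pan = -1 puts the box at the left edge, pan = +1 at the right edge.
        x0 = round((pan + 1) / 2 * (frame_width - bw))
        layout.append((x0, y0, x0 + bw, y0 + bh))
    return layout


def movement_direction(pans, tol=0.05):
    """Coarse direction label from the first and last pan values."""
    if len(pans) < 2:
        return "static"
    delta = pans[-1] - pans[0]
    if delta > tol:
        return "left-to-right"
    if delta < -tol:
        return "right-to-left"
    return "static"


# Synthetic example: a 440 Hz tone that fades from the left channel
# to the right channel over 4000 samples (assumed 8 kHz sample rate).
n, window = 4000, 1000
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(n)]
left = [(1 - t / n) * s for t, s in enumerate(tone)]
right = [(t / n) * s for t, s in enumerate(tone)]

pans = pan_track(left, right, window)
layout = toy_layout(pans)
direction = movement_direction(pans)
print(direction, layout)
```

A level-difference heuristic like this only captures horizontal position for a single dominant source. It says nothing about source category, depth, or multiple overlapping sources, which is part of why an intermediate planning stage over richer cues is needed.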
