SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation
August 1, 2025
Authors: Kien T. Pham, Yingqing He, Yazhou Xing, Qifeng Chen, Long Chen
cs.AI
Abstract
Audio-driven video generation aims to synthesize realistic videos that align
with input audio recordings, akin to the human ability to visualize scenes from
auditory input. However, existing approaches predominantly focus on exploring
semantic information, such as the classes of sounding sources present in the
audio, limiting their ability to generate videos with accurate content and
spatial composition. In contrast, humans can not only naturally identify the
semantic categories of sounding sources but also infer their spatial
attributes, including locations and movement directions, which are deeply
encoded in the sound itself. This information can be recovered from specific
spatial indicators derived from the inherent physical properties of sound,
such as loudness or frequency. As prior methods largely ignore this factor, we
present SpA2V, the first framework that explicitly exploits these spatial
auditory cues in audio to generate videos with high semantic and spatial
correspondence. SpA2V decomposes
the generation process into two stages: 1) Audio-guided Video Planning: We
meticulously adapt a state-of-the-art multimodal large language model (MLLM)
for a novel task of harnessing
spatial and semantic cues from input audio to construct Video Scene Layouts
(VSLs). These serve as an intermediate representation bridging the gap between
the audio and video modalities. 2) Layout-grounded Video Generation: We develop
an efficient and effective approach to seamlessly integrate VSLs as conditional
guidance into pre-trained diffusion models, enabling VSL-grounded video
generation in a training-free manner. Extensive experiments demonstrate that
SpA2V excels in generating realistic videos with semantic and spatial alignment
to the input audio.
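
The abstract's claim that spatial cues can be derived from loudness can be made concrete. Below is a minimal sketch, assuming stereo input, that maps the inter-channel level difference (the loudness imbalance between the left and right channels) to a coarse horizontal source position per time window; the function names, window length, and tanh mapping are illustrative assumptions, not SpA2V's actual method.

```python
import numpy as np

def rms(x: np.ndarray) -> float:
    """Root-mean-square loudness of a signal window."""
    return float(np.sqrt(np.mean(x ** 2) + 1e-12))

def coarse_positions(left: np.ndarray, right: np.ndarray,
                     sr: int, win_s: float = 0.5) -> list:
    """Coarse horizontal source position in [-1, 1] per window
    (-1 = far left, +1 = far right), read off the inter-channel
    loudness difference in decibels."""
    win = int(sr * win_s)
    n = min(len(left), len(right))
    positions = []
    for start in range(0, n - win + 1, win):
        l_db = 20 * np.log10(rms(left[start:start + win]))
        r_db = 20 * np.log10(rms(right[start:start + win]))
        ild = r_db - l_db               # > 0 means louder on the right
        positions.append(float(np.tanh(ild / 10.0)))
    return positions
```

Tracked over time, these positions also expose movement direction (steadily increasing values suggest a source moving left to right), which is exactly the kind of cue a video scene layout can encode.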
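The abstract likewise leaves the Video Scene Layout (VSL) schema unspecified. One plausible reading is a sequence of labeled, per-keyframe bounding boxes that stage 1 emits and stage 2 consumes as conditional guidance; the dataclass and JSON serialization below are hypothetical illustrations, not the paper's definition.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class LayoutEntry:
    """One sounding object in one keyframe of a hypothetical VSL:
    a semantic label plus a normalized bounding box (x0, y0, x1, y1)."""
    frame: int
    label: str                                # semantic cue, e.g. "dog"
    bbox: tuple                               # spatial cue, values in [0, 1]

def vsl_to_prompt(entries: list) -> str:
    """Serialize a VSL so it can condition a layout-grounded
    video generator as a structured text prompt."""
    return json.dumps([asdict(e) for e in entries])

# A dog barking on the left and moving rightward across the scene.
vsl = [LayoutEntry(0,  "dog", (0.05, 0.55, 0.30, 0.95)),
       LayoutEntry(8,  "dog", (0.35, 0.55, 0.60, 0.95)),
       LayoutEntry(16, "dog", (0.65, 0.55, 0.90, 0.95))]
print(vsl_to_prompt(vsl))
```

In stage 2, such boxes would steer a pre-trained video diffusion model without fine-tuning, for instance by restricting the cross-attention of each label's tokens to its box region; the abstract only states that the integration is training-free, not how it is implemented.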