

SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation

August 1, 2025
Authors: Kien T. Pham, Yingqing He, Yazhou Xing, Qifeng Chen, Long Chen
cs.AI

Abstract

Audio-driven video generation aims to synthesize realistic videos that align with input audio recordings, akin to the human ability to visualize scenes from auditory input. However, existing approaches predominantly focus on exploring semantic information, such as the classes of sounding sources present in the audio, limiting their ability to generate videos with accurate content and spatial composition. In contrast, we humans can not only naturally identify the semantic categories of sounding sources but also determine their deeply encoded spatial attributes, including locations and movement directions. This useful information can be elucidated by considering specific spatial indicators derived from the inherent physical properties of sound, such as loudness or frequency. As prior methods largely ignore this factor, we present SpA2V, the first framework that explicitly exploits these spatial auditory cues from audio to generate videos with high semantic and spatial correspondence. SpA2V decomposes the generation process into two stages: 1) Audio-guided Video Planning: We meticulously adapt a state-of-the-art multimodal large language model (MLLM) for the novel task of harnessing spatial and semantic cues from input audio to construct Video Scene Layouts (VSLs). These serve as an intermediate representation that bridges the gap between the audio and video modalities. 2) Layout-grounded Video Generation: We develop an efficient and effective approach that seamlessly integrates VSLs as conditional guidance into pre-trained diffusion models, enabling VSL-grounded video generation in a training-free manner. Extensive experiments demonstrate that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audio.
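The abstract does not include code, so as a rough illustration of the kind of spatial auditory cue it refers to (loudness-based location hints), the sketch below estimates a coarse horizontal source position from the interaural level difference (ILD) of a stereo clip. The function name, the 6 dB scaling, and the [-1, 1] position convention are illustrative assumptions, not SpA2V's actual pipeline.

```python
# Minimal sketch (not the authors' code): the left/right loudness
# difference in a stereo recording hints at a source's horizontal
# position, one of the spatial indicators the abstract mentions.
import numpy as np

def estimate_azimuth_from_ild(stereo: np.ndarray, eps: float = 1e-8) -> float:
    """Map the interaural level difference of a (2, T) stereo clip to a
    coarse horizontal position in [-1, 1] (-1 = left, +1 = right)."""
    left, right = stereo[0], stereo[1]
    rms_l = np.sqrt(np.mean(left ** 2) + eps)
    rms_r = np.sqrt(np.mean(right ** 2) + eps)
    ild_db = 20.0 * np.log10(rms_r / rms_l)  # > 0 when the right channel is louder
    # Squash to [-1, 1]; a ~6 dB difference maps to roughly +/-0.76.
    # The 6 dB scale is an arbitrary illustrative choice.
    return float(np.tanh(ild_db / 6.0))

# Example: a tone that is twice as loud on the right channel.
t = np.linspace(0, 1, 16000)
tone = np.sin(2 * np.pi * 440 * t)
clip = np.stack([0.5 * tone, 1.0 * tone])  # (2, T): quiet left, loud right
print(estimate_azimuth_from_ild(clip))    # positive -> source on the right
```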
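Similarly, the abstract does not spell out the VSL schema. A plausible minimal form, assumed here purely for illustration, is a per-frame set of labeled bounding boxes for each sounding object, which a layout-grounded diffusion model could then consume as conditioning; all field names below are hypothetical.

```python
# Hypothetical sketch of a Video Scene Layout (VSL): an intermediate
# representation bridging audio and video. The exact schema used by
# SpA2V is not given on this page; this is an assumed minimal form.
from dataclasses import dataclass, field

@dataclass
class ObjectTrack:
    label: str  # semantic class of the sounding source, e.g. "dog"
    # One (x0, y0, x1, y1) box per frame, normalized to [0, 1].
    boxes: list[tuple[float, float, float, float]] = field(default_factory=list)

@dataclass
class VideoSceneLayout:
    num_frames: int
    objects: list[ObjectTrack] = field(default_factory=list)

# A dog heard moving left-to-right: boxes drift rightward over 4 frames.
dog = ObjectTrack(
    label="dog",
    boxes=[(0.05 + 0.2 * i, 0.5, 0.25 + 0.2 * i, 0.8) for i in range(4)],
)
vsl = VideoSceneLayout(num_frames=4, objects=[dog])
```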