AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis
June 13, 2024
Authors: Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu
cs.AI
Abstract
Novel view acoustic synthesis (NVAS) aims to render binaural audio at any
target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual
cues as a condition for synthesizing binaural audio. However, in addition to
the low efficiency caused by heavy NeRF rendering, these methods all have a
limited ability to characterize the entire scene environment, such as room
geometry, material properties, and the spatial relation between the listener
and the sound source. To address these issues, we propose a novel Audio-Visual
Gaussian Splatting (AV-GS) model. To obtain a material-aware and geometry-aware
condition for audio synthesis, we learn an explicit point-based scene
representation with an audio-guidance parameter on locally initialized Gaussian
points, taking into account the spatial relation between the listener and the
sound source. To make the visual scene model audio-adaptive, we propose a point
densification and pruning strategy that optimally distributes Gaussian points
according to each point's contribution to sound propagation (e.g., more points
are needed for texture-less wall surfaces, as they affect sound path
diversion). Extensive experiments validate the superiority of our AV-GS over
existing alternatives on the real-world RWAVS and simulation-based SoundSpaces
datasets.
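The abstract describes two ideas that can be made concrete: a per-point audio-guidance parameter pooled according to the listener/source geometry, and a densification-and-pruning criterion driven by each point's contribution to sound propagation. The sketch below is a minimal PyTorch illustration of what such components could look like; the names (AudioGaussians, audio_condition, audio_prune_mask), the distance-softmax pooling, and the gradient-based contribution proxy are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal, hypothetical sketch of the two ideas in the abstract.
# All names and the pooling/pruning schemes are illustrative assumptions,
# not the AV-GS authors' actual implementation.
import torch


class AudioGaussians(torch.nn.Module):
    """Explicit point-based scene representation: each Gaussian point
    carries a learnable audio-guidance vector alongside its geometry."""

    def __init__(self, num_points: int, guidance_dim: int = 8):
        super().__init__()
        self.xyz = torch.nn.Parameter(torch.randn(num_points, 3))  # point centers
        # Per-point audio-guidance parameter, meant to encode the material
        # and geometry cues that matter for sound propagation.
        self.audio_guidance = torch.nn.Parameter(torch.zeros(num_points, guidance_dim))

    def audio_condition(self, listener: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        """Pool per-point guidance into one condition vector, weighting each
        point by its spatial relation to the listener and the sound source
        (here: a softmax over negative summed distances -- an assumption)."""
        d_listener = torch.cdist(self.xyz, listener[None]).squeeze(1)  # (N,)
        d_source = torch.cdist(self.xyz, source[None]).squeeze(1)      # (N,)
        w = torch.softmax(-(d_listener + d_source), dim=0)             # nearer points weigh more
        return (w[:, None] * self.audio_guidance).sum(dim=0)           # (guidance_dim,)


def audio_prune_mask(model: AudioGaussians, threshold: float = 1e-4) -> torch.Tensor:
    """Boolean keep-mask for densification/pruning: points whose
    audio-guidance gradients are large contribute to sound propagation
    (e.g., bare walls that divert sound paths) and are kept or densified;
    the rest become pruning candidates."""
    grad = model.audio_guidance.grad
    if grad is None:  # no backward pass has run yet; keep everything
        return torch.ones(model.xyz.shape[0], dtype=torch.bool)
    return grad.abs().mean(dim=1) > threshold
```

In use, the pooled condition vector would feed a binaural audio renderer for a given listener pose, and the mask would steer where Gaussian points are added or removed during optimization so that the point distribution adapts to acoustics rather than to visual detail alone.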