AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis
June 13, 2024
Authors: Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu
cs.AI
Abstract
Novel view acoustic synthesis (NVAS) aims to render binaural audio at any
target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual
cues as a condition for synthesizing binaural audio. However, in addition to
the low efficiency caused by heavy NeRF rendering, these methods all have a
limited ability to characterize the entire scene environment, such as room
geometry, material properties, and the spatial relation between the listener
and the sound source. To address these issues, we propose a novel Audio-Visual
Gaussian Splatting (AV-GS) model. To obtain a material-aware and geometry-aware
condition for audio synthesis, we learn an explicit point-based scene
representation with an audio-guidance parameter on locally initialized Gaussian
points, taking into account the spatial relation between the listener and the
sound source. To make the visual scene model audio-adaptive, we propose a point
densification and pruning strategy that optimally distributes Gaussian points
according to each point's contribution to sound propagation (e.g., more points
are needed for texture-less wall surfaces, as they affect sound path
diversion). Extensive experiments validate the superiority of our AV-GS over
existing alternatives on the real-world RWAVS and simulation-based SoundSpaces
datasets.
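The abstract describes two ideas that can be made concrete: a per-point audio-guidance parameter pooled according to the listener/source geometry, and a densification-and-pruning criterion driven by each point's contribution to sound propagation. The sketch below is a minimal PyTorch illustration of what such components could look like; the names (AudioGaussians, audio_condition, audio_prune_mask), the distance-softmax pooling, and the gradient-based contribution proxy are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal, hypothetical sketch of the two ideas in the abstract.
# All names and the pooling/pruning schemes are illustrative assumptions,
# not the AV-GS authors' actual implementation.
import torch


class AudioGaussians(torch.nn.Module):
    """Explicit point-based scene representation: each Gaussian point
    carries a learnable audio-guidance vector alongside its geometry."""

    def __init__(self, num_points: int, guidance_dim: int = 8):
        super().__init__()
        self.xyz = torch.nn.Parameter(torch.randn(num_points, 3))  # point centers
        # Per-point audio-guidance parameter, meant to encode the material
        # and geometry cues that matter for sound propagation.
        self.audio_guidance = torch.nn.Parameter(torch.zeros(num_points, guidance_dim))

    def audio_condition(self, listener: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        """Pool per-point guidance into one condition vector, weighting each
        point by its spatial relation to the listener and the sound source
        (here: a softmax over negative summed distances -- an assumption)."""
        d_listener = torch.cdist(self.xyz, listener[None]).squeeze(1)  # (N,)
        d_source = torch.cdist(self.xyz, source[None]).squeeze(1)      # (N,)
        w = torch.softmax(-(d_listener + d_source), dim=0)             # nearer points weigh more
        return (w[:, None] * self.audio_guidance).sum(dim=0)           # (guidance_dim,)


def audio_prune_mask(model: AudioGaussians, threshold: float = 1e-4) -> torch.Tensor:
    """Boolean keep-mask for densification/pruning: points whose
    audio-guidance gradients are large contribute to sound propagation
    (e.g., bare walls that divert sound paths) and are kept or densified;
    the rest become pruning candidates."""
    grad = model.audio_guidance.grad
    if grad is None:  # no backward pass has run yet; keep everything
        return torch.ones(model.xyz.shape[0], dtype=torch.bool)
    return grad.abs().mean(dim=1) > threshold
```

In use, the pooled condition vector would feed a binaural audio renderer for a given listener pose, and the mask would steer where Gaussian points are added or removed during optimization so that the point distribution adapts to acoustics rather than to visual detail alone.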