AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis
June 13, 2024
Authors: Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu
cs.AI
Abstract
Novel view acoustic synthesis (NVAS) aims to render binaural audio at any
target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual
cues as a condition for synthesizing binaural audio. However, in addition to
the low efficiency that stems from heavy NeRF rendering, these methods all
have a limited ability to characterize the entire scene environment, such as
room geometry, material properties, and the spatial relation between the
listener and the sound source. To address these issues, we propose a novel
Audio-Visual Gaussian Splatting (AV-GS) model. To obtain a material-aware and
geometry-aware condition for audio synthesis, we learn an explicit point-based
scene representation with an audio-guidance parameter on locally initialized
Gaussian points, taking into account the spatial relation between the listener
and the sound source. To make the visual scene model audio adaptive, we
propose a point densification and pruning strategy that optimally distributes
the Gaussian points according to each point's contribution to sound
propagation (e.g., more points are needed on texture-less wall surfaces
because they affect sound path diversion). Extensive experiments validate the
superiority of our AV-GS over existing alternatives on the real-world RWAS and
simulation-based SoundSpaces datasets.
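The abstract only outlines the method, but its core idea can be sketched: each Gaussian point carries a learnable audio-guidance parameter, the condition for binaural synthesis aggregates these parameters according to the listener/source spatial relation, and points are densified or pruned by their contribution to sound propagation. The PyTorch code below is a minimal illustration under assumed details, not the authors' implementation: the class and function names, the inverse-distance aggregation, the gradient-norm contribution proxy, and the 10%/90% quantile thresholds are all hypothetical stand-ins.

```python
# Minimal sketch (illustrative, not the AV-GS release code).
import torch
import torch.nn as nn


class AudioGuidedGaussians(nn.Module):
    """Gaussian points augmented with a per-point audio-guidance feature."""

    def __init__(self, num_points: int, guide_dim: int = 16):
        super().__init__()
        # 3D Gaussian centers; the usual 3DGS attributes (covariance,
        # opacity, color) are omitted to keep the sketch focused.
        self.xyz = nn.Parameter(torch.randn(num_points, 3))
        # Hypothetical learnable audio-guidance parameter per point.
        self.audio_guide = nn.Parameter(0.01 * torch.randn(num_points, guide_dim))

    def audio_condition(self, listener: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        """Aggregate per-point guidance into one condition vector, weighting
        each point by its proximity to the listener and the sound source
        (inverse-distance softmax is an illustrative choice)."""
        d_listener = torch.norm(self.xyz - listener, dim=-1)  # (N,)
        d_source = torch.norm(self.xyz - source, dim=-1)      # (N,)
        w = torch.softmax(-(d_listener + d_source), dim=0)    # (N,)
        return (w.unsqueeze(-1) * self.audio_guide).sum(dim=0)  # (guide_dim,)


@torch.no_grad()
def densify_and_prune(model: AudioGuidedGaussians, grad_norm: torch.Tensor) -> None:
    """Redistribute points by their contribution to sound propagation,
    proxied here by the audio-loss gradient norm at each center: clone
    the most influential points and drop the least influential ones."""
    keep = grad_norm > grad_norm.quantile(0.10)  # prune bottom 10%
    grow = grad_norm > grad_norm.quantile(0.90)  # densify top 10%
    jitter = 0.01 * torch.randn_like(model.xyz[grow])
    model.xyz = nn.Parameter(
        torch.cat([model.xyz[keep], model.xyz[grow] + jitter]))
    model.audio_guide = nn.Parameter(
        torch.cat([model.audio_guide[keep], model.audio_guide[grow]]))


if __name__ == "__main__":
    model = AudioGuidedGaussians(num_points=1024)
    cond = model.audio_condition(listener=torch.tensor([0.0, 0.0, 1.5]),
                                 source=torch.tensor([2.0, 1.0, 1.5]))
    loss = cond.pow(2).sum()  # stand-in for a real binaural-audio loss
    loss.backward()
    densify_and_prune(model, grad_norm=model.xyz.grad.norm(dim=-1))
    print(cond.shape, model.xyz.shape)
```

In this sketch the densify/prune rule mirrors the gradient-based heuristic of standard 3D Gaussian Splatting; in AV-GS the criterion is driven by the audio objective, so acoustically influential regions such as texture-less walls would accumulate more points even though they offer little visual texture.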