BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models
May 28, 2025
作者: Susan Liang, Dejan Markovic, Israel D. Gebru, Steven Krenn, Todd Keebler, Jacob Sandakly, Frank Yu, Samuel Hassel, Chenliang Xu, Alexander Richard
cs.AI
Abstract
Binaural rendering aims to synthesize binaural audio that mimics natural
hearing based on a mono audio signal and the locations of the speaker and listener.
Although many methods have been proposed to solve this problem, they struggle
with rendering quality and streamable inference. Synthesizing high-quality
binaural audio that is indistinguishable from real-world recordings requires
precise modeling of binaural cues, room reverb, and ambient sounds.
Additionally, real-world applications demand streaming inference. To address
these challenges, we propose a flow matching based streaming binaural speech
synthesis framework called BinauralFlow. We consider binaural rendering to be a
generation problem rather than a regression problem and design a conditional
flow matching model to render high-quality audio. Moreover, we design a causal
U-Net architecture that estimates the current audio frame solely based on past
information to tailor generative models for streaming inference. Finally, we
introduce a continuous inference pipeline incorporating streaming STFT/ISTFT
operations, a buffer bank, a midpoint solver, and an early skip schedule to
improve rendering continuity and speed. Quantitative and qualitative
evaluations demonstrate the superiority of our method over SOTA approaches. A
perceptual study further reveals that our model is nearly indistinguishable
from real-world recordings, with a 42% confusion rate.
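The abstract does not include implementation details, but the two core ideas it names, conditional flow matching training and midpoint-solver inference, follow a standard pattern. Below is a minimal sketch under assumed interfaces; the velocity network `model(x_t, t, cond)` and the conditioning tensor `cond` (mono-audio features plus speaker/listener poses) are hypothetical names, not taken from the paper.

```python
# Minimal sketch (not the paper's implementation): conditional flow matching
# training objective and a midpoint ODE solver for inference, assuming a
# generic velocity-prediction network `model(x_t, t, cond)`.
import torch


def flow_matching_loss(model, x1, cond):
    """Conditional flow matching loss on a batch of target binaural features
    `x1`, conditioned on `cond`. Uses the standard linear interpolation path
    x_t = (1 - t) * x0 + t * x1 with target velocity x1 - x0."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # per-example time in [0, 1]
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))      # broadcast over feature dims
    xt = (1.0 - t_b) * x0 + t_b * x1               # point on the probability path
    v_target = x1 - x0                             # velocity of the linear path
    v_pred = model(xt, t, cond)                    # predicted velocity field
    return torch.mean((v_pred - v_target) ** 2)


@torch.no_grad()
def midpoint_sample(model, cond, shape, num_steps=8, device="cpu"):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (data) with the
    explicit midpoint method: two model evaluations per step, but typically
    far fewer steps than Euler for comparable accuracy."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        k1 = model(x, t, cond)                 # slope at the interval start
        x_mid = x + 0.5 * dt * k1              # half step to the midpoint
        k2 = model(x_mid, t + 0.5 * dt, cond)  # slope at the midpoint
        x = x + dt * k2                        # full step using the midpoint slope
    return x
```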