

BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models

May 28, 2025
作者: Susan Liang, Dejan Markovic, Israel D. Gebru, Steven Krenn, Todd Keebler, Jacob Sandakly, Frank Yu, Samuel Hassel, Chenliang Xu, Alexander Richard
cs.AI

Abstract
Binaural rendering aims to synthesize binaural audio that mimics natural hearing from a mono audio signal and the locations of the speaker and listener. Although many methods have been proposed to solve this problem, they struggle with rendering quality and streamable inference. Synthesizing high-quality binaural audio that is indistinguishable from real-world recordings requires precise modeling of binaural cues, room reverb, and ambient sounds. Additionally, real-world applications demand streaming inference. To address these challenges, we propose a flow-matching-based streaming binaural speech synthesis framework called BinauralFlow. We treat binaural rendering as a generation problem rather than a regression problem and design a conditional flow matching model to render high-quality audio. Moreover, we design a causal U-Net architecture that estimates the current audio frame solely from past information, tailoring the generative model for streaming inference. Finally, we introduce a continuous inference pipeline incorporating streaming STFT/ISTFT operations, a buffer bank, a midpoint solver, and an early skip schedule to improve rendering continuity and speed. Quantitative and qualitative evaluations demonstrate the superiority of our method over state-of-the-art approaches. A perceptual study further reveals that our model is nearly indistinguishable from real-world recordings, with a 42% confusion rate.
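The abstract mentions a midpoint solver for fast flow matching inference. The paper's exact solver and velocity network are not detailed here, but the standard midpoint (second-order Runge-Kutta) rule for integrating a learned velocity field from noise to data can be sketched as follows; the function names and the toy constant velocity field are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def midpoint_solve(v, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with the midpoint rule.

    Each step evaluates the velocity field twice, but the second-order
    accuracy allows far fewer steps than Euler for the same quality,
    which is why midpoint-style solvers are popular for fast flow
    matching inference.
    """
    x = np.asarray(x0, dtype=float)
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = i * h
        x_mid = x + 0.5 * h * v(x, t)        # half Euler step to the midpoint
        x = x + h * v(x_mid, t + 0.5 * h)    # full step using the midpoint slope
    return x

# Toy example: a constant velocity field v(x, t) = x1 - x0 transports
# x0 to x1 exactly in unit time, regardless of the step count.
x0 = np.zeros(4)
x1 = np.ones(4)
v = lambda x, t: x1 - x0
print(midpoint_solve(v, x0, n_steps=4))  # -> [1. 1. 1. 1.]
```

In a flow matching model, `v` would be the trained conditional velocity network (here conditioned on the mono input and speaker/listener positions), and reducing `n_steps` trades accuracy for inference speed.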