BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models
May 28, 2025
作者: Susan Liang, Dejan Markovic, Israel D. Gebru, Steven Krenn, Todd Keebler, Jacob Sandakly, Frank Yu, Samuel Hassel, Chenliang Xu, Alexander Richard
cs.AI
Abstract
Binaural rendering aims to synthesize binaural audio that mimics natural
hearing based on a mono audio signal and the locations of the speaker and listener.
Although many methods have been proposed to solve this problem, they struggle
with rendering quality and streamable inference. Synthesizing high-quality
binaural audio that is indistinguishable from real-world recordings requires
precise modeling of binaural cues, room reverb, and ambient sounds.
Additionally, real-world applications demand streaming inference. To address
these challenges, we propose a flow matching based streaming binaural speech
synthesis framework called BinauralFlow. We consider binaural rendering to be a
generation problem rather than a regression problem and design a conditional
flow matching model to render high-quality audio. Moreover, we design a causal
U-Net architecture that estimates the current audio frame solely based on past
information to tailor generative models for streaming inference. Finally, we
introduce a continuous inference pipeline incorporating streaming STFT/ISTFT
operations, a buffer bank, a midpoint solver, and an early skip schedule to
improve rendering continuity and speed. Quantitative and qualitative
evaluations demonstrate the superiority of our method over SOTA approaches. A
perceptual study further reveals that our model is nearly indistinguishable
from real-world recordings, with a 42% confusion rate.
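The abstract does not include implementation details, but the two core ideas it names, conditional flow matching training and midpoint-solver inference, follow a standard pattern. Below is a minimal sketch under assumed interfaces; the velocity network `model(x_t, t, cond)` and the conditioning tensor `cond` (mono-audio features plus speaker/listener poses) are hypothetical names, not taken from the paper.

```python
# Minimal sketch (not the paper's implementation): conditional flow matching
# training objective and a midpoint ODE solver for inference, assuming a
# generic velocity-prediction network `model(x_t, t, cond)`.
import torch


def flow_matching_loss(model, x1, cond):
    """Conditional flow matching loss on a batch of target binaural features
    `x1`, conditioned on `cond`. Uses the standard linear interpolation path
    x_t = (1 - t) * x0 + t * x1 with target velocity x1 - x0."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # per-example time in [0, 1]
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))      # broadcast over feature dims
    xt = (1.0 - t_b) * x0 + t_b * x1               # point on the probability path
    v_target = x1 - x0                             # velocity of the linear path
    v_pred = model(xt, t, cond)                    # predicted velocity field
    return torch.mean((v_pred - v_target) ** 2)


@torch.no_grad()
def midpoint_sample(model, cond, shape, num_steps=8, device="cpu"):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (data) with the
    explicit midpoint method: two model evaluations per step, but typically
    far fewer steps than Euler for comparable accuracy."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        k1 = model(x, t, cond)                 # slope at the interval start
        x_mid = x + 0.5 * dt * k1              # half step to the midpoint
        k2 = model(x_mid, t + 0.5 * dt, cond)  # slope at the midpoint
        x = x + dt * k2                        # full step using the midpoint slope
    return x
```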