BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models
May 28, 2025
Authors: Susan Liang, Dejan Markovic, Israel D. Gebru, Steven Krenn, Todd Keebler, Jacob Sandakly, Frank Yu, Samuel Hassel, Chenliang Xu, Alexander Richard
cs.AI
Abstract
Binaural rendering aims to synthesize binaural audio that mimics natural hearing from a mono audio signal and the locations of the speaker and listener. Although many methods have been proposed to solve this problem, they struggle with rendering quality and streamable inference. Synthesizing high-quality binaural audio that is indistinguishable from real-world recordings requires precise modeling of binaural cues, room reverb, and ambient sounds. Additionally, real-world applications demand streaming inference. To address these challenges, we propose a flow-matching-based streaming binaural speech synthesis framework called BinauralFlow. We treat binaural rendering as a generation problem rather than a regression problem and design a conditional flow matching model to render high-quality audio. Moreover, we design a causal U-Net architecture that estimates the current audio frame solely from past information, tailoring the generative model to streaming inference. Finally, we introduce a continuous inference pipeline incorporating streaming STFT/ISTFT operations, a buffer bank, a midpoint solver, and an early skip schedule to improve rendering continuity and speed. Quantitative and qualitative evaluations demonstrate the superiority of our method over state-of-the-art approaches. A perceptual study further reveals that our model is nearly indistinguishable from real-world recordings, with a 42% confusion rate.
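
The causal U-Net described above estimates each audio frame from past information only. The snippet below is a minimal sketch of the standard building block for enforcing this constraint, a left-padded causal 1D convolution; the abstract does not specify the actual BinauralFlow layer configuration, so the channel counts and kernel sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that sees only current and past frames (illustrative)."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # amount of left-only padding
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); pad on the left so output[t] depends only on x[<= t]
        return self.conv(F.pad(x, (self.pad, 0)))

# Usage: output length equals input length, and frame t never sees the future.
y = CausalConv1d(64, 64, kernel_size=3)(torch.randn(1, 64, 100))  # -> (1, 64, 100)
```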
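At inference time, flow matching generates a sample by integrating an ODE dx/dt = v_theta(x, t, cond) from noise (t = 0) to the target signal (t = 1); the midpoint solver mentioned in the abstract spends two velocity evaluations per step but tolerates fewer, larger steps than Euler integration. The loop below is a hypothetical sketch: the signature v_theta(x, t, cond), the step count, and the time range are assumptions, and the paper's early skip schedule is not reproduced here.

```python
import torch

@torch.no_grad()
def midpoint_sample(v_theta, x, cond, num_steps: int = 8):
    """Integrate dx/dt = v_theta(x, t, cond) from t=0 to t=1 with the midpoint rule."""
    h = 1.0 / num_steps
    t = 0.0
    for _ in range(num_steps):
        t_vec = torch.full(x.shape[:1], t, device=x.device)
        k1 = v_theta(x, t_vec, cond)                # slope at the start of the step
        x_mid = x + 0.5 * h * k1                    # half-step Euler prediction
        k2 = v_theta(x_mid, t_vec + 0.5 * h, cond)  # slope re-evaluated at the midpoint
        x = x + h * k2                              # full step using the midpoint slope
        t += h
    return x

# Usage (hypothetical shapes): start from Gaussian noise shaped like the target spectrogram.
# binaural_spec = midpoint_sample(model, torch.randn(1, 2, 257, 200), cond=positions)
```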