

BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models

May 28, 2025
作者: Susan Liang, Dejan Markovic, Israel D. Gebru, Steven Krenn, Todd Keebler, Jacob Sandakly, Frank Yu, Samuel Hassel, Chenliang Xu, Alexander Richard
cs.AI

Abstract

Binaural rendering aims to synthesize binaural audio that mimics natural hearing based on a mono audio and the locations of the speaker and listener. Although many methods have been proposed to solve this problem, they struggle with rendering quality and streamable inference. Synthesizing high-quality binaural audio that is indistinguishable from real-world recordings requires precise modeling of binaural cues, room reverb, and ambient sounds. Additionally, real-world applications demand streaming inference. To address these challenges, we propose a flow matching based streaming binaural speech synthesis framework called BinauralFlow. We consider binaural rendering to be a generation problem rather than a regression problem and design a conditional flow matching model to render high-quality audio. Moreover, we design a causal U-Net architecture that estimates the current audio frame solely based on past information to tailor generative models for streaming inference. Finally, we introduce a continuous inference pipeline incorporating streaming STFT/ISTFT operations, a buffer bank, a midpoint solver, and an early skip schedule to improve rendering continuity and speed. Quantitative and qualitative evaluations demonstrate the superiority of our method over SOTA approaches. A perceptual study further reveals that our model is nearly indistinguishable from real-world recordings, with a 42% confusion rate.
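Two of the components named in the abstract, causal convolution (the building block of a causal U-Net) and midpoint-rule ODE sampling for a flow matching model, can be illustrated with a minimal sketch. This is not the paper's implementation: `velocity_fn` stands in for the trained conditional flow matching network, and the scalar state stands in for an audio frame; all names here are hypothetical.

```python
def causal_conv1d(x, kernel):
    """Causal 1-D convolution: y[t] depends only on x[0..t].
    Causality is enforced by left-padding with kernel_size - 1 zeros,
    so the current output never sees future samples."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(kernel[j] * padded[t + k - 1 - j] for j in range(k))
            for t in range(len(x))]

def midpoint_sample(velocity_fn, x0, n_steps=4):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with the midpoint rule,
    the second-order ODE solver used for flow matching sampling."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        # Half step to the midpoint, then a full step using the
        # midpoint velocity (one second-order update per step).
        x_mid = x + 0.5 * dt * velocity_fn(x, t)
        x = x + dt * velocity_fn(x_mid, t + 0.5 * dt)
    return x

# Causal smoothing filter: output at t mixes only x[t] and x[t-1].
print(causal_conv1d([1.0, 2.0, 3.0], [0.5, 0.5]))  # → [0.5, 1.5, 2.5]
# Constant velocity field: the midpoint rule recovers x0 + v exactly.
print(midpoint_sample(lambda x, t: 1.0, 0.0))       # → 1.0
```

The midpoint rule needs two velocity evaluations per step but is second-order accurate, so it can reach a given sample quality with fewer steps than Euler integration, which is why a solver of this kind helps streaming-speed inference.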

