WavFlow: Audio Generation in Waveform Space
May 18, 2026
Authors: Feiyan Zhou, Luyuan Wang, Shoufa Chen, Zhe Wang, Zhiheng Liu, Yuren Cong, Xiaohui Zhang, Fanny Yang, Belinda Zeng
cs.AI
Abstract
Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding the performance of established latent-based methods. Our work demonstrates that intermediate compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal audio generation.
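The abstract names three ingredients (waveform patchify, amplitude lifting, and direct x-prediction in flow matching) without giving their exact form. The following is a minimal NumPy sketch of how they could fit together, not the authors' implementation. The function names, the patch size, the grid width, the constant-gain form of amplitude lifting, and the linear noise-to-data interpolation path are all assumptions made for illustration.

```python
import numpy as np

def patchify(wave, patch_size, grid_width):
    """Reshape a 1D waveform into a (rows, grid_width, patch_size) token grid.

    Assumed layout: consecutive samples fill one patch, consecutive patches
    fill one row. The waveform is zero-padded to a whole number of rows.
    """
    chunk = patch_size * grid_width
    pad = (-len(wave)) % chunk
    padded = np.pad(wave, (0, pad))
    return padded.reshape(-1, grid_width, patch_size)

def unpatchify(grid, length):
    """Invert patchify and drop the zero padding."""
    return grid.reshape(-1)[:length]

def lift_amplitude(wave, gain):
    """Hypothetical amplitude lifting: a constant gain that brings a
    low-energy waveform closer to the scale of unit-variance noise."""
    return wave * gain

def x_prediction_loss(model, x, rng):
    """Flow-matching loss with direct x-prediction (assumed linear path).

    x_t interpolates between Gaussian noise (t = 0) and data (t = 1);
    the model is asked to predict the clean data x, not the velocity.
    """
    t = rng.uniform(0.0, 1.0)
    noise = rng.standard_normal(x.shape)
    x_t = (1.0 - t) * noise + t * x
    x_pred = model(x_t, t)
    return float(np.mean((x_pred - x) ** 2))
```

Under these assumptions, a 1000-sample waveform with `patch_size=16` and `grid_width=8` is padded to 1024 samples and becomes an 8 x 8 grid of 16-sample tokens. With this linear path, a velocity estimate can be recovered from the predicted data as `(x_pred - x_t) / (1 - t)` for `t < 1`.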