WavFlow: Audio Generation in Waveform Space

Abstract

Modern audio generation relies predominantly on latent-space compression, which adds complexity and risks information loss. We challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space, without intermediate representations. Raw waveforms are high-dimensional and low-energy, which makes them difficult to model. To address this, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, which enables stable optimization with direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we use an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. On the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), WavFlow matches or exceeds established latent-based methods. These results demonstrate that intermediate compression is not a prerequisite for high-quality synthesis, and they offer a simpler and more scalable alternative for multimodal audio generation.
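
The three ingredients named in the abstract (waveform patchify, amplitude lifting, and x-prediction in flow matching) can be illustrated with a minimal NumPy sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration rather than the WavFlow method itself: the function names, the patch size of 256, the reading of a "2D token grid" as a (tokens, samples-per-token) array, the use of RMS normalization as a stand-in for amplitude lifting, and the linear noise-to-data interpolation path.

```python
import numpy as np


def patchify(wave, patch_size):
    """Split a 1-D waveform into non-overlapping patches.

    Returns an array of shape (num_patches, patch_size); trailing samples
    that do not fill a whole patch are dropped. (Hypothetical layout.)
    """
    num_patches = len(wave) // patch_size
    return wave[: num_patches * patch_size].reshape(num_patches, patch_size)


def unpatchify(tokens):
    """Inverse of patchify: flatten the grid back into a 1-D waveform."""
    return tokens.reshape(-1)


def lift_amplitude(wave, target_rms=1.0, eps=1e-8):
    """Assumed stand-in for amplitude lifting: rescale a low-energy signal
    so its RMS is close to target_rms. Returns the scaled wave and the gain
    so the scaling can be undone after generation."""
    rms = np.sqrt(np.mean(wave ** 2))
    gain = target_rms / (rms + eps)
    return wave * gain, gain


def x_prediction_loss(predict_x, x, rng):
    """Flow-matching training loss with x-prediction, under an assumed
    linear path x_t = (1 - t) * noise + t * x. The model is asked to
    predict the clean sample x directly from the noisy sample x_t."""
    t = rng.uniform()
    noise = rng.standard_normal(x.shape)
    x_t = (1.0 - t) * noise + t * x
    x_hat = predict_x(x_t, t)
    return float(np.mean((x_hat - x) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # One second of a quiet 440 Hz tone at 16 kHz (synthetic test signal).
    time = np.arange(16000) / 16000.0
    wave = 0.01 * np.sin(2 * np.pi * 440.0 * time)

    lifted, gain = lift_amplitude(wave)
    tokens = patchify(lifted, patch_size=256)
    print(tokens.shape)

    # A trivial "model" that ignores its input, only to exercise the loss.
    loss = x_prediction_loss(lambda x_t, t: np.zeros_like(x_t), tokens, rng)
    print(loss)
```

In this sketch, lifting happens before patchifying so that every token sees the same gain, and the gain is returned so that a generated waveform could be divided by it to restore the original scale. Whether WavFlow applies a global gain, a per-sample gain, or a nonlinear mapping is not stated in the abstract.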