WavFlow: Audio Generation in Waveform Space

Abstract

Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding the performance of established latent-based methods. Our work demonstrates that intermediate compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal audio generation.
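
The abstract names three ingredients (waveform patchify, amplitude lifting, and x-prediction in flow matching) without giving their exact form. The following is a minimal NumPy sketch of how they could fit together, not the authors' implementation. Every function name, the grid layout (rows, columns, samples per patch), the constant-gain form of amplitude lifting, and the linear interpolation path x_t = (1 - t) * noise + t * x are assumptions made for illustration.

```python
import numpy as np


def patchify(wave, patch_size, grid_width):
    """Reshape a 1D waveform into a (rows, grid_width, patch_size) token grid.

    Assumption: each token is a contiguous run of `patch_size` samples, and
    tokens are laid out row by row. The waveform is zero-padded at the end
    so that it fills a whole number of rows.
    """
    samples_per_row = patch_size * grid_width
    pad = (-len(wave)) % samples_per_row
    padded = np.pad(wave, (0, pad))
    return padded.reshape(-1, grid_width, patch_size)


def unpatchify(grid, length):
    """Invert patchify and drop the zero padding."""
    return grid.reshape(-1)[:length]


def lift_amplitude(x, gain):
    """Hypothetical amplitude lifting: scale the low-energy signal by a
    constant gain so its magnitude is closer to that of unit-variance noise."""
    return x * gain


def interpolate(x, noise, t):
    """Point on the assumed linear path from noise (t = 0) to data (t = 1)."""
    return (1.0 - t) * noise + t * x


def velocity_from_x_prediction(x_pred, x_t, t, eps=1e-4):
    """Convert a clean-signal prediction into a velocity for sampling.

    On the linear path above the velocity is x - noise, and since
    x - x_t = (1 - t) * (x - noise), it equals (x - x_t) / (1 - t).
    The denominator is clamped to avoid division by zero near t = 1.
    """
    return (x_pred - x_t) / max(1.0 - t, eps)


def x_prediction_loss(x_pred, x):
    """Mean squared error between the predicted and the true clean signal."""
    return float(np.mean((x_pred - x) ** 2))
```

In this sketch a model would take the noisy grid `interpolate(x, noise, t)` together with `t` and any conditioning, output `x_pred`, and be trained with `x_prediction_loss`. At sampling time `velocity_from_x_prediction` would turn that output into the update direction for an ODE solver, and the final grid would be divided by the gain and passed through `unpatchify` to recover a waveform.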