FoleyGen: Visually-Guided Audio Generation

Abstract

Recent advances in audio generation have been driven by large-scale deep learning models and expansive datasets. However, video-to-audio (V2A) generation remains challenging, principally because of the intricate relationship between high-dimensional visual and auditory data and the difficulty of temporal synchronization. In this study, we introduce FoleyGen, an open-domain V2A generation system built on a language modeling paradigm. FoleyGen leverages an off-the-shelf neural audio codec for bidirectional conversion between waveforms and discrete tokens. Audio tokens are generated by a single Transformer model conditioned on visual features extracted from a visual encoder. A prevalent problem in V2A generation is the misalignment of generated audio with the visible actions in the video. To address this, we explore three novel visual attention mechanisms. We further undertake an exhaustive evaluation of multiple visual encoders, each pretrained on either single-modal or multi-modal tasks. Experimental results on the VGGSound dataset show that FoleyGen outperforms previous systems across all objective metrics and in human evaluations.
