Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

Abstract

We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos must be aligned with the input audio both globally and temporally: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it enables video generation conditioned on text, on audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples, and further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. Compared with recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, with respect to both content and the temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse.
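
The abstract describes AV-Align only as detecting and comparing energy peaks in the two modalities, without giving a formula. The sketch below is therefore an illustration under our own assumptions, not the metric as defined in the paper: audio energy is taken as mean squared amplitude per video frame, video "energy" as mean absolute difference between consecutive frames, and two peaks count as matched when they fall within a `tolerance` of a few frames. All function names and the `tolerance` parameter are hypothetical.

```python
import numpy as np

def local_peaks(signal, threshold):
    """Indices of local maxima whose value exceeds `threshold`."""
    s = np.asarray(signal, dtype=float)
    if s.size < 3:
        return np.array([], dtype=int)
    mid = s[1:-1]
    is_peak = (mid > s[:-2]) & (mid >= s[2:]) & (mid > threshold)
    return np.flatnonzero(is_peak) + 1

def audio_energy_per_frame(audio, sample_rate, fps, num_frames):
    """Mean squared amplitude of the audio samples under each video frame."""
    audio = np.asarray(audio, dtype=float)
    hop = sample_rate / fps  # audio samples per video frame
    energy = np.zeros(num_frames)
    for i in range(num_frames):
        chunk = audio[int(round(i * hop)):int(round((i + 1) * hop))]
        if chunk.size:
            energy[i] = np.mean(chunk ** 2)
    return energy

def video_motion_per_frame(frames):
    """Mean absolute change from the previous frame (assumed stand-in for
    visual energy); frame 0 has no predecessor and gets 0."""
    f = np.asarray(frames, dtype=float)
    diff = np.abs(np.diff(f, axis=0)).reshape(f.shape[0] - 1, -1).mean(axis=1)
    return np.concatenate([[0.0], diff])

def peak_alignment_score(audio_peaks, video_peaks, tolerance=1):
    """Fraction of peaks (from both modalities) that have a counterpart in
    the other modality within `tolerance` frames. Returns 0.0 if either
    modality has no peaks."""
    a = np.asarray(audio_peaks)
    v = np.asarray(video_peaks)
    if a.size == 0 or v.size == 0:
        return 0.0
    dist = np.abs(a[:, None] - v[None, :])
    matched_a = int((dist.min(axis=1) <= tolerance).sum())
    matched_v = int((dist.min(axis=0) <= tolerance).sum())
    return float((matched_a + matched_v) / (a.size + v.size))
```

A score near 1 under this sketch means most audio peaks coincide with visual changes and vice versa; a score near 0 means the two peak sets rarely coincide. A practical version would likely use a proper onset detector for audio and optical flow for video, both of which are beyond what the abstract specifies.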
