SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation

Abstract

Audio-driven video generation aims to synthesize realistic videos that align with input audio recordings, akin to the human ability to visualize scenes from auditory input. However, existing approaches predominantly exploit semantic information, such as the classes of sounding sources present in the audio, which limits their ability to generate videos with accurate content and spatial composition. In contrast, humans not only identify the semantic categories of sounding sources naturally but also determine their deeply encoded spatial attributes, including locations and movement directions. This information can be recovered by considering specific spatial indicators derived from the inherent physical properties of sound, such as loudness or frequency. As prior methods largely ignore this factor, we present SpA2V, the first framework that explicitly exploits these spatial auditory cues from audio to generate videos with high semantic and spatial correspondence. SpA2V decomposes the generation process into two stages: 1) Audio-guided Video Planning: We carefully adapt a state-of-the-art MLLM to the novel task of harnessing spatial and semantic cues from input audio to construct Video Scene Layouts (VSLs), which serve as an intermediate representation that bridges the gap between the audio and video modalities. 2) Layout-grounded Video Generation: We develop an efficient and effective approach that seamlessly integrates VSLs as conditional guidance into pre-trained diffusion models, enabling VSL-grounded video generation in a training-free manner. Extensive experiments demonstrate that SpA2V excels at generating realistic videos that are semantically and spatially aligned with the input audio.
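To make the idea of a loudness-derived spatial indicator concrete, the following is a minimal sketch of one such cue: the level difference between the two channels of a stereo recording, tracked over time to obtain a coarse left/right position and movement direction. This is an illustration only and is not the method used by SpA2V, which delegates this reasoning to an MLLM. The stereo input, the function names (`rms`, `pan_track`, `movement_direction`, `make_panned_tone`), the window size, and the threshold are all assumptions introduced here.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples (0.0 if empty)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def pan_track(left, right, window):
    """Per-window left/right balance in [-1, 1].

    -1 means all energy is in the left channel, +1 all in the right.
    Trailing samples that do not fill a whole window are ignored.
    """
    track = []
    usable = min(len(left), len(right))
    for start in range(0, usable - window + 1, window):
        l = rms(left[start:start + window])
        r = rms(right[start:start + window])
        total = l + r
        track.append(0.0 if total == 0 else (r - l) / total)
    return track


def movement_direction(track, threshold=0.1):
    """Coarse motion label from the balance change between first and last window."""
    if len(track) < 2:
        return "static"
    delta = track[-1] - track[0]
    if delta > threshold:
        return "left-to-right"
    if delta < -threshold:
        return "right-to-left"
    return "static"


def make_panned_tone(n=8000, rate=8000, freq=440.0):
    """Synthetic test signal (hypothetical): a tone panned linearly from left to right."""
    left, right = [], []
    for i in range(n):
        pos = i / (n - 1)  # 0.0 = fully left, 1.0 = fully right
        s = math.sin(2 * math.pi * freq * i / rate)
        left.append((1 - pos) * s)
        right.append(pos * s)
    return left, right
```

Calling `movement_direction(pan_track(*make_panned_tone(), window=1000))` on the synthetic tone yields a left-to-right label. A real system would need further cues, such as overall loudness change for depth or frequency shift for approach speed, to recover the richer spatial attributes the abstract describes.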
