Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction

Abstract

This study focuses on Text-to-Sounding-Video (T2SV) generation, a challenging yet promising task that aims to generate a video with synchronized audio from text conditions while ensuring that both modalities are aligned with the text. Despite progress in joint audio-video training, two critical challenges remain unaddressed: (1) a single text caption shared by the video and audio branches often creates modal interference, confusing the pretrained backbones, and (2) the optimal mechanism for cross-modal feature interaction remains unclear. To address these challenges, we first propose the Hierarchical Visual-Grounded Captioning (HVGC) framework, which generates pairs of disentangled captions, a video caption and an audio caption, eliminating interference at the conditioning stage. Building on HVGC, we further introduce BridgeDiT, a novel dual-tower diffusion transformer that employs a Dual CrossAttention (DCA) mechanism. DCA acts as a robust "bridge" that enables a symmetric, bidirectional exchange of information between the two towers, achieving both semantic and temporal synchronization. Extensive experiments on three benchmark datasets, supported by human evaluations, demonstrate that our method achieves state-of-the-art results on most metrics. Comprehensive ablation studies further validate the effectiveness of our contributions and offer key insights for future T2SV research. All code and checkpoints will be publicly released.
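
The abstract does not spell out how DCA is built. Purely as an illustration of what a symmetric, bidirectional cross-attention exchange between two token streams can look like, the sketch below assumes single-head scaled dot-product attention with no learned projections and a residual add. The function names (`cross_attend`, `dual_cross_attention`) and the toy token values are hypothetical and are not taken from BridgeDiT.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned Q/K/V projections; the context tokens serve
    as both keys and values.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out


def dual_cross_attention(video_tokens, audio_tokens):
    """Symmetric exchange: each stream attends to the other stream's
    original tokens, and the result is added back as a residual."""
    video_from_audio = cross_attend(video_tokens, audio_tokens)
    audio_from_video = cross_attend(audio_tokens, video_tokens)
    new_video = [[x + y for x, y in zip(v, u)]
                 for v, u in zip(video_tokens, video_from_audio)]
    new_audio = [[x + y for x, y in zip(a, u)]
                 for a, u in zip(audio_tokens, audio_from_video)]
    return new_video, new_audio


# Hypothetical toy inputs: 3 video tokens and 2 audio tokens, feature dim 2.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0]]
new_video, new_audio = dual_cross_attention(video, audio)
print(len(new_video), len(new_audio))  # prints "3 2"
```

Both directions read from the original tokens rather than from each other's updated output, so neither modality is privileged; this is one simple way to read "symmetric" and is an assumption of the sketch.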
