

Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction

October 3, 2025
作者: Kaisi Guan, Xihua Wang, Zhengfeng Lai, Xin Cheng, Peng Zhang, XiaoJiang Liu, Ruihua Song, Meng Cao
cs.AI

Abstract

This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation, which aims to generate a video with synchronized audio from text conditions, while ensuring both modalities remain aligned with the text. Despite progress in joint audio-video training, two critical challenges remain unaddressed: (1) a single, shared text caption, in which the video and the audio are conditioned on identical text, often creates modal interference that confuses the pretrained backbones; and (2) the optimal mechanism for cross-modal feature interaction remains unclear. To address these challenges, we first propose the Hierarchical Visual-Grounded Captioning (HVGC) framework, which generates pairs of disentangled captions, one video caption and one audio caption, eliminating interference at the conditioning stage. Building on HVGC, we further introduce BridgeDiT, a novel dual-tower diffusion transformer that employs a Dual CrossAttention (DCA) mechanism acting as a robust "bridge" to enable a symmetric, bidirectional exchange of information, achieving both semantic and temporal synchronization. Extensive experiments on three benchmark datasets, supported by human evaluations, demonstrate that our method achieves state-of-the-art results on most metrics. Comprehensive ablation studies further validate the effectiveness of our contributions and offer key insights for future T2SV research. All code and checkpoints will be publicly released.
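To make the symmetric, bidirectional exchange concrete, the following is a minimal NumPy sketch of a dual cross-attention step between a video-token tower and an audio-token tower. This is an illustrative assumption, not the paper's implementation: learned query/key/value projections, multi-head splitting, normalization, and the surrounding diffusion-transformer blocks are all omitted, and the function names (`cross_attention`, `dual_cross_attention`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens):
    # single-head scaled dot-product cross-attention
    # (learned Q/K/V projections omitted for brevity)
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)   # (n_q, n_kv)
    return softmax(scores) @ kv_tokens             # (n_q, d)

def dual_cross_attention(video_tokens, audio_tokens):
    # symmetric exchange: each tower attends to the other,
    # with a residual connection back to its own stream
    video_out = video_tokens + cross_attention(video_tokens, audio_tokens)
    audio_out = audio_tokens + cross_attention(audio_tokens, video_tokens)
    return video_out, audio_out

rng = np.random.default_rng(0)
v = rng.standard_normal((16, 64))   # 16 video tokens, dim 64
a = rng.standard_normal((32, 64))   # 32 audio tokens, dim 64
v2, a2 = dual_cross_attention(v, a)
print(v2.shape, a2.shape)           # each tower keeps its own token count and width
```

Because both directions are computed from the same pair of token streams, the exchange is symmetric by construction, which is the property the abstract credits for semantic and temporal synchronization.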