
Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction

October 3, 2025
作者: Kaisi Guan, Xihua Wang, Zhengfeng Lai, Xin Cheng, Peng Zhang, XiaoJiang Liu, Ruihua Song, Meng Cao
cs.AI

Abstract

This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation, which aims to generate a video with synchronized audio from text conditions, while ensuring both modalities remain aligned with the text. Despite progress in joint audio-video training, two critical challenges remain unaddressed: (1) a single, shared text caption, in which the text for video is identical to the text for audio, often creates modal interference that confuses the pretrained backbones, and (2) the optimal mechanism for cross-modal feature interaction remains unclear. To address these challenges, we first propose the Hierarchical Visual-Grounded Captioning (HVGC) framework, which generates pairs of disentangled captions, a video caption and an audio caption, eliminating interference at the conditioning stage. Based on HVGC, we further introduce BridgeDiT, a novel dual-tower diffusion transformer, which employs a Dual CrossAttention (DCA) mechanism that acts as a robust "bridge" to enable a symmetric, bidirectional exchange of information, achieving both semantic and temporal synchronization. Extensive experiments on three benchmark datasets, supported by human evaluations, demonstrate that our method achieves state-of-the-art results on most metrics. Comprehensive ablation studies further validate the effectiveness of our contributions, offering key insights for the future T2SV task. All the codes and checkpoints will be publicly released.
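The abstract describes the DCA mechanism only at a high level (each tower attends to the other's features, symmetrically). The paper's actual layer design, projections, and head counts are not given here, but the general idea of a symmetric, bidirectional cross-attention exchange can be sketched as follows. This is a minimal single-head NumPy illustration with no learned projection matrices; the function names and the residual-update form are assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d):
    # queries from one tower attend over keys/values from the other tower
    scores = queries @ context.T / np.sqrt(d)      # (n_q, n_ctx)
    return softmax(scores, axis=-1) @ context      # (n_q, d)

def dual_cross_attention(video_tokens, audio_tokens):
    """Symmetric, bidirectional exchange: video attends to audio,
    audio attends to video, each with a residual connection."""
    d = video_tokens.shape[-1]
    video_out = video_tokens + cross_attention(video_tokens, audio_tokens, d)
    audio_out = audio_tokens + cross_attention(audio_tokens, video_tokens, d)
    return video_out, audio_out

rng = np.random.default_rng(0)
v = rng.normal(size=(8, 16))    # 8 video tokens, dim 16
a = rng.normal(size=(12, 16))   # 12 audio tokens, dim 16
v_out, a_out = dual_cross_attention(v, a)
```

Both towers keep their own token counts and dimensionality; only information is exchanged, which is consistent with the abstract's description of the DCA acting as a "bridge" between otherwise separate video and audio streams.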