Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
February 24, 2026
Authors: Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji
cs.AI
Abstract
Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones at test time. To this end, we present MMHNet, a multimodal hierarchical network that extends state-of-the-art video-to-audio models. Our approach integrates a hierarchical method with non-causal Mamba to support long-form audio generation. The proposed method significantly improves long audio generation, supporting durations of more than 5 minutes. We also demonstrate that training on short clips and testing on long ones is feasible in video-to-audio generation, without ever training on longer durations. Our experiments show that the proposed method achieves strong results on long-video-to-audio benchmarks, outperforming prior video-to-audio work. Moreover, we showcase our model's ability to generate continuously for more than 5 minutes, where prior video-to-audio methods fall short.
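The abstract names non-causal Mamba as the mechanism behind long-form generation. Below is a minimal sketch of one common way to build a non-causal (bidirectional) Mamba block in PyTorch, using the open-source mamba_ssm package. MMHNet's actual architecture and its hierarchical component are not specified here, so the class and parameter names in this sketch are hypothetical illustrations of the general technique, not the authors' implementation.

# Hypothetical sketch of a non-causal (bidirectional) Mamba block, assuming
# PyTorch and the open-source mamba_ssm package. It illustrates the general
# technique named in the abstract, not the authors' MMHNet code.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (requires a CUDA GPU)

class NonCausalMambaBlock(nn.Module):
    """Runs a causal Mamba scan in both time directions and fuses the two,
    so every position can also condition on future frames (non-causal)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.fwd = Mamba(d_model=d_model)  # left-to-right scan
        self.bwd = Mamba(d_model=d_model)  # right-to-left scan
        self.norm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model), e.g. per-frame audio/video latents
        h = self.norm(x)
        out_fwd = self.fwd(h)
        out_bwd = self.bwd(h.flip(dims=[1])).flip(dims=[1])  # reverse time axis
        return x + self.proj(torch.cat([out_fwd, out_bwd], dim=-1))

# Usage: the recurrent SSM state has no fixed positional table, so the block
# accepts sequences much longer than any training clip without modification.
device = "cuda"  # mamba_ssm's selective-scan kernel runs on CUDA
block = NonCausalMambaBlock(d_model=256).to(device)
x = torch.randn(2, 4096, 256, device=device)  # a sequence longer than a short training clip
y = block(x)
assert y.shape == x.shape

Because the state-space recurrence carries no position-indexed parameters, such a block can in principle be unrolled over sequences far longer than those seen during training, which is one plausible route to the "train short, test long" behavior described above.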