MusicInfuser: Making Video Diffusion Listen and Dance
March 18, 2025
Authors: Susung Hong, Ira Kemelmacher-Shlizerman, Brian Curless, Steven M. Seitz
cs.AI
Abstract
We introduce MusicInfuser, an approach for generating high-quality dance videos that are synchronized to a specified music track. Rather than attempting to design and train a new multimodal audio-video model, we show how existing video diffusion models can be adapted to align with musical inputs by introducing lightweight music-video cross-attention and a low-rank adapter. Unlike prior work requiring motion capture data, our approach fine-tunes only on dance videos. MusicInfuser achieves high-quality music-driven video generation while preserving the flexibility and generative capabilities of the underlying models. We introduce an evaluation framework using Video-LLMs to assess multiple dimensions of dance generation quality. The project page and code are available at https://susunghong.github.io/MusicInfuser.
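
The abstract describes two adaptation ingredients: a lightweight music-video cross-attention layer and a low-rank adapter on top of a pretrained video diffusion model. The sketch below is only a minimal illustration of those two ideas in PyTorch and is not the authors' implementation; all module names, dimensions (d_video, d_audio), the rank, and the residual wiring are assumptions made for the example.

```python
# Illustrative sketch: music-video cross-attention with a LoRA-style low-rank
# adapter. NOT the MusicInfuser implementation; names and dimensions are
# hypothetical examples.
import torch
import torch.nn.functional as F
from torch import nn


class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update: W x + B(A x)."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)            # pretrained weights stay frozen
        self.lora_down = nn.Linear(d_in, rank, bias=False)   # A: d_in -> rank
        self.lora_up = nn.Linear(rank, d_out, bias=False)    # B: rank -> d_out
        nn.init.zeros_(self.lora_up.weight)    # adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_up(self.lora_down(x))


class MusicVideoCrossAttention(nn.Module):
    """Video tokens (queries) attend to music/audio tokens (keys and values)."""

    def __init__(self, d_video: int = 1024, d_audio: int = 768,
                 n_heads: int = 8, rank: int = 8):
        super().__init__()
        assert d_video % n_heads == 0
        self.n_heads, self.head_dim = n_heads, d_video // n_heads
        self.to_q = LoRALinear(d_video, d_video, rank)
        self.to_k = LoRALinear(d_audio, d_video, rank)
        self.to_v = LoRALinear(d_audio, d_video, rank)
        self.to_out = LoRALinear(d_video, d_video, rank)

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        return x.view(b, n, self.n_heads, self.head_dim).transpose(1, 2)

    def forward(self, video_tokens: torch.Tensor,
                audio_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N_vid, d_video); audio_tokens: (B, N_aud, d_audio)
        q = self._split_heads(self.to_q(video_tokens))
        k = self._split_heads(self.to_k(audio_tokens))
        v = self._split_heads(self.to_v(audio_tokens))
        out = F.scaled_dot_product_attention(q, k, v)   # (B, H, N_vid, head_dim)
        out = out.transpose(1, 2).reshape(*video_tokens.shape)
        # Residual connection keeps the pretrained video pathway intact.
        return video_tokens + self.to_out(out)


if __name__ == "__main__":
    layer = MusicVideoCrossAttention()
    video = torch.randn(2, 64, 1024)   # e.g. flattened spatio-temporal video latents
    music = torch.randn(2, 32, 768)    # e.g. per-frame music features
    print(layer(video, music).shape)   # torch.Size([2, 64, 1024])
```

In a setup like this, only the newly added cross-attention projections and low-rank matrices would be trained during fine-tuning on dance videos, while the pretrained video diffusion weights remain frozen, which is consistent with the abstract's claim of preserving the underlying model's generative capabilities.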