MusicInfuser: Making Video Diffusion Listen and Dance
March 18, 2025
Authors: Susung Hong, Ira Kemelmacher-Shlizerman, Brian Curless, Steven M. Seitz
cs.AI
Abstract
We introduce MusicInfuser, an approach for generating high-quality dance
videos that are synchronized to a specified music track. Rather than attempting
to design and train a new multimodal audio-video model, we show how existing
video diffusion models can be adapted to align with musical inputs by
introducing lightweight music-video cross-attention and a low-rank adapter.
Unlike prior work requiring motion capture data, our approach fine-tunes only
on dance videos. MusicInfuser achieves high-quality music-driven video
generation while preserving the flexibility and generative capabilities of the
underlying models. We introduce an evaluation framework using Video-LLMs to
assess multiple dimensions of dance generation quality. The project page and
code are available at https://susunghong.github.io/MusicInfuser.
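To make the adaptation mechanism concrete, the following is a minimal sketch of what a music-video cross-attention block combined with a low-rank adapter could look like. It is not the authors' implementation; the module name, dimensions, and the placement of the LoRA-style residual are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MusicVideoCrossAttention(nn.Module):
    """Hypothetical block: video latent tokens attend to music feature tokens,
    with a low-rank residual so the frozen base model is only lightly adapted.
    All dimensions are illustrative placeholders."""
    def __init__(self, video_dim=1024, audio_dim=768, num_heads=8, lora_rank=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=video_dim, kdim=audio_dim, vdim=audio_dim,
            num_heads=num_heads, batch_first=True)
        # Low-rank (LoRA-style) adapter on the attended features.
        self.lora_down = nn.Linear(video_dim, lora_rank, bias=False)
        self.lora_up = nn.Linear(lora_rank, video_dim, bias=False)
        nn.init.zeros_(self.lora_up.weight)  # adapter starts as a no-op

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (batch, n_video_tokens, video_dim)
        # audio_tokens: (batch, n_audio_tokens, audio_dim)
        attended, _ = self.attn(video_tokens, audio_tokens, audio_tokens)
        return video_tokens + attended + self.lora_up(self.lora_down(attended))


# Usage with random tensors standing in for real video latents and music features.
video = torch.randn(2, 256, 1024)  # e.g. flattened spatio-temporal latents
audio = torch.randn(2, 64, 768)    # e.g. music embedding sequence
out = MusicVideoCrossAttention()(video, audio)
print(out.shape)  # torch.Size([2, 256, 1024])
```

In this sketch, only the cross-attention and adapter weights would be trained while the underlying video diffusion model stays frozen, which matches the paper's stated goal of preserving the base model's generative capabilities.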