AutoMV: An Automatic Multi-Agent System for Music Video Generation

December 13, 2025
Authors: Xiaoxuan Tang, Xinping Lei, Chaoran Zhu, Shiyun Chen, Ruibin Yuan, Yizhi Li, Changjae Oh, Ge Zhang, Wenhao Huang, Emmanouil Benetos, Yang Liu, Jiaheng Liu, Yinghao Ma
cs.AI

Abstract

Music-to-Video (M2V) generation for full-length songs faces significant challenges. Existing methods produce short, disjointed clips that fail to align visuals with musical structure, beats, or lyrics, and that lack temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes, such as structure, vocal tracks, and time-aligned lyrics, and packages these features as contextual inputs for the downstream agents. The Screenwriter Agent and Director Agent then use this information to design a short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call an image generator for keyframes and different video generators for "story" or "singer" scenes. A Verifier Agent evaluates their output, enabling multi-agent collaboration to produce a coherent long-form MV. To evaluate M2V generation, we further propose a benchmark with four high-level categories (Music Content, Technical, Post-production, Art) and twelve fine-grained criteria. Expert human raters applied this benchmark to compare commercial products, AutoMV, and human-directed MVs: AutoMV significantly outperforms current baselines across all four categories, narrowing the gap to professional MVs. Finally, we investigate using large multimodal models as automatic MV judges; while promising, they still lag behind human experts, highlighting room for future work.
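The pipeline described in the abstract (music analysis, planning, scene-specific generation, verification) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Segment`, `Shot`, `plan_shots`, `generate_clip`, `verify`, `make_music_video`) is hypothetical, the generators are string-returning stand-ins, and the rule that segments with lyrics become "singer" scenes is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One structural section of the song, as a music-analysis step might emit."""
    label: str          # e.g. "intro", "verse", "chorus"
    start: float        # seconds
    end: float          # seconds
    lyrics: str = ""    # time-aligned lyrics for this section, if any


@dataclass
class Shot:
    segment: Segment
    scene_type: str     # "story" or "singer"
    prompt: str
    clip: str = ""


def plan_shots(segments, character_bank):
    """Stand-in for the screenwriter and director agents: one shot per segment."""
    shots = []
    for seg in segments:
        # Assumption for illustration only: sung segments become "singer" scenes.
        scene_type = "singer" if seg.lyrics else "story"
        prompt = f"{character_bank['lead']} during the {seg.label}"
        shots.append(Shot(seg, scene_type, prompt))
    return shots


def generate_clip(shot):
    """Stand-in for keyframe and video generation, routed by scene type."""
    generators = {"story": "story_video_model", "singer": "singer_video_model"}
    duration = shot.segment.end - shot.segment.start
    return f"{generators[shot.scene_type]}:{shot.prompt}:{duration:.1f}s"


def verify(shot):
    """Stand-in for the verifier agent: accept any non-empty clip."""
    return bool(shot.clip)


def make_music_video(segments, character_bank, max_retries=2):
    """Plan shots, generate each clip, and regenerate until the verifier accepts."""
    shots = plan_shots(segments, character_bank)
    for shot in shots:
        for _ in range(max_retries + 1):
            shot.clip = generate_clip(shot)
            if verify(shot):
                break
    return [shot.clip for shot in shots]


segments = [
    Segment("intro", 0.0, 8.0),
    Segment("verse", 8.0, 30.5, lyrics="first line of the verse"),
]
clips = make_music_video(segments, {"lead": "a singer in a red coat"})
```

Keeping the character description in one shared dictionary mirrors the idea of a shared character bank: every shot prompt draws on the same profile, which is one plausible way to keep a character consistent across clips.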