AutoMV: An Automatic Multi-Agent System for Music Video Generation
December 13, 2025
Authors: Xiaoxuan Tang, Xinping Lei, Chaoran Zhu, Shiyun Chen, Ruibin Yuan, Yizhi Li, Changjae Oh, Ge Zhang, Wenhao Huang, Emmanouil Benetos, Yang Liu, Jiaheng Liu, Yinghao Ma
cs.AI
Abstract
Music-to-video (M2V) generation for full-length songs faces significant challenges. Existing methods produce short, disjointed clips that fail to align visuals with musical structure, beats, or lyrics and that lack temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes, such as structure, vocal tracks, and time-aligned lyrics, and passes these features as contextual input to the downstream agents. The Screenwriter Agent and Director Agent then use this information to design a short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call an image generator for keyframes and different video generators for "story" or "singer" scenes. A Verifier Agent evaluates their output, enabling multi-agent collaboration to produce a coherent long-form MV. To evaluate M2V generation, we further propose a benchmark with four high-level categories (Music Content, Technical, Post-production, Art) and twelve fine-grained criteria. Expert human raters applied this benchmark to compare commercial products, AutoMV, and human-directed MVs: AutoMV significantly outperforms current baselines across all four categories, narrowing the gap to professional MVs. Finally, we investigate using large multimodal models as automatic MV judges; while promising, they still lag behind human experts, highlighting room for future work.
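The control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`Segment`, `Shot`, `plan_shots`, `generate_with_verification`, `make_music_video`) is hypothetical, the rule that routes sung segments to "singer" shots is an assumption, and the retry-on-rejection loop is only one plausible reading of how a verifier could feed back into generation. The generators and the verifier are passed in as plain callables, so no real model API is implied.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One time-aligned span of the song (hypothetical schema)."""
    start: float   # seconds
    end: float     # seconds
    label: str     # e.g. "verse" or "chorus"
    lyrics: str    # time-aligned lyric text; empty for instrumental spans


@dataclass
class Shot:
    """One planned shot tied to a song segment (hypothetical schema)."""
    segment: Segment
    kind: str       # "story" or "singer"
    prompt: str
    clip: str = ""  # reference to the accepted clip, filled in later


def plan_shots(segments, character_bank):
    """Stand-in for the screenwriter and director agents."""
    shots = []
    for seg in segments:
        # Assumed routing rule: spans with lyrics become "singer" shots.
        kind = "singer" if seg.lyrics else "story"
        # Reusing the shared character profile keeps prompts consistent.
        prompt = f"{character_bank['lead']} | {seg.label} | {seg.lyrics}"
        shots.append(Shot(segment=seg, kind=kind, prompt=prompt))
    return shots


def generate_with_verification(shot, generators, verify, max_attempts=3):
    """Regenerate a clip until the verifier accepts it (assumed behaviour)."""
    for attempt in range(1, max_attempts + 1):
        clip = generators[shot.kind](shot, attempt)
        if verify(shot, clip):
            shot.clip = clip
            return shot
    raise RuntimeError(f"no accepted clip for segment at {shot.segment.start}s")


def make_music_video(segments, character_bank, generators, verify):
    """Plan shots for every segment, then generate and verify each one."""
    shots = plan_shots(segments, character_bank)
    return [generate_with_verification(s, generators, verify) for s in shots]
```

In this sketch, `generators` is a dictionary mapping "story" and "singer" to callables that take a shot and an attempt number, and `verify` is a callable that returns `True` when a clip is acceptable. Keeping them as parameters makes the orchestration testable with stubs before any real image or video model is attached.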