Long-Video Audio Synthesis with Multi-Agent Collaboration
March 13, 2025
Authors: Yehang Zhang, Xinli Xu, Xiaojie Xu, Li Liu, Yingcong Chen
cs.AI
Abstract
Video-to-audio synthesis, which generates synchronized audio for visual
content, critically enhances viewer immersion and narrative coherence in film
and interactive media. However, video-to-audio dubbing for long-form content
remains an unsolved challenge due to dynamic semantic shifts, temporal
misalignment, and the absence of dedicated datasets. While existing methods
excel in short videos, they falter in long scenarios (e.g., movies) due to
fragmented synthesis and inadequate cross-scene consistency. We propose
LVAS-Agent, a novel multi-agent framework that emulates professional dubbing
workflows through collaborative role specialization. Our approach decomposes
long-video synthesis into four steps: scene segmentation, script generation,
sound design, and audio synthesis. Central innovations include a
discussion-correction mechanism for scene/script refinement and a
generation-retrieval loop for temporal-semantic alignment. To enable systematic
evaluation, we introduce LVAS-Bench, the first benchmark with 207
professionally curated long videos spanning diverse scenarios. Experiments
demonstrate superior audio-visual alignment over baseline methods. Project
page: https://lvas-agent.github.io
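The four-step decomposition described in the abstract can be pictured as a simple agent pipeline. The sketch below is purely illustrative: all class and function names are assumptions, not the paper's actual API, and the "discussion-correction" and "generation-retrieval" steps are reduced to stubs that only show where each mechanism would plug in.

```python
# Hypothetical sketch of an LVAS-Agent-style pipeline; names are illustrative
# assumptions, not the paper's implementation.
from dataclasses import dataclass, field


@dataclass
class Scene:
    start: float                 # scene start time in seconds
    end: float                   # scene end time in seconds
    description: str             # coarse visual description of the scene
    script: str = ""             # dubbing script, filled in by step 2
    sound_events: list = field(default_factory=list)  # planned events, step 3


def segment_scenes(video_desc):
    """Step 1: split a long video into coherent scenes."""
    return [Scene(s, e, d) for s, e, d in video_desc]


def generate_script(scene):
    """Step 2: draft a dubbing script for one scene (stub)."""
    return f"Narration for: {scene.description}"


def discussion_correction(scene, rounds=2):
    """Agents iteratively critique and revise the draft script."""
    script = generate_script(scene)
    for _ in range(rounds):
        critique = f"align with {scene.start:.1f}-{scene.end:.1f}s"
        script = f"{script} [revised: {critique}]"
    return script


def design_sound(scene):
    """Step 3: plan sound events (Foley, ambience) for the scene."""
    return [f"ambience:{scene.description.split()[0].lower()}"]


def generation_retrieval_loop(event, library):
    """Step 4 helper: retrieve a matching audio asset, else fall back
    to generation, keeping temporal-semantic alignment per event."""
    return library.get(event, f"generated:{event}")


def synthesize(scenes, library):
    """Full pipeline: segmentation -> script -> sound design -> synthesis.
    Returns (timestamp, audio asset) pairs for the final track."""
    track = []
    for sc in scenes:
        sc.script = discussion_correction(sc)
        sc.sound_events = design_sound(sc)
        for ev in sc.sound_events:
            track.append((sc.start, generation_retrieval_loop(ev, library)))
    return track
```

A usage example with two scenes and a one-entry sound library: events found in the library are retrieved, unseen ones fall through to the generation branch, mirroring the generation-retrieval loop at a toy scale.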