

Long-Video Audio Synthesis with Multi-Agent Collaboration

March 13, 2025
Authors: Yehang Zhang, Xinli Xu, Xiaojie Xu, Li Liu, Yingcong Chen
cs.AI

Abstract

Video-to-audio synthesis, which generates synchronized audio for visual content, critically enhances viewer immersion and narrative coherence in film and interactive media. However, video-to-audio dubbing for long-form content remains an unsolved challenge due to dynamic semantic shifts, temporal misalignment, and the absence of dedicated datasets. While existing methods excel in short videos, they falter in long scenarios (e.g., movies) due to fragmented synthesis and inadequate cross-scene consistency. We propose LVAS-Agent, a novel multi-agent framework that emulates professional dubbing workflows through collaborative role specialization. Our approach decomposes long-video synthesis into four steps including scene segmentation, script generation, sound design and audio synthesis. Central innovations include a discussion-correction mechanism for scene/script refinement and a generation-retrieval loop for temporal-semantic alignment. To enable systematic evaluation, we introduce LVAS-Bench, the first benchmark with 207 professionally curated long videos spanning diverse scenarios. Experiments demonstrate superior audio-visual alignment over baseline methods. Project page: https://lvas-agent.github.io
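To make the described workflow concrete, the following is a minimal, hypothetical sketch of the four-stage decomposition (scene segmentation, script generation, sound design, audio synthesis) together with the discussion-correction mechanism and the generation-retrieval loop. All class, function, and parameter names here are illustrative assumptions for exposition; they are not the authors' actual LVAS-Agent API, and the placeholder bodies stand in for the real agents and audio models.

```python
# Hypothetical sketch of a long-video audio synthesis pipeline in the spirit of
# LVAS-Agent. Names and placeholder logic are assumptions, not the paper's code.
from dataclasses import dataclass


@dataclass
class Scene:
    start: float        # scene start time in seconds
    end: float          # scene end time in seconds
    script: str = ""    # sound-design script drafted for this scene
    audio: bytes = b""  # synthesized audio for this scene


def segment_scenes(video_path: str) -> list[Scene]:
    """Stage 1: split the long video into semantically coherent scenes (placeholder)."""
    return [Scene(start=0.0, end=30.0), Scene(start=30.0, end=75.0)]


def draft_script(scene: Scene) -> str:
    """Stage 2: draft a dubbing/sound-design script for one scene (placeholder)."""
    return f"ambient and foley cues for {scene.start:.0f}s-{scene.end:.0f}s"


def discussion_correction(scenes: list[Scene], rounds: int = 2) -> list[Scene]:
    """Discussion-correction: agents iteratively critique and refine scene
    boundaries and scripts before any audio is generated (trivial placeholder)."""
    for _ in range(rounds):
        for scene in scenes:
            if not scene.script:  # stand-in for real reviewer-agent feedback
                scene.script = draft_script(scene)
    return scenes


def synthesize(script: str) -> bytes:
    """Stage 4: call a text-to-audio / foley model for one scene (placeholder)."""
    return script.encode()


def aligned(audio: bytes, scene: Scene) -> bool:
    """Check temporal-semantic alignment of audio against the scene (placeholder)."""
    return len(audio) > 0


def generation_retrieval_loop(scene: Scene, max_tries: int = 3) -> Scene:
    """Generation-retrieval: synthesize, verify alignment, regenerate if needed."""
    for _ in range(max_tries):
        scene.audio = synthesize(scene.script)
        if aligned(scene.audio, scene):
            break
    return scene


def lvas_pipeline(video_path: str) -> list[Scene]:
    """End-to-end sketch: segment, script, refine via discussion, then synthesize."""
    scenes = segment_scenes(video_path)
    for scene in scenes:
        scene.script = draft_script(scene)
    scenes = discussion_correction(scenes)
    return [generation_retrieval_loop(s) for s in scenes]


if __name__ == "__main__":
    for scene in lvas_pipeline("movie.mp4"):
        print(scene.start, scene.end, scene.script)
```

The key design point illustrated here is that script refinement (discussion-correction) happens before any audio is produced, while alignment checking (generation-retrieval) wraps the synthesis step per scene, which is how the framework aims to keep cross-scene consistency over long content.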
