ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
June 26, 2025
Authors: Huadai Liu, Jialei Wang, Kaicheng Luo, Wen Wang, Qian Chen, Zhou Zhao, Wei Xue
cs.AI
Abstract
While end-to-end video-to-audio generation has greatly improved, producing
high-fidelity audio that authentically captures the nuances of visual content
remains challenging. As with the work of professionals in the creative
industries, such generation requires sophisticated reasoning about visual
dynamics, acoustic environments, and temporal relationships. We present
ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning
to enable stepwise, interactive audio generation and editing for videos. Our
approach decomposes the process into three complementary stages: foundational
foley generation that creates semantically coherent soundscapes, interactive
object-centric refinement through precise user interactions, and targeted
editing guided by natural language instructions. At each stage, a multimodal
large language model generates contextually aligned CoT reasoning that guides a
unified audio foundation model. Furthermore, we introduce AudioCoT, a
comprehensive dataset with structured reasoning annotations that establishes
connections between visual content, textual descriptions, and sound synthesis.
Experiments demonstrate that ThinkSound achieves state-of-the-art performance
in video-to-audio generation on both audio metrics and CoT metrics, and
excels on the out-of-distribution Movie Gen Audio benchmark. The demo page is
available at https://ThinkSound-Project.github.io.
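
The three-stage flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the names run_pipeline, StageResult, stub_reason, and stub_synthesize are hypothetical, the multimodal LLM and the audio foundation model are replaced by stub functions, and the assumption that the later stages are optional and receive the previous stage's audio is ours.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical interfaces: the abstract does not define any API.
# reason(stage, context) stands in for the multimodal LLM and returns CoT text.
# synthesize(cot, previous_audio) stands in for the unified audio model.
Reasoner = Callable[[str, str], str]
Synthesizer = Callable[[str, Optional[str]], str]


@dataclass
class StageResult:
    stage: str   # "foley", "refine", or "edit"
    cot: str     # reasoning text produced for this stage
    audio: str   # placeholder handle for the audio after this stage


def run_pipeline(video: str, reason: Reasoner, synthesize: Synthesizer,
                 region: Optional[str] = None,
                 instruction: Optional[str] = None) -> List[StageResult]:
    """Run the stages in order; stages 2 and 3 are skipped without user input."""
    # Stage 1 (foundational foley generation) always runs on the whole video.
    steps = [("foley", f"video={video}")]
    # Stage 2 (object-centric refinement) needs a user-selected region.
    if region is not None:
        steps.append(("refine", f"video={video}; region={region}"))
    # Stage 3 (targeted editing) needs a natural language instruction.
    if instruction is not None:
        steps.append(("edit", f"instruction={instruction}"))

    results: List[StageResult] = []
    audio: Optional[str] = None
    for stage, context in steps:
        cot = reason(stage, context)      # CoT reasoning for this stage
        audio = synthesize(cot, audio)    # audio guided by the CoT and prior audio
        results.append(StageResult(stage, cot, audio))
    return results


def stub_reason(stage: str, context: str) -> str:
    # Stub standing in for the multimodal LLM.
    return f"[{stage}] plan for {context}"


def stub_synthesize(cot: str, previous: Optional[str]) -> str:
    # Stub standing in for the audio model: it only bumps a version counter.
    version = 1 if previous is None else int(previous.split("_v")[1]) + 1
    return f"audio_v{version}"


if __name__ == "__main__":
    for r in run_pipeline("clip.mp4", stub_reason, stub_synthesize,
                          region="owl", instruction="add wind"):
        print(r.stage, r.audio)
```

Passing the two models in as callables keeps the sketch independent of any particular model library; a real system would replace the stubs with actual model calls.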