ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

Abstract

While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. As in the work of professionals in the creative industries, such generation requires sophisticated reasoning about aspects such as visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation on both audio metrics and CoT metrics, and that it excels on the out-of-distribution Movie Gen Audio benchmark. The demo page is available at https://ThinkSound-Project.github.io.
