ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

Abstract

While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. As in the work of professionals in the creative industries, such generation requires sophisticated reasoning about visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation on both audio metrics and CoT metrics, and that it excels on the out-of-distribution Movie Gen Audio benchmark. The demo page is available at https://ThinkSound-Project.github.io.