FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion

Abstract

High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily because they rely on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. The pipeline first employs specialized pretrained models to extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from the associated video). A large language model (LLM) then synthesizes these rich multimodal inputs to generate detailed, context-aware audio captions. The key contributions of this work are: (1) a scalable method for fine-grained audio caption generation; (2) FusionAudio, a new large-scale dataset comprising 1.2 million such detailed captions together with 6 million QA pairs; and (3) enhanced audio models developed using FusionAudio, specifically a CLAP-based audio encoder with superior audio-text alignment and instruction-following capability. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments. Code and data are available at https://github.com/satsuki2486441738/FusionAudio.
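
The two-stage pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `ContextCues` schema, the function names, the prompt wording, and the idea that each extractor returns plain text are all assumptions made for the example. The specialized pretrained models and the LLM are passed in as ordinary callables so that no particular library or API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ContextCues:
    """Stage-1 outputs: one text description per modality (hypothetical schema)."""
    speech: str = ""
    music: str = ""
    sound: str = ""
    visual: str = ""


def extract_cues(clip_path: str,
                 extractors: Dict[str, Callable[[str], str]]) -> ContextCues:
    """Stage 1: run each specialized extractor on the clip and collect its text.

    Keys of `extractors` must match the ContextCues field names.
    """
    return ContextCues(**{name: fn(clip_path) for name, fn in extractors.items()})


def build_fusion_prompt(cues: ContextCues) -> str:
    """Merge the non-empty cues into a single instruction for the LLM.

    The prompt wording here is an assumption, not taken from the paper.
    """
    sections = [
        ("Speech", cues.speech),
        ("Music", cues.music),
        ("General sounds", cues.sound),
        ("Visual context", cues.visual),
    ]
    lines = [f"{label}: {text}" for label, text in sections if text]
    header = "Write one detailed, context-aware audio caption from these cues:"
    return header + "\n" + "\n".join(lines)


def caption_clip(clip_path: str,
                 extractors: Dict[str, Callable[[str], str]],
                 llm: Callable[[str], str]) -> str:
    """Stage 2: let the LLM synthesize the fused cues into one caption."""
    return llm(build_fusion_prompt(extract_cues(clip_path, extractors)))
```

Keeping the extractors and the LLM as injected callables means any modality-specific model can be swapped in, and the fusion step can be exercised with simple stubs before real models are connected.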