FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion

Abstract

High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily because they rely on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. In the first stage, specialized pretrained models extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from the associated video). In the second stage, a large language model (LLM) synthesizes these rich multimodal inputs to generate detailed, context-aware audio captions. The key contributions of this work are: (1) a scalable method for fine-grained audio caption generation; (2) FusionAudio, a new large-scale dataset comprising 1.2 million such detailed captions together with 6 million QA pairs; and (3) enhanced audio models developed using FusionAudio, specifically a CLAP-based audio encoder with superior audio-text alignment and instruction following. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments. Code and data are available at https://github.com/satsuki2486441738/FusionAudio.
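
The two-stage structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `extract_cues`, `build_prompt`, and `caption_clip`, the prompt wording, and the stub extractors and LLM are all hypothetical, and real extractors would wrap pretrained speech, music, sound-event, and video models.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical interfaces (assumptions, not from the paper):
# an extractor maps a clip path to a free-text cue,
# and an LLM maps a prompt string to a caption string.
Extractor = Callable[[str], str]
LLM = Callable[[str], str]


@dataclass
class ContextCues:
    """Stage-1 output: one free-text cue per modality."""
    cues: Dict[str, str]


def extract_cues(clip_path: str, extractors: Dict[str, Extractor]) -> ContextCues:
    """Stage 1: run each specialized model and keep only non-empty cues."""
    cues = {}
    for name, extractor in extractors.items():
        text = extractor(clip_path).strip()
        if text:
            cues[name] = text
    return ContextCues(cues)


def build_prompt(context: ContextCues) -> str:
    """Format the collected cues as a single instruction for the LLM."""
    lines = [f"- {name}: {text}" for name, text in context.cues.items()]
    header = ("Write one detailed, context-aware caption of the audio, "
              "using the cues below.")
    return header + "\n" + "\n".join(lines)


def caption_clip(clip_path: str, extractors: Dict[str, Extractor], llm: LLM) -> str:
    """Stage 2: let the LLM fuse the multimodal cues into one caption."""
    return llm(build_prompt(extract_cues(clip_path, extractors))).strip()


if __name__ == "__main__":
    # Stub models standing in for real pretrained extractors and a real LLM.
    stub_extractors = {
        "speech": lambda path: "a man greets a crowd",
        "music": lambda path: "",  # no music detected, so this cue is dropped
        "sound": lambda path: "applause and cheering",
        "visual": lambda path: "an outdoor stage at night",
    }
    stub_llm = lambda prompt: "A man greets a cheering crowd from an outdoor stage."
    print(caption_clip("clip.wav", stub_extractors, stub_llm))
```

Passing the extractors and the LLM as plain callables keeps the sketch independent of any particular model library; swapping in real models only requires wrapping them in functions with these signatures.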