

MiDashengLM: Efficient Audio Understanding with General Audio Captions

August 6, 2025
Authors: Heinrich Dinkel, Gang Li, Jizhong Liu, Jian Luan, Yadong Niu, Xingwei Sun, Tianzi Wang, Qiyang Xiao, Junbo Zhang, Jiahao Zhou
cs.AI

Abstract

Current approaches for large audio language models (LALMs) often rely on closed data sources or proprietary models, limiting their generalization and accessibility. This paper introduces MiDashengLM, a novel open audio-language model designed for efficient and comprehensive audio understanding through general audio captions, trained on our novel ACAVCaps dataset. MiDashengLM relies exclusively on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring full transparency and reproducibility. At its core, MiDashengLM integrates Dasheng, an open-source audio encoder specifically engineered to process diverse auditory information effectively. Unlike previous work that primarily focuses on audio-text alignment based on Automatic Speech Recognition (ASR), our strategy centers on general audio captions, fusing speech, sound, and music information into a single textual representation and thereby enabling a holistic description of complex audio scenes. Finally, MiDashengLM achieves up to a 4x speedup in time-to-first-token (TTFT) and up to 20x higher throughput than comparable models. Checkpoints are available online at https://huggingface.co/mispeech/midashenglm-7b and https://github.com/xiaomi-research/dasheng-lm.
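
Since the checkpoints are published on the Hugging Face Hub, a minimal loading sketch might look like the following. This assumes the repository ships custom modeling code usable via AutoModelForCausalLM/AutoProcessor with trust_remote_code=True; the exact classes, input format, and generation arguments are assumptions, so consult the model card at https://huggingface.co/mispeech/midashenglm-7b for the authoritative usage.

```python
# Sketch: load the released MiDashengLM-7B checkpoint with Hugging Face transformers.
# Assumption: the repo provides custom modeling/processing code, so trust_remote_code
# is required; verify the real API against the model card before relying on this.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "mispeech/midashenglm-7b"

# The processor is assumed to handle both the audio input and the text prompt.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Downstream use (captioning or audio question answering) would pass audio plus a
# text prompt through the processor and call model.generate(); the exact input
# schema is defined by the repository's remote code, not shown here.
```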