MiDashengLM: Efficient Audio Understanding with General Audio Captions

August 6, 2025
Authors: Heinrich Dinkel, Gang Li, Jizhong Liu, Jian Luan, Yadong Niu, Xingwei Sun, Tianzi Wang, Qiyang Xiao, Junbo Zhang, Jiahao Zhou
cs.AI

Abstract

Current approaches to large audio language models (LALMs) often rely on closed data sources or proprietary models, limiting their generalization and accessibility. This paper introduces MiDashengLM, a novel open audio-language model designed for efficient and comprehensive audio understanding, trained on general audio captions from our novel ACAVCaps dataset. MiDashengLM relies exclusively on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring full transparency and reproducibility. At its core, MiDashengLM integrates Dasheng, an open-source audio encoder specifically engineered to process diverse auditory information effectively. Unlike previous works that align audio and text primarily through Automatic Speech Recognition (ASR), our strategy centers on general audio captions, fusing speech, sound, and music information into a single caption and thereby providing a holistic textual representation of complex audio scenes. Lastly, MiDashengLM achieves up to a 4x speedup in time-to-first-token (TTFT) and up to 20x higher throughput than comparable models. Checkpoints are available online at https://huggingface.co/mispeech/midashenglm-7b and https://github.com/xiaomi-research/dasheng-lm.
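
Since the abstract points to a released Hugging Face checkpoint, a loading sketch may be useful. This is a minimal sketch only: the repository name comes from the abstract, but the processor/model classes, the trust_remote_code usage, and the input format are assumptions on my part; consult the model card at https://huggingface.co/mispeech/midashenglm-7b for the authoritative API.

```python
# Minimal sketch: loading the released MiDashengLM checkpoint with the
# Hugging Face transformers library. Only the repository id comes from
# the abstract; class choices and call signatures below are assumptions.
import soundfile as sf
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "mispeech/midashenglm-7b"  # checkpoint named in the abstract
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumption: half precision for a 7B model
    trust_remote_code=True,
)

# Hypothetical captioning call: "audio.wav" and the prompt text are
# placeholders, not part of the documented interface.
waveform, sampling_rate = sf.read("audio.wav")
inputs = processor(
    text="Describe the audio.",
    audio=waveform,
    sampling_rate=sampling_rate,
    return_tensors="pt",
)
generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```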