MiDashengLM: Efficient Audio Understanding with General Audio Captions

Abstract

Current approaches for large audio language models (LALMs) often rely on closed data sources or proprietary models, which limits their generalization and accessibility. This paper introduces MiDashengLM, a novel open audio-language model designed for efficient and comprehensive audio understanding, trained with general audio captions from our novel ACAVCaps training dataset. MiDashengLM relies exclusively on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring full transparency and reproducibility. At its core, MiDashengLM integrates Dasheng, an open-source audio encoder specifically engineered to process diverse auditory information effectively. Unlike previous works, which primarily focus on Automatic Speech Recognition (ASR) based audio-text alignment, our strategy centers on general audio captions that fuse speech, sound, and music information into a single holistic textual representation of complex audio scenes. Lastly, MiDashengLM achieves up to a 4x speedup in time-to-first-token (TTFT) and up to 20x higher throughput than comparable models. Checkpoints are available online at https://huggingface.co/mispeech/midashenglm-7b and https://github.com/xiaomi-research/dasheng-lm.
