MiDashengLM: Efficient Audio Understanding with General Audio Captions

Abstract

Current approaches to large audio language models (LALMs) often rely on closed data sources or proprietary models, which limits their generalization and accessibility. This paper introduces MiDashengLM, a novel open audio-language model designed for efficient and comprehensive audio understanding through general audio captions, trained on our novel ACAVCaps dataset. MiDashengLM relies exclusively on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring full transparency and reproducibility. At its core, MiDashengLM integrates Dasheng, an open-source audio encoder specifically engineered to process diverse auditory information effectively. Unlike previous works, which primarily focus on audio-text alignment based on Automatic Speech Recognition (ASR), our strategy centers on general audio captions that fuse speech, sound, and music information into a single textual representation, giving a holistic description of complex audio scenes. Lastly, MiDashengLM achieves a speedup of up to 4x in time-to-first-token (TTFT) and up to 20x higher throughput than comparable models. Checkpoints are available online at https://huggingface.co/mispeech/midashenglm-7b and https://github.com/xiaomi-research/dasheng-lm.

