MiDashengLM: Efficient Audio Understanding with General Audio Captions

Abstract

Current approaches to large audio language models (LALMs) often rely on closed data sources or proprietary models, which limits their generalization and accessibility. This paper introduces MiDashengLM, a novel open audio-language model designed for efficient and comprehensive audio understanding, trained on general audio captions from our novel ACAVCaps training dataset. MiDashengLM relies exclusively on publicly available pretraining and supervised fine-tuning (SFT) datasets, ensuring full transparency and reproducibility. At its core, MiDashengLM integrates Dasheng, an open-source audio encoder specifically engineered to process diverse auditory information effectively. Unlike previous works, which focus primarily on audio-text alignment based on automatic speech recognition (ASR), our strategy centers on general audio captions that fuse speech, sound, and music information into a single textual representation, giving a holistic description of complex audio scenes. Lastly, MiDashengLM achieves up to a 4x speedup in time-to-first-token (TTFT) and up to 20x higher throughput than comparable models. Checkpoints are available online at https://huggingface.co/mispeech/midashenglm-7b and https://github.com/xiaomi-research/dasheng-lm.
