Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

Abstract

We present Speech-MASSIVE, a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart for a portion of the MASSIVE textual corpus. Speech-MASSIVE covers 12 languages from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. Our extension is prompted by the scarcity of massively multilingual SLU datasets and by the growing need for versatile speech datasets to assess foundation models (e.g., LLMs and speech encoders) across languages and tasks. We provide a multimodal, multitask, multilingual dataset and report SLU baselines using both cascaded and end-to-end architectures in several training scenarios (zero-shot, few-shot, and full fine-tuning). Furthermore, we demonstrate the suitability of Speech-MASSIVE for benchmarking other tasks such as speech transcription, language identification, and speech translation. The dataset, models, and code are publicly available at: https://github.com/hlt-mt/Speech-MASSIVE

