Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond
August 7, 2024
Authors: Beomseok Lee, Ioan Calapodescu, Marco Gaido, Matteo Negri, Laurent Besacier
cs.AI
Abstract
We present Speech-MASSIVE, a multilingual Spoken Language Understanding (SLU)
dataset comprising the speech counterpart for a portion of the MASSIVE textual
corpus. Speech-MASSIVE covers 12 languages from different families and inherits
from MASSIVE the annotations for the intent prediction and slot-filling tasks.
Our extension is prompted by the scarcity of massively multilingual SLU
datasets and the growing need for versatile speech datasets to assess
foundation models (LLMs, speech encoders) across languages and tasks. We
provide a multimodal, multitask, multilingual dataset and report SLU baselines
using both cascaded and end-to-end architectures in various training scenarios
(zero-shot, few-shot, and full fine-tune). Furthermore, we demonstrate the
suitability of Speech-MASSIVE for benchmarking other tasks such as speech
transcription, language identification, and speech translation. The dataset,
models, and code are publicly available at:
https://github.com/hlt-mt/Speech-MASSIVE