
LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models

January 4, 2026
Authors: Zhiyuan Zhao, Lijian Lin, Ye Zhu, Kai Xie, Yunfei Liu, Yu Li
cs.AI

Abstract

We present the LEMAS-Dataset, which, to our knowledge, is currently the largest open-source multilingual speech corpus with word-level timestamps. Covering over 150,000 hours across 10 major languages, LEMAS-Dataset is constructed via an efficient data processing pipeline that ensures high-quality data and annotations. To validate the effectiveness of LEMAS-Dataset across diverse generative paradigms, we train two benchmark models with distinct architectures and task specializations on this dataset. LEMAS-TTS, built upon a non-autoregressive flow-matching framework, leverages the dataset's massive scale and linguistic diversity to achieve robust zero-shot multilingual synthesis. Our proposed accent-adversarial training and CTC loss mitigate cross-lingual accent issues, enhancing synthesis stability. Complementarily, LEMAS-Edit employs an autoregressive decoder-only architecture that formulates speech editing as a masked token infilling task. By exploiting precise word-level alignments to construct training masks and adopting adaptive decoding strategies, it achieves seamless, smooth-boundary speech editing with natural transitions. Experimental results demonstrate that models trained on LEMAS-Dataset deliver high-quality synthesis and editing performance, confirming the dataset's quality. We envision that this richly timestamp-annotated, fine-grained multilingual corpus will drive future advances in prompt-based speech generation systems.
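
To illustrate how word-level timestamps could be turned into a training mask for token infilling, here is a minimal sketch. It is not the LEMAS-Edit implementation: the `Word` record, the function name `word_span_to_token_mask`, and the rate of 8 audio tokens per second are all assumptions chosen only to keep the example small.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical word-level alignment record (times in seconds)."""
    text: str
    start: float
    end: float


def word_span_to_token_mask(
    words: List[Word],
    first: int,
    last: int,
    num_tokens: int,
    tokens_per_second: float,
) -> List[bool]:
    """Mark every audio token that overlaps words[first..last] (inclusive).

    True means "masked, to be infilled"; False means "kept as context".
    The token rate is an assumed parameter, not a value from the paper.
    """
    if not 0 <= first <= last < len(words):
        raise ValueError("word span out of range")
    # Round outward so a token partially covered by the span is masked too.
    start_tok = max(0, math.floor(words[first].start * tokens_per_second))
    end_tok = min(num_tokens, math.ceil(words[last].end * tokens_per_second))
    return [start_tok <= i < end_tok for i in range(num_tokens)]


# Toy utterance: one second of audio at an assumed 8 tokens per second.
words = [
    Word("hello", 0.0, 0.25),
    Word("brave", 0.25, 0.5),
    Word("world", 0.5, 1.0),
]
# Mask only the middle word, "brave".
mask = word_span_to_token_mask(words, first=1, last=1,
                               num_tokens=8, tokens_per_second=8)
print(mask)  # [False, False, True, True, False, False, False, False]
```

Because the mask boundaries come from word boundaries rather than random spans, the masked region always corresponds to whole words, which is the property that word-level alignments make possible.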