MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
September 9, 2023
Authors: Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat
cs.AI
Abstract
We introduce MADLAD-400, a manually audited, general domain 3T token
monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss
the limitations revealed by self-auditing MADLAD-400, and the role data
auditing played in the dataset creation process. We then train and release a
10.7B-parameter multilingual machine translation model on 250 billion tokens
covering over 450 languages using publicly available data, and find that it is
competitive with models that are significantly larger, and report the results
on different domains. In addition, we train an 8B-parameter language model, and
assess the results on few-shot translation. We make the baseline models
available to the research community.