MADLAD-400: A Multilingual And Document-Level Large Audited Dataset

Abstract

We introduce MADLAD-400, a manually audited, general-domain, 3T-token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations revealed by self-auditing MADLAD-400 and the role data auditing played in the dataset creation process. We then train and release a 10.7B-parameter multilingual machine translation model on 250 billion tokens covering over 450 languages using publicly available data, find that it is competitive with significantly larger models, and report results on different domains. In addition, we train an 8B-parameter language model and assess its performance on few-shot translation. We make the baseline models available to the research community.