MADLAD-400: A Multilingual And Document-Level Large Audited Dataset

Abstract

We introduce MADLAD-400, a manually audited, general-domain, 3T-token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations revealed by self-auditing MADLAD-400 and the role data auditing played in the dataset creation process. We then train and release a 10.7B-parameter multilingual machine translation model on 250 billion tokens covering over 450 languages using publicly available data. We find that it is competitive with significantly larger models, and we report results on different domains. In addition, we train an 8B-parameter language model and assess its results on few-shot translation. We make the baseline models available to the research community.

