A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism

January 11, 2024
Authors: Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico
cs.AI

Abstract

We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT. Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web.