A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism

January 11, 2024
Authors: Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico
cs.AI

Abstract

We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine-generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content that is translated into many languages, consistent with low-quality English content being translated en masse into many lower resource languages via MT. Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web.