Model-Based Quality Assessment of Large-Scale Multilingual Parallel Data

Abstract

Large-scale multilingual bitext often suffers from two distinct problems: non-parallel sentence pairs and low-quality translations. We decompose model-based assessment of such data into two independent components: parallelism assessment with multilingual embeddings, and reference-free quality estimation (QE). For parallelism, we benchmark four embedding models on FLORES-200 and BOUQuET retrieval tasks, covering 6,654 source-target directions in our target language-pair inventory. For QE, we evaluate nine reference-free evaluators on professional FLORES-200 translations across 41,412 ordered source-target directions. The results show that no model is reliable across all translation directions. Naive QE ensembles dilute the signal from strong models, while documented target-language coverage is strongly associated with higher QE scores. Overall, these findings suggest that multilingual parallel-data assessment is best treated as a direction-aware routing and calibration problem, since no single universal metric can be expected to suffice across all languages.