Model-Based Quality Assessment for Large-Scale Multilingual Parallel Data

Abstract

Large-scale multilingual bitext often suffers from two distinct problems: non-parallel sentence pairs and low-quality translations. We decompose model-based assessment of such data into two independent components: parallelism assessment with multilingual embeddings and reference-free quality estimation (QE). For parallelism, we benchmark four embedding models on FLORES-200 and BOUQuET retrieval tasks, covering 6,654 source–target directions in our target language-pair inventory. For QE, we evaluate nine reference-free evaluators on professional FLORES-200 translations across 41,412 ordered source–target directions. The results show that no model is universally reliable across translation directions. Naive QE ensembles dilute the signal from strong models, while documented target-language coverage is strongly associated with higher QE scores. Overall, these findings suggest that multilingual parallel-data assessment is best treated as a direction-aware routing and calibration problem, since no single universal metric can be expected to suffice across all languages.
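
To make the retrieval-based parallelism check concrete, the following is a minimal sketch of how a single source-target direction could be scored. It assumes top-1 nearest-neighbour retrieval under cosine similarity; the abstract does not state the exact retrieval metric. The function names `cosine` and `retrieval_accuracy` and the two-dimensional toy vectors are hypothetical illustrations, not the embedding models or data used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_accuracy(src_vecs, tgt_vecs):
    """Fraction of source sentences whose most similar target is the aligned one.

    Assumes src_vecs[i] and tgt_vecs[i] embed the i-th aligned sentence pair.
    """
    hits = 0
    for i, s in enumerate(src_vecs):
        # Index of the target with the highest cosine similarity to this source.
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(s, tgt_vecs[j]))
        hits += best == i
    return hits / len(src_vecs)


# Toy embeddings (made-up values): the third target is deliberately unrelated
# to its source, mimicking a non-parallel pair.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9], [1.0, -1.0]]

accuracy = retrieval_accuracy(src, tgt)
print(f"top-1 retrieval accuracy: {accuracy:.3f}")
```

Running such a check separately for each direction, rather than pooling all languages, is what would let per-direction differences between embedding models become visible.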