Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement

Abstract

Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
