Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement
May 29, 2025
Authors: Gabriele Sarti, Vilém Zouhar, Malvina Nissim, Arianna Bisazza
cs.AI
Abstract
Word-level quality estimation (WQE) aims to automatically identify
fine-grained error spans in machine-translated outputs and has found many uses,
including assisting translators during post-editing. Modern WQE techniques are
often expensive, involving prompting of large language models or ad-hoc
training on large amounts of human-labeled data. In this work, we investigate
efficient alternatives exploiting recent advances in language model
interpretability and uncertainty quantification to identify translation errors
from the inner workings of translation models. In our evaluation spanning 14
metrics across 12 translation directions, we quantify the impact of human label
variation on metric performance by using multiple sets of human labels. Our
results highlight the untapped potential of unsupervised metrics, the
shortcomings of supervised methods when faced with label uncertainty, and the
brittleness of single-annotator evaluation practices.
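To make the unsupervised idea concrete, the following is a minimal sketch of one such signal: scoring a given translation with the translation model itself and flagging target tokens that receive low probability, a simple surprisal-style proxy for word-level error spans. This is not the paper's exact metric set; the model name `Helsinki-NLP/opus-mt-en-de`, the threshold, and the helper `low_confidence_spans` are illustrative assumptions.

```python
# Sketch of an unsupervised WQE baseline: low-probability target tokens
# as candidate error spans. Model name and threshold are assumptions.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-en-de"  # assumed example MT model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
model.eval()

def low_confidence_spans(source: str, translation: str, threshold: float = 0.3):
    """Return (token, probability) pairs whose probability under the model
    falls below `threshold`, marking them as candidate error spans."""
    batch = tokenizer(source, text_target=translation, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model score the given translation under
        # teacher forcing; logits[:, i] predicts labels[:, i].
        logits = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"],
        ).logits
    probs = logits.softmax(dim=-1)[0]  # (tgt_len, vocab_size)
    tok_probs = probs.gather(-1, batch["labels"][0].unsqueeze(-1)).squeeze(-1)
    tokens = tokenizer.convert_ids_to_tokens(batch["labels"][0].tolist())
    return [
        (tok, p.item())
        for tok, p in zip(tokens, tok_probs)
        if tok not in tokenizer.all_special_tokens and p.item() < threshold
    ]

# Example: tokens of a hypothesis translation the model is least sure about.
print(low_confidence_spans("The cat sat on the mat.",
                           "Die Katze saß auf der Matte."))
```

Thresholding raw token probabilities is only one possible design choice; the paper's broader point is that such model-internal signals (uncertainty and interpretability measures) can be compared against supervised WQE systems across multiple sets of human labels.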