Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement
May 29, 2025
Authors: Gabriele Sarti, Vilém Zouhar, Malvina Nissim, Arianna Bisazza
cs.AI
Abstract
Word-level quality estimation (WQE) aims to automatically identify
fine-grained error spans in machine-translated outputs and has found many uses,
including assisting translators during post-editing. Modern WQE techniques are
often expensive, involving prompting of large language models or ad-hoc
training on large amounts of human-labeled data. In this work, we investigate
efficient alternatives exploiting recent advances in language model
interpretability and uncertainty quantification to identify translation errors
from the inner workings of translation models. In our evaluation spanning 14
metrics across 12 translation directions, we quantify the impact of human label
variation on metric performance by using multiple sets of human labels. Our
results highlight the untapped potential of unsupervised metrics, the
shortcomings of supervised methods when faced with label uncertainty, and the
brittleness of single-annotator evaluation practices.
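The paper's own metrics are not reproduced here, but the sketch below illustrates one simple member of the family of unsupervised signals the abstract refers to: scoring each token of a machine-translated hypothesis by its surprisal under the translation model itself, and flagging high-surprisal tokens as candidate error spans. This is a minimal sketch, not the authors' method; the model checkpoint, the surprisal threshold, and the `flag_uncertain_tokens` helper are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumption: any seq2seq MT checkpoint works; this one is illustrative.
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()

def flag_uncertain_tokens(source: str, hypothesis: str, threshold: float = 3.0):
    """Score each target token by surprisal (-log p, in nats) under the
    MT model and flag high-surprisal tokens as candidate error spans.
    The threshold is an arbitrary illustrative cutoff, not a tuned value."""
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(text_target=hypothesis, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model teacher-force the hypothesis, so
        # logits[:, t] is the model's distribution over labels[:, t].
        logits = model(**inputs, labels=labels).logits  # (1, tgt_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)[0]
    tokens = tokenizer.convert_ids_to_tokens(labels[0])
    return [(tok, -lp.item(), -lp.item() > threshold)
            for tok, lp in zip(tokens, token_lp)]

# Usage: tokens marked "ERROR?" are the unsupervised error-span candidates.
for tok, surprisal, flagged in flag_uncertain_tokens(
        "The cat sat on the mat.", "Die Katze saß auf der Matte."):
    print(f"{tok:>12s}  {surprisal:5.2f}  {'ERROR?' if flagged else ''}")
```

Surprisal is only one such signal; the abstract's mention of interpretability and uncertainty quantification covers a broader set of quantities read off the model's inner workings, evaluated in the paper across 14 metrics and 12 translation directions.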