The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation
August 14, 2023
Authors: Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, Orhan Firat
cs.AI
Abstract
Automatic evaluation of machine translation (MT) is a critical tool driving
the rapid iterative development of MT systems. While considerable progress has
been made on estimating a single scalar quality score, current metrics lack the
informativeness of more detailed schemes that annotate individual errors, such
as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap
by proposing AutoMQM, a prompting technique which leverages the reasoning and
in-context learning capabilities of large language models (LLMs) and asks them
to identify and categorize errors in translations. We start by evaluating
recent LLMs, such as PaLM and PaLM-2, through simple score prediction
prompting, and we study the impact of labeled data through in-context learning
and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that
it improves performance compared to just prompting for scores (with
particularly large gains for larger models) while providing interpretability
through error spans that align with human annotations.
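
The sketch below illustrates what an AutoMQM-style evaluation flow might look like in practice: prompt an LLM to list error spans with categories and severities, parse its response, and convert the errors into a single MQM-style score. The prompt wording, the severity weights, and the `call_llm` helper are illustrative assumptions for this sketch, not the paper's exact prompts or scoring setup.

```python
# Illustrative sketch of an AutoMQM-style evaluation flow.
# The prompt text, severity weights, and `call_llm` are hypothetical stand-ins;
# the paper's exact prompts and scoring details may differ.

AUTOMQM_PROMPT = """You are an expert translation evaluator.
Source ({src_lang}): {source}
Translation ({tgt_lang}): {translation}

List each error in the translation, one per line, in the format:
<error span> -- <category> -- <severity (major/minor)>
If there are no errors, write "No errors"."""

# MQM-style severity weights (major errors weighted more heavily than minor);
# the specific values here are assumed for illustration.
SEVERITY_WEIGHTS = {"major": 5, "minor": 1}


def parse_errors(response: str):
    """Parse 'span -- category -- severity' lines from the model's response."""
    errors = []
    for line in response.splitlines():
        parts = [p.strip() for p in line.split("--")]
        if len(parts) == 3 and parts[2].lower() in SEVERITY_WEIGHTS:
            errors.append({"span": parts[0],
                           "category": parts[1],
                           "severity": parts[2].lower()})
    return errors


def mqm_score(errors):
    """Aggregate identified errors into a single (negative) MQM-style score."""
    return -sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)


def automqm_evaluate(call_llm, source, translation, src_lang="en", tgt_lang="de"):
    """Prompt an LLM for error annotations and return (errors, score)."""
    prompt = AUTOMQM_PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang,
                                   source=source, translation=translation)
    response = call_llm(prompt)  # e.g. a PaLM-2 text-completion call
    errors = parse_errors(response)
    return errors, mqm_score(errors)
```

Because the output is a list of error spans rather than a bare number, the same structure that yields the score also provides the interpretability the abstract describes: each penalty can be traced back to a specific span, category, and severity.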