The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation
August 14, 2023
Authors: Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, Orhan Firat
cs.AI
Abstract
Automatic evaluation of machine translation (MT) is a critical tool driving
the rapid iterative development of MT systems. While considerable progress has
been made on estimating a single scalar quality score, current metrics lack the
informativeness of more detailed schemes that annotate individual errors, such
as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap
by proposing AutoMQM, a prompting technique which leverages the reasoning and
in-context learning capabilities of large language models (LLMs) and asks them
to identify and categorize errors in translations. We start by evaluating
recent LLMs, such as PaLM and PaLM-2, through simple score prediction
prompting, and we study the impact of labeled data through in-context learning
and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that
it improves performance compared to just prompting for scores (with
particularly large gains for larger models) while providing interpretability
through error spans that align with human annotations.
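
The sketch below illustrates what an AutoMQM-style evaluation flow might look like in practice: prompt an LLM to list error spans with categories and severities, parse its response, and convert the errors into a single MQM-style score. The prompt wording, the severity weights, and the `call_llm` helper are illustrative assumptions for this sketch, not the paper's exact prompts or scoring setup.

```python
# Illustrative sketch of an AutoMQM-style evaluation flow.
# The prompt text, severity weights, and `call_llm` are hypothetical stand-ins;
# the paper's exact prompts and scoring details may differ.

AUTOMQM_PROMPT = """You are an expert translation evaluator.
Source ({src_lang}): {source}
Translation ({tgt_lang}): {translation}

List each error in the translation, one per line, in the format:
<error span> -- <category> -- <severity (major/minor)>
If there are no errors, write "No errors"."""

# MQM-style severity weights (major errors weighted more heavily than minor);
# the specific values here are assumed for illustration.
SEVERITY_WEIGHTS = {"major": 5, "minor": 1}


def parse_errors(response: str):
    """Parse 'span -- category -- severity' lines from the model's response."""
    errors = []
    for line in response.splitlines():
        parts = [p.strip() for p in line.split("--")]
        if len(parts) == 3 and parts[2].lower() in SEVERITY_WEIGHTS:
            errors.append({"span": parts[0],
                           "category": parts[1],
                           "severity": parts[2].lower()})
    return errors


def mqm_score(errors):
    """Aggregate identified errors into a single (negative) MQM-style score."""
    return -sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)


def automqm_evaluate(call_llm, source, translation, src_lang="en", tgt_lang="de"):
    """Prompt an LLM for error annotations and return (errors, score)."""
    prompt = AUTOMQM_PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang,
                                   source=source, translation=translation)
    response = call_llm(prompt)  # e.g. a PaLM-2 text-completion call
    errors = parse_errors(response)
    return errors, mqm_score(errors)
```

Because the output is a list of error spans rather than a bare number, the same structure that yields the score also provides the interpretability the abstract describes: each penalty can be traced back to a specific span, category, and severity.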