The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation

Abstract

Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on estimating a single scalar quality score, current metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap by proposing AutoMQM, a prompting technique that leverages the reasoning and in-context learning capabilities of large language models (LLMs) and asks them to identify and categorize errors in translations. We start by evaluating recent LLMs, such as PaLM and PaLM-2, through simple score prediction prompting, and we study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to prompting for scores alone (with particularly large gains for larger models) while providing interpretability through error spans that align with human annotations.
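To make the idea of error-based prompting concrete, the following is a minimal sketch of how such a pipeline might be wired together: build a few-shot prompt that asks a model to list errors, parse the listed errors, and derive a score from them. The prompt wording, the `span | category | severity` output format, the function names, and the severity weights are all assumptions made for illustration; none of them are taken from the paper, and no actual model call is shown.

```python
from dataclasses import dataclass

# Hypothetical severity weights (an assumption for illustration,
# not values stated in the abstract).
SEVERITY_WEIGHTS = {"major": 5.0, "minor": 1.0}


@dataclass
class ErrorSpan:
    span: str       # the erroneous text in the translation
    category: str   # e.g. "accuracy/mistranslation"
    severity: str   # e.g. "major" or "minor"


def format_errors(errors):
    """Render annotated errors in the (assumed) one-per-line format."""
    if not errors:
        return "no-error"
    return "\n".join(f"{e.span} | {e.category} | {e.severity}" for e in errors)


def build_prompt(source, translation, examples):
    """Assemble a few-shot prompt from (source, translation, errors) examples."""
    parts = [
        "Identify the errors in each translation. List one error per line "
        "as: span | category | severity. Write 'no-error' if there are none."
    ]
    for ex_source, ex_translation, ex_errors in examples:
        parts.append(
            f"Source: {ex_source}\nTranslation: {ex_translation}\n"
            f"Errors:\n{format_errors(ex_errors)}"
        )
    # The query item ends with an open "Errors:" slot for the model to fill.
    parts.append(f"Source: {source}\nTranslation: {translation}\nErrors:")
    return "\n\n".join(parts)


def parse_errors(response):
    """Parse a model response; lines that are not 3-field records are skipped."""
    errors = []
    for line in response.strip().splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) == 3:
            errors.append(ErrorSpan(*fields))
    return errors


def score(errors):
    """Negative weighted error count: 0 is best, lower is worse."""
    return -sum(SEVERITY_WEIGHTS.get(e.severity.lower(), 0.0) for e in errors)


if __name__ == "__main__":
    # A hand-written stand-in for a model response.
    response = "das Haus | accuracy/mistranslation | major\nthe | fluency/grammar | minor"
    errors = parse_errors(response)
    print(score(errors))  # -6.0
```

Deriving the score from the listed errors, rather than asking for a number directly, is what lets the same output serve both as a quality estimate and as an interpretable list of error spans.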
