The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation

Abstract

Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on estimating a single scalar quality score, current metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap by proposing AutoMQM, a prompting technique which leverages the reasoning and in-context learning capabilities of large language models (LLMs) and asks them to identify and categorize errors in translations. We start by evaluating recent LLMs, such as PaLM and PaLM-2, through simple score prediction prompting, and we study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores (with particularly large gains for larger models) while providing interpretability through error spans that align with human annotations.
