Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement

Abstract

Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
