QE4PE: Word-level Quality Estimation for Human Post-Editing

Abstract

Word-level quality estimation (QE) detects erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. Our QE4PE study investigates the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated by behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors' speed are critical factors in determining highlights' effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.
