QE4PE: Word-level Quality Estimation for Human Post-Editing
March 4, 2025
Authors: Gabriele Sarti, Vilém Zouhar, Grzegorz Chrupała, Ana Guerberof-Arenas, Malvina Nissim, Arianna Bisazza
cs.AI
Abstract
Word-level quality estimation (QE) detects erroneous spans in machine
translations, which can direct and facilitate human post-editing. While the
accuracy of word-level QE systems has been assessed extensively, their
usability and downstream influence on the speed, quality and editing choices of
human post-editing remain understudied. Our QE4PE study investigates the impact
of word-level QE on machine translation (MT) post-editing in a realistic
setting involving 42 professional post-editors across two translation
directions. We compare four error-span highlight modalities, including
supervised and uncertainty-based word-level QE methods, for identifying
potential errors in the outputs of a state-of-the-art neural MT model.
Post-editing effort and productivity are estimated by behavioral logs, while
quality improvements are assessed by word- and segment-level human annotation.
We find that domain, language and editors' speed are critical factors in
determining highlights' effectiveness, with modest differences between
human-made and automated QE highlights underlining a gap between accuracy and
usability in professional workflows.