How to Evaluate Speech Translation with Source-Aware Neural MT Metrics
November 5, 2025
Authors: Mauro Cettolo, Marco Gaido, Matteo Negri, Sara Papi, Luisa Bentivogli
cs.AI
Abstract
Automatic evaluation of speech-to-text translation (ST) systems is typically
performed by comparing translation hypotheses with one or more reference
translations. While effective to some extent, this approach inherits the
limitation of reference-based evaluation that ignores valuable information from
the source input. In machine translation (MT), recent progress has shown that
neural metrics incorporating the source text achieve stronger correlation with
human judgments. Extending this idea to ST, however, is not trivial because the
source is audio rather than text, and reliable transcripts or alignments
between source and references are often unavailable. In this work, we conduct
the first systematic study of source-aware metrics for ST, with a particular
focus on real-world operating conditions where source transcripts are not
available. We explore two complementary strategies for generating textual
proxies of the input audio: automatic speech recognition (ASR) transcripts and
back-translations of the reference translations. We also introduce a novel
two-step cross-lingual re-segmentation algorithm to address the alignment
mismatch between synthetic sources and reference translations. Our experiments, carried
out on two ST benchmarks covering 79 language pairs and six ST systems with
diverse architectures and performance levels, show that ASR transcripts
constitute a more reliable synthetic source than back-translations when word
error rate is below 20%, while back-translations always represent a
computationally cheaper but still effective alternative. Furthermore, our
cross-lingual re-segmentation algorithm enables robust use of source-aware MT
metrics in ST evaluation, paving the way toward more accurate and principled
evaluation methodologies for speech translation.
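The finding above suggests a simple decision rule for practitioners: use the ASR transcript as the synthetic source when the ASR system's word error rate is below 20%, and fall back to back-translations otherwise. The sketch below is illustrative only, not the authors' code: the `wer` and `choose_synthetic_source` functions are assumptions, and in the transcript-free setting the WER would have to be estimated indirectly (e.g., on a held-out set where gold transcripts exist).

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over whitespace tokens,
    normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


def choose_synthetic_source(asr_transcript: str, back_translation: str,
                            estimated_wer: float,
                            threshold: float = 0.20) -> str:
    """Prefer the ASR transcript when its estimated WER is under the
    threshold reported in the paper; otherwise use the back-translation
    of the reference as the cheaper proxy."""
    return asr_transcript if estimated_wer < threshold else back_translation
```

Here `estimated_wer` stands in for whatever WER estimate is available for the ASR system; the 0.20 threshold mirrors the 20% figure reported in the abstract.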