How to Evaluate Speech Translation with Source-Aware Neural MT Metrics
November 5, 2025
Authors: Mauro Cettolo, Marco Gaido, Matteo Negri, Sara Papi, Luisa Bentivogli
cs.AI
Abstract
Automatic evaluation of speech-to-text translation (ST) systems is typically
performed by comparing translation hypotheses with one or more reference
translations. While effective to some extent, this approach inherits the
limitation of reference-based evaluation that ignores valuable information from
the source input. In machine translation (MT), recent progress has shown that
neural metrics incorporating the source text achieve stronger correlation with
human judgments. Extending this idea to ST, however, is not trivial because the
source is audio rather than text, and reliable transcripts or alignments
between source and references are often unavailable. In this work, we conduct
the first systematic study of source-aware metrics for ST, with a particular
focus on real-world operating conditions where source transcripts are not
available. We explore two complementary strategies for generating textual
proxies of the input audio: automatic speech recognition (ASR) transcripts and
back-translations of the reference translation. We also introduce a novel two-step
cross-lingual re-segmentation algorithm to address the alignment mismatch
between synthetic sources and reference translations. Our experiments, carried
out on two ST benchmarks covering 79 language pairs and six ST systems with
diverse architectures and performance levels, show that ASR transcripts
constitute a more reliable synthetic source than back-translations when word
error rate is below 20%, while back-translations always represent a
computationally cheaper but still effective alternative. Furthermore, our
cross-lingual re-segmentation algorithm enables robust use of source-aware MT
metrics in ST evaluation, paving the way toward more accurate and principled
evaluation methodologies for speech translation.
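
To make the source-aware setup concrete, here is a minimal sketch of scoring an ST hypothesis with a source-aware neural MT metric while substituting a textual proxy (an ASR transcript, or alternatively a back-translation of the reference) for the unavailable source text. It assumes the unbabel-comet package and the Unbabel/wmt22-comet-da checkpoint; the example segments are invented, and the snippet is illustrative rather than the paper's exact pipeline.

```python
# Minimal sketch: source-aware MT scoring of ST output with a synthetic source.
# Assumes the unbabel-comet package (pip install unbabel-comet); the example
# segments are invented and the checkpoint choice is illustrative only.
from comet import download_model, load_from_checkpoint

# Source-aware metric: COMET consumes a source ("src"), hypothesis ("mt"),
# and reference ("ref") for each segment.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

# The true source is audio, so a textual proxy stands in for "src": here an
# ASR transcript; a back-translation of the reference would be used the same way.
data = [
    {
        "src": "the meeting will start at ten",     # ASR transcript (proxy source)
        "mt": "la riunione inizierà alle dieci",    # ST system hypothesis
        "ref": "la riunione comincerà alle dieci",  # reference translation
    },
]

output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.scores)        # per-segment scores
print(output.system_score)  # corpus-level score
```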
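
The abstract does not detail the proposed two-step cross-lingual re-segmentation algorithm, so the sketch below only illustrates the general boundary-projection idea behind re-segmentation for ST evaluation, in the spirit of mwerSegmenter-style tools: reference segment boundaries are projected onto an unsegmented hypothesis word stream via word-level alignment. It is monolingual and much simpler than the paper's algorithm; the function name and example strings are hypothetical.

```python
# Illustrative sketch (NOT the paper's cross-lingual algorithm): re-segment an
# unsegmented hypothesis so that it matches the reference segmentation, by
# projecting reference segment boundaries through a word-level alignment.
from difflib import SequenceMatcher

def resegment(hyp_text, ref_segments):
    """Split hyp_text into len(ref_segments) pieces at projected boundaries."""
    hyp_words = hyp_text.split()
    ref_words, ends = [], []
    for seg in ref_segments:
        ref_words.extend(seg.split())
        ends.append(len(ref_words))  # cumulative word count after each segment

    # Word-level alignment between the reference and hypothesis streams.
    blocks = SequenceMatcher(a=ref_words, b=hyp_words,
                             autojunk=False).get_matching_blocks()

    # Map each reference word position to a hypothesis word position.
    proj = [0] * (len(ref_words) + 1)
    for a, b, size in blocks:
        for k in range(size + 1):
            proj[a + k] = b + k
    for i in range(1, len(proj)):        # keep the mapping monotone
        proj[i] = max(proj[i], proj[i - 1])
    proj[len(ref_words)] = len(hyp_words)

    # Cut the hypothesis at the projected segment boundaries.
    cuts = [0] + [min(proj[e], len(hyp_words)) for e in ends[:-1]] + [len(hyp_words)]
    return [" ".join(hyp_words[cuts[i]:cuts[i + 1]])
            for i in range(len(ref_segments))]

# Example with hypothetical strings:
print(resegment("hello word how are you today", ["hello world", "how are you"]))
# -> ['hello word', 'how are you today']
```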