Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation

Abstract

Simultaneous speech-to-text translation (SimulST) systems must balance translation quality against latency, the delay between speech input and the translated output. While quality evaluation is well established, accurate latency measurement remains a challenge. Existing metrics often produce inconsistent or misleading results, especially in the widely used short-form setting, where speech is artificially presegmented. In this paper, we present the first comprehensive analysis of SimulST latency metrics across language pairs, systems, and both short- and long-form regimes. We uncover a structural bias in current metrics, related to segmentation, that undermines fair and meaningful comparisons. To address this, we introduce YAAL (Yet Another Average Lagging), a refined latency metric that delivers more accurate evaluations in the short-form regime. We extend YAAL to LongYAAL for unsegmented audio and propose SoftSegmenter, a novel resegmentation tool based on word-level alignment. Our experiments show that YAAL and LongYAAL outperform popular latency metrics, while SoftSegmenter improves alignment quality in long-form evaluation. Together, they enable more reliable assessment of SimulST systems.
