OpenSTBench: Speech Translation Beyond Semantic Evaluation

Abstract

Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (S2ST), offline translation, and streaming generation, producing outputs that differ in modality, speech realization, and timing behavior. Existing evaluation practices cover important aspects such as translation quality, speech quality, and temporal quality, but typically assess each under a separate protocol, which makes it difficult to compare heterogeneous systems comprehensively. To address this gap, we present OpenSTBench, a unified multidimensional evaluation framework that organizes heterogeneous speech translation outputs into a shared evaluation format. OpenSTBench supports both S2TT and S2ST systems in offline and streaming settings, and jointly evaluates translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency. Through experiments on representative speech translation systems, we show that systems with strong translation quality can still differ substantially in speech quality and temporal quality. OpenSTBench provides a reproducible protocol for analyzing these cross-dimensional differences and supports application-oriented comparison of speech translation systems. The code and datasets are available at https://github.com/sjtuayj/OpenSTBench.
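To make the idea of a shared evaluation format concrete, the following is a minimal illustrative sketch in Python. It is not OpenSTBench's actual schema or API: the class `EvalRecord`, its field names, and the dimension identifiers are hypothetical, and the mapping from system type to applicable dimensions (speech-related dimensions only for S2ST, latency only for streaming) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Hypothetical dimension names, paraphrased from the abstract.
# They are NOT taken from the OpenSTBench code base.
SPEECH_DIMENSIONS = {
    "speech_quality",
    "speaker_preservation",
    "paralinguistic_fidelity",
    "temporal_consistency",
}


@dataclass
class EvalRecord:
    """One system output in a shared format (hypothetical sketch)."""

    system: str
    task: str                               # "S2TT" or "S2ST"
    streaming: bool                         # False means offline
    hypothesis_text: str                    # translated text or transcript
    hypothesis_audio: Optional[str] = None  # audio path; None for S2TT

    def __post_init__(self) -> None:
        if self.task not in ("S2TT", "S2ST"):
            raise ValueError(f"unknown task: {self.task}")
        if self.task == "S2ST" and self.hypothesis_audio is None:
            raise ValueError("S2ST records need an audio output")

    def applicable_dimensions(self) -> Set[str]:
        """Dimensions that can be scored for this record.

        Assumption: translation quality applies to every record,
        speech-related dimensions only to S2ST, latency only to
        streaming systems.
        """
        dims = {"translation_quality"}
        if self.task == "S2ST":
            dims |= SPEECH_DIMENSIONS
        if self.streaming:
            dims.add("latency")
        return dims


offline_s2tt = EvalRecord("sys_a", "S2TT", False, "hello world")
streaming_s2st = EvalRecord("sys_b", "S2ST", True, "hello world", "out.wav")
print(sorted(offline_s2tt.applicable_dimensions()))
print(sorted(streaming_s2st.applicable_dimensions()))
```

Under these assumptions, placing both record types in one structure lets a single evaluation loop score a text-only offline system and a streaming speech-output system side by side, skipping the dimensions that do not apply.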