OpenSTBench: Beyond Semantic Evaluation of Speech Translation

Abstract

Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (S2ST), offline translation, and streaming generation, producing outputs that differ in modality, speech realization, and timing behavior. Existing evaluation practices cover important aspects such as translation quality, speech quality, and temporal quality, but these aspects are often assessed under separate protocols, which makes it difficult to compare heterogeneous systems comprehensively. To address this gap, we present OpenSTBench, a unified multidimensional evaluation framework that organizes heterogeneous speech translation outputs into a shared evaluation format. OpenSTBench supports both S2TT and S2ST systems in offline and streaming settings, and jointly evaluates translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency. Through experiments on representative speech translation systems, we show that systems with strong translation quality can still differ substantially in speech quality and temporal quality. OpenSTBench provides a reproducible protocol for analyzing these cross-dimensional differences and supports application-oriented comparison of speech translation systems. The code and datasets are available at https://github.com/sjtuayj/OpenSTBench.