OpenSTBench: Beyond Semantic Evaluation in Speech Translation

Abstract

Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (S2ST), offline translation, and streaming generation, producing outputs that differ in modality, speech realization, and timing behavior. Existing evaluation practices assess important aspects such as translation quality, speech quality, and temporal quality, but these aspects are usually evaluated under separate protocols, which makes it difficult to compare heterogeneous systems comprehensively. To address this gap, we present OpenSTBench, a unified multidimensional evaluation framework that organizes heterogeneous speech translation outputs into a shared evaluation format. OpenSTBench supports both S2TT and S2ST systems in offline and streaming settings, and jointly evaluates translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency. Through experiments on representative speech translation systems, we show that systems with strong translation quality can still differ substantially in speech quality and in temporal quality. OpenSTBench provides a reproducible protocol for analyzing these cross-dimensional differences and for supporting application-oriented comparison of speech translation systems. The code and datasets are available at https://github.com/sjtuayj/OpenSTBench.