SimulU: Training-free Policy for Long-form Simultaneous Speech-to-Speech Translation

Abstract

Simultaneous speech-to-speech translation (SimulS2S) is essential for real-time multilingual communication, with increasing integration into meeting and streaming platforms. Despite this, SimulS2S remains underexplored in research, where current solutions often rely on resource-intensive training procedures and operate on short-form, pre-segmented utterances, failing to generalize to continuous speech. To bridge this gap, we propose SimulU, the first training-free policy for long-form SimulS2S. SimulU adopts history management and speech output selection strategies that exploit cross-attention in pre-trained end-to-end models to regulate both input history and output generation. Evaluations on MuST-C across 8 languages show that SimulU achieves a better or comparable quality-latency trade-off against strong cascaded models. By eliminating the need for ad-hoc training, SimulU offers a promising path to end-to-end SimulS2S in realistic, long-form scenarios.
