SimulU: Training-free Policy for Long-form Simultaneous Speech-to-Speech Translation
March 11, 2026
Authors: Amirbek Djanibekov, Luisa Bentivogli, Matteo Negri, Sara Papi
cs.AI
Abstract
Simultaneous speech-to-speech translation (SimulS2S) is essential for real-time multilingual communication, with increasing integration into meeting and streaming platforms. Despite this, SimulS2S remains underexplored in research, where current solutions often rely on resource-intensive training procedures and operate on short-form, pre-segmented utterances, failing to generalize to continuous speech. To bridge this gap, we propose SimulU, the first training-free policy for long-form SimulS2S. SimulU adopts history management and speech output selection strategies that exploit cross-attention in pre-trained end-to-end models to regulate both input history and output generation. Evaluations on MuST-C across 8 languages show that SimulU achieves a better or comparable quality-latency trade-off against strong cascaded models. By eliminating the need for ad-hoc training, SimulU offers a promising path to end-to-end SimulS2S in realistic, long-form scenarios.
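The abstract describes a policy that reads cross-attention from a pre-trained model to decide when to emit output and how much input history to retain. As a rough illustration only (not the authors' algorithm; function names, the guard window, and thresholds are invented for this sketch), an attention-driven emit/trim policy might look like:

```python
# Hypothetical sketch of a training-free, attention-based simultaneous policy.
# `attn` is the cross-attention mass the model places on each input frame
# when generating the next output unit; all names and thresholds are
# illustrative, not taken from the SimulU paper.

def should_emit(attn, guard_frames=4, alpha=0.2):
    """Emit the next output unit only if little attention mass falls on the
    most recent `guard_frames` frames (i.e., the model is not still relying
    on audio that may change as more input arrives)."""
    recent_mass = sum(attn[-guard_frames:])
    return recent_mass < alpha

def trim_history(attn, frames, threshold=0.8):
    """History-management sketch: keep the shortest suffix of frames whose
    cumulative attention mass reaches `threshold`, dropping older frames."""
    total = sum(attn)
    cum = 0.0
    for i in range(len(attn) - 1, -1, -1):
        cum += attn[i]
        if cum / total >= threshold:
            return frames[i:], attn[i:]
    return frames, attn
```

The intuition matches the abstract's description: output generation is gated by where the pre-trained model attends, and the input history is pruned once older frames no longer receive meaningful attention, so no additional training is required.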