SimulU: Training-free Policy for Long-form Simultaneous Speech-to-Speech Translation
March 11, 2026
Authors: Amirbek Djanibekov, Luisa Bentivogli, Matteo Negri, Sara Papi
cs.AI
Abstract
Simultaneous speech-to-speech translation (SimulS2S) is essential for real-time multilingual communication, with increasing integration into meeting and streaming platforms. Despite this, SimulS2S remains underexplored in research, where current solutions often rely on resource-intensive training procedures and operate on short-form, pre-segmented utterances, failing to generalize to continuous speech. To bridge this gap, we propose SimulU, the first training-free policy for long-form SimulS2S. SimulU adopts history management and speech output selection strategies that exploit cross-attention in pre-trained end-to-end models to regulate both input history and output generation. Evaluations on MuST-C across 8 languages show that SimulU achieves a better or comparable quality-latency trade-off against strong cascaded models. By eliminating the need for ad-hoc training, SimulU offers a promising path to end-to-end SimulS2S in realistic, long-form scenarios.
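The abstract describes a policy that reads cross-attention from a pre-trained model to decide when to emit output and how much input history to retain. As a rough illustration only (not the authors' algorithm; function names, the guard window, and thresholds are invented for this sketch), an attention-driven emit/trim policy might look like:

```python
# Hypothetical sketch of a training-free, attention-based simultaneous policy.
# `attn` is the cross-attention mass the model places on each input frame
# when generating the next output unit; all names and thresholds are
# illustrative, not taken from the SimulU paper.

def should_emit(attn, guard_frames=4, alpha=0.2):
    """Emit the next output unit only if little attention mass falls on the
    most recent `guard_frames` frames (i.e., the model is not still relying
    on audio that may change as more input arrives)."""
    recent_mass = sum(attn[-guard_frames:])
    return recent_mass < alpha

def trim_history(attn, frames, threshold=0.8):
    """History-management sketch: keep the shortest suffix of frames whose
    cumulative attention mass reaches `threshold`, dropping older frames."""
    total = sum(attn)
    cum = 0.0
    for i in range(len(attn) - 1, -1, -1):
        cum += attn[i]
        if cum / total >= threshold:
            return frames[i:], attn[i:]
    return frames, attn
```

The intuition matches the abstract's description: output generation is gated by where the pre-trained model attends, and the input history is pruned once older frames no longer receive meaningful attention, so no additional training is required.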