SimulU: Training-free Policy for Long-form Simultaneous Speech-to-Speech Translation
March 11, 2026
Authors: Amirbek Djanibekov, Luisa Bentivogli, Matteo Negri, Sara Papi
cs.AI
Abstract
Simultaneous speech-to-speech translation (SimulS2S) is essential for real-time multilingual communication, with increasing integration into meeting and streaming platforms. Despite this, SimulS2S remains underexplored in research, where current solutions often rely on resource-intensive training procedures and operate on short-form, pre-segmented utterances, failing to generalize to continuous speech. To bridge this gap, we propose SimulU, the first training-free policy for long-form SimulS2S. SimulU adopts history management and speech output selection strategies that exploit cross-attention in pre-trained end-to-end models to regulate both input history and output generation. Evaluations on MuST-C across 8 languages show that SimulU achieves a quality-latency trade-off that is better than or comparable to that of strong cascaded models. By eliminating the need for ad-hoc training, SimulU offers a promising path to end-to-end SimulS2S in realistic, long-form scenarios.
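The abstract describes a policy that reads cross-attention from a pre-trained model to decide when to emit output. A minimal sketch of this idea, not the authors' actual method: the function name `decide_action`, the frame window, and the threshold below are hypothetical, and the rule shown (emit only when little attention mass falls on the newest, still-unstable audio frames) is one common way such training-free attention-based policies are realized.

```python
# Hypothetical sketch of a training-free, attention-based emission
# policy (illustrative only; not the SimulU implementation).
# Intuition: if the candidate output token's cross-attention is still
# concentrated on the newest audio frames, the hypothesis may change
# once more audio arrives, so the system should READ instead of WRITE.

def decide_action(cross_attention, num_recent_frames=2, threshold=0.3):
    """Return "WRITE" if the attention mass on the newest frames is
    below `threshold` (the token no longer depends on unseen context),
    otherwise "READ".

    cross_attention: per-frame attention weights for the candidate
    output token, assumed to be normalized to sum to 1.
    """
    recent_mass = sum(cross_attention[-num_recent_frames:])
    return "WRITE" if recent_mass < threshold else "READ"


# Attention concentrated on early, committed frames -> safe to emit.
stable = [0.5, 0.3, 0.1, 0.06, 0.04]
print(decide_action(stable))    # WRITE

# Attention still focused on the newest frames -> wait for more audio.
unstable = [0.05, 0.05, 0.1, 0.4, 0.4]
print(decide_action(unstable))  # READ
```

In a real system the same signal can also drive history management: input frames that no longer receive meaningful attention are candidates for pruning, keeping long-form inference bounded in memory.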