Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
December 18, 2025
Authors: Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle
cs.AI
Abstract
As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs match cascades only in selected settings, and SFMs lag behind both, highlighting that integrating an LLM, whether within the model or in a pipeline, is essential for high-quality speech translation.