

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

December 18, 2025
作者: Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle
cs.AI

Abstract

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs match cascades only in selected settings and SFMs lag behind both, highlighting that integrating an LLM, whether within the model or in a pipeline, is essential for high-quality speech translation.