

OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion

November 28, 2025
Authors: Sai Koneru, Matthias Huck, Jan Niehues
cs.AI

Abstract

There has been significant progress in open-source text-only translation large language models (LLMs), with broader language coverage and higher quality. However, these models can only be used in cascaded pipelines for speech translation (ST), performing automatic speech recognition first and then translating the transcript. This introduces additional latency, which is particularly critical in simultaneous ST (SimulST), and prevents the model from exploiting multimodal context, such as images, that can aid disambiguation. Pretrained multimodal foundation models (MMFMs) already possess strong perception and reasoning capabilities across multiple modalities, but generally lack the multilingual coverage and specialized translation performance of dedicated translation LLMs. To build an effective multimodal translation system, we propose an end-to-end approach that fuses MMFMs with translation LLMs. We introduce a novel fusion strategy that connects hidden states from multiple layers of a pretrained MMFM to a translation LLM, enabling joint end-to-end training. The resulting model, OmniFusion, built on Omni 2.5-7B as the MMFM and SeedX PPO-7B as the translation LLM, can perform speech-to-text, speech-and-image-to-text, and text-and-image-to-text translation. Experiments demonstrate that OmniFusion effectively leverages both audio and visual inputs, achieves a 1-second latency reduction in SimulST compared to cascaded pipelines, and improves overall translation quality. Code is available at https://github.com/saikoneru/OmniFusion.
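To make the fusion idea concrete, here is a minimal numpy sketch of connecting hidden states from several MMFM layers into a translation LLM's embedding space: each tapped layer gets its own learned projection, and the projected states are combined with learned softmax weights. All names, dimensions, and the weighted-sum combination rule are illustrative assumptions; the paper's actual fusion architecture may differ (see the released code for details).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector of layer logits.
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_layers(hidden_states, projections, layer_logits):
    """Fuse hidden states tapped from multiple MMFM layers (sketch).

    hidden_states: list of (seq_len, mmfm_dim) arrays, one per tapped layer.
    projections:   list of (mmfm_dim, llm_dim) matrices mapping each layer
                   into the translation LLM's embedding space (learned).
    layer_logits:  (num_layers,) learnable weights for combining layers.
    Returns a (seq_len, llm_dim) array that could be fed to the LLM as
    soft input representations for joint end-to-end training.
    """
    # Project each layer's states into the LLM space: (L, seq, llm_dim).
    projected = np.stack([h @ W for h, W in zip(hidden_states, projections)])
    # Weighted sum across tapped layers.
    w = softmax(layer_logits)
    return np.tensordot(w, projected, axes=1)

# Toy usage with placeholder sizes (real models use much larger dims).
rng = np.random.default_rng(0)
num_layers, seq_len, mmfm_dim, llm_dim = 3, 5, 32, 64
states = [rng.standard_normal((seq_len, mmfm_dim)) for _ in range(num_layers)]
projs = [rng.standard_normal((mmfm_dim, llm_dim)) for _ in range(num_layers)]
fused = fuse_layers(states, projs, np.zeros(num_layers))
print(fused.shape)  # (5, 64)
```

With zero logits the softmax reduces to a uniform average over layers; during training the logits would learn which MMFM depths carry the most useful perceptual information for translation.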