
OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion

November 28, 2025
作者: Sai Koneru, Matthias Huck, Jan Niehues
cs.AI

Abstract

There has been significant progress in open-source text-only translation large language models (LLMs), with better language coverage and quality. However, these models can only be used in cascaded pipelines for speech translation (ST), performing automatic speech recognition first, followed by translation. This introduces additional latency, which is particularly critical in simultaneous ST (SimulST), and prevents the model from exploiting multimodal context, such as images, which can aid disambiguation. Pretrained multimodal foundation models (MMFMs) already possess strong perception and reasoning capabilities across multiple modalities, but generally lack the multilingual coverage and specialized translation performance of dedicated translation LLMs. To build an effective multimodal translation system, we propose an end-to-end approach that fuses MMFMs with translation LLMs. We introduce a novel fusion strategy that connects hidden states from multiple layers of a pretrained MMFM to a translation LLM, enabling joint end-to-end training. The resulting model, OmniFusion, built on Omni 2.5-7B as the MMFM and SeedX PPO-7B as the translation LLM, can perform speech-to-text, speech-and-image-to-text, and text-and-image-to-text translation. Experiments demonstrate that OmniFusion effectively leverages both audio and visual inputs, achieves a 1-second latency reduction in SimulST compared to cascaded pipelines, and also improves the overall translation quality. Code is available at https://github.com/saikoneru/OmniFusion.
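The fusion strategy above can be illustrated with a minimal sketch: hidden states tapped from several MMFM layers are each passed through a learned projection into the translation LLM's embedding space and combined into a single conditioning sequence. All dimensions, layer indices, and the summation-based combination here are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions; the real models are 7B-parameter transformers.
MMFM_DIM, LLM_DIM, SEQ_LEN = 16, 24, 5
TAPPED_LAYERS = [4, 8, 12]  # which MMFM layers to tap (an assumption)

# Stand-in hidden states from the tapped MMFM layers: one (seq, dim) array per layer.
mmfm_states = {l: rng.normal(size=(SEQ_LEN, MMFM_DIM)) for l in TAPPED_LAYERS}

# One learned projection per tapped layer, mapping MMFM features into the
# translation LLM's embedding space (adapter-style; weights are random here).
projections = {l: rng.normal(size=(MMFM_DIM, LLM_DIM)) * 0.1 for l in TAPPED_LAYERS}

def fuse(states, projections):
    """Project each tapped layer into the LLM space and sum the results,
    yielding one sequence of LLM-dimensional vectors the LLM can attend to."""
    fused = np.zeros((SEQ_LEN, LLM_DIM))
    for layer, hidden in states.items():
        fused += hidden @ projections[layer]
    return fused

fused = fuse(mmfm_states, projections)
print(fused.shape)  # (5, 24): one LLM-dimensional vector per multimodal position
```

In an end-to-end trained system, the projection weights would be optimized jointly with (or alongside frozen copies of) both backbone models, so the combined sequence carries speech and image context directly into the translation LLM without an intermediate transcript.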