Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

June 16, 2025
Authors: Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng
cs.AI

Abstract

The emergence of GPT-4o-like large multimodal models (LMMs) has spurred exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate the representations of different modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is a straightforward way to integrate modalities, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments that can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns vision and speech to text based on their relationships to it. For vision, which is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech, which is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni achieves modality alignment with less data (especially speech data), enabling the transfer of text capabilities to the other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimension mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
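
The abstract contrasts two alignment strategies: sequence-dimension concatenation for vision and a CTC-based layer-dimension mapping for speech. The sketch below illustrates that contrast in PyTorch; it is a minimal, assumed rendering (module names such as StreamOmniSketch, vision_proj, speech_encoder, and ctc_head, as well as all shapes and hyperparameters, are illustrative), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamOmniSketch(nn.Module):
    """Illustrative sketch of the two alignment strategies described above.
    Module names and shapes are assumptions, not the paper's actual API."""

    def __init__(self, d_model=1024, n_llm_layers=4, text_vocab=32000):
        super().__init__()
        # Stand-in LLM backbone: a small stack of Transformer layers.
        self.llm_layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_llm_layers)
        ])
        # Vision is semantically complementary to text: project it and
        # concatenate with the text along the sequence dimension.
        self.vision_proj = nn.Linear(d_model, d_model)
        # Speech is semantically consistent with text: a speech layer plus a
        # CTC head maps speech frames onto text tokens (layer-dimension
        # mapping) rather than appending them to the sequence.
        self.speech_encoder = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.ctc_head = nn.Linear(d_model, text_vocab + 1)  # extra class = CTC blank
        self.blank_id = text_vocab

    def forward(self, vision_feats, text_embeds, speech_feats,
                text_ids, text_lens, speech_lens):
        # Speech-text alignment: CTC supervision over the speech layer's states.
        speech_hidden = self.speech_encoder(speech_feats)          # (B, T_sp, D)
        log_probs = self.ctc_head(speech_hidden).log_softmax(-1)   # (B, T_sp, V+1)
        ctc_loss = F.ctc_loss(log_probs.transpose(0, 1), text_ids,
                              speech_lens, text_lens, blank=self.blank_id)
        # Vision-text alignment: sequence-dimension concatenation.
        vis_tokens = self.vision_proj(vision_feats)                # (B, T_vis, D)
        x = torch.cat([vis_tokens, text_embeds], dim=1)            # (B, T_vis+T_txt, D)
        for layer in self.llm_layers:
            x = layer(x)
        return x, ctc_loss
```

In this reading, speech is aligned to text through a per-layer CTC objective instead of occupying extra sequence positions, which is consistent with the abstract's claim that intermediate text (ASR transcriptions and model responses) can be surfaced during speech interaction.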