Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

June 16, 2025
Authors: Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng
cs.AI

Abstract

The emergence of GPT-4o-like large multimodal models (LMMs) has spurred the exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate the representations of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns vision and speech to text based on their relationships. For vision, which is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech, which is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignment with less data (especially speech data), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimension mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
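
To make the two alignment strategies in the abstract concrete, below is a minimal PyTorch sketch, not the released Stream-Omni code: the module names, feature dimensions, and the choice of which intermediate layer feeds the CTC head are all illustrative assumptions. It shows vision features concatenated with text along the sequence dimension (vision-text alignment), while the hidden states of the speech positions at an early layer are read out through a CTC head that maps them onto text units (a stand-in for the layer-dimension speech-text alignment).

```python
import torch
import torch.nn as nn

class ToyOmniBackbone(nn.Module):
    """Illustrative only: sequence-dimension concat for vision, CTC readout for speech."""

    def __init__(self, d_model=64, vocab_size=100, n_layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.vision_proj = nn.Linear(32, d_model)   # hypothetical vision feature size 32
        self.speech_proj = nn.Linear(16, d_model)   # hypothetical speech feature size 16
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(n_layers)]
        )
        # CTC head applied to the speech positions at an intermediate layer:
        # a stand-in for the layer-dimension speech-to-text mapping.
        self.ctc_head = nn.Linear(d_model, vocab_size + 1)  # +1 for the CTC blank
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, vision_feats, speech_feats):
        text = self.text_embed(text_ids)            # (B, T_text, D)
        vision = self.vision_proj(vision_feats)     # (B, T_vis, D)
        speech = self.speech_proj(speech_feats)     # (B, T_sp, D)

        # Vision-text alignment: concatenate along the sequence dimension.
        x = torch.cat([vision, text, speech], dim=1)

        ctc_logits = None
        n_speech = speech.size(1)
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == 1:
                # Speech-text alignment: read out CTC logits over the speech span
                # after an early layer (which layer to use is an assumption here).
                ctc_logits = self.ctc_head(x[:, -n_speech:, :])
        return self.lm_head(x), ctc_logits


# Shape check with random inputs.
model = ToyOmniBackbone()
text_ids = torch.randint(0, 100, (2, 10))
vision_feats = torch.randn(2, 8, 32)
speech_feats = torch.randn(2, 20, 16)
lm_logits, ctc_logits = model(text_ids, vision_feats, speech_feats)
print(lm_logits.shape)   # torch.Size([2, 38, 100])
print(ctc_logits.shape)  # torch.Size([2, 20, 101])
```

In a real training setup, the CTC logits would typically be supervised with nn.CTCLoss against the text transcript, which is how CTC-style speech-text alignment is usually trained; the exact formulation used by Stream-Omni is described in the paper, not in this sketch.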