Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

Abstract

The emergence of GPT-4o-like large multimodal models (LMMs) has spurred exploration of how to integrate text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate the representations of the modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is a straightforward way to integrate modalities, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments that can simultaneously support interactions under various modality combinations. Stream-Omni uses an LLM as its backbone and aligns vision and speech to text based on their relationships to it. For vision, which is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech, which is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech data), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimension mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
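
The two alignment strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the architecture in detail, so the function names, the blank id, and the toy inputs below are all hypothetical. The first function shows sequence-dimension concatenation in its simplest form (vision embeddings placed in the same sequence as text embeddings); the second shows the standard CTC collapsing rule (merge repeated labels, then drop blanks), which is the generic mechanism by which a long sequence of speech frames can be mapped to a shorter sequence of text tokens.

```python
from typing import List, Sequence

# Hypothetical blank label id; real CTC vocabularies choose their own.
BLANK = 0


def concat_sequence(
    vision: List[List[float]], text: List[List[float]]
) -> List[List[float]]:
    """Sequence-dimension concatenation (illustrative only).

    Each input is a list of embedding vectors. The result is one longer
    sequence with the vision embeddings followed by the text embeddings.
    """
    return vision + text


def ctc_greedy_collapse(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Standard CTC collapsing: merge consecutive repeats, then drop blanks.

    `frame_ids` holds one predicted label per speech frame (for example,
    the argmax of per-frame CTC logits).
    """
    tokens: List[int] = []
    prev = None
    for label in frame_ids:
        # Emit a label only when it differs from the previous frame's label
        # and is not the blank symbol.
        if label != prev and label != blank:
            tokens.append(label)
        prev = label
    return tokens


if __name__ == "__main__":
    # Two toy "vision" vectors and three toy "text" vectors, dimension 2.
    vision = [[0.1, 0.2], [0.3, 0.4]]
    text = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(len(concat_sequence(vision, text)))  # 5

    # Eight speech frames collapse to three text tokens. The blank between
    # the two runs of 7 keeps them as separate tokens.
    print(ctc_greedy_collapse([0, 7, 7, 0, 7, 3, 3, 0]))  # [7, 7, 3]
```

The collapsing rule also suggests why a CTC-style mapping can expose intermediate text during speech interaction: the per-frame labels already form a text hypothesis once repeats and blanks are removed. How Stream-Omni places this mapping across LLM layers is not described in the abstract and is not modeled here.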
