Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

Abstract

The emergence of GPT-4o-like large multimodal models (LMMs) has spurred exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate the representations of the modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is a straightforward way to integrate modalities, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns vision and speech to text based on their relationships to it. For vision, which is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech, which is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech data), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimension mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
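The two alignment styles named in the abstract can be illustrated with a minimal sketch. This is not the Stream-Omni implementation, which the abstract does not describe in detail. It only shows generic sequence-dimension concatenation and the standard CTC collapse rule (merge consecutive repeats, then drop blanks) that lets a long run of speech frames map onto a shorter text token sequence. The function names, the blank id `0`, and the toy token ids are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # hypothetical id for the CTC blank symbol (an assumption)


def concat_along_sequence(vision_embeds, text_embeds):
    """Sequence-dimension concatenation: vision vectors are placed
    in front of text vectors to form one longer input sequence."""
    return list(vision_embeds) + list(text_embeds)


def ctc_greedy_collapse(frame_ids, blank=BLANK):
    """Standard CTC collapse: merge consecutive repeated ids,
    then remove blank ids, yielding a shorter token sequence."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank]


# Toy data: two 2-dimensional "vision" vectors and one "text" vector.
vision = [[0.1, 0.2], [0.3, 0.4]]
text = [[0.5, 0.6]]
sequence = concat_along_sequence(vision, text)

# Toy per-frame predictions for ten speech frames.
frames = [0, 7, 7, 0, 0, 3, 3, 3, 0, 7]
tokens = ctc_greedy_collapse(frames)
```

In this toy case, ten speech frames collapse to three token ids, while concatenation simply lengthens the input sequence. The sketch is meant only to contrast the two ideas. How Stream-Omni places the CTC mapping across the layers of the LLM is not specified in the abstract and is not modelled here.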

