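Qwen2.5-Omni Technical Report

Abstract

This report presents Qwen2.5-Omni, an end-to-end multimodal model that perceives diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable streaming of multimodal inputs, both the audio and visual encoders use a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we arrange the audio and video sequentially in an interleaved manner and propose a novel position embedding approach named TMRoPE (Time-aligned Multimodal RoPE). To generate text and speech concurrently while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model responsible for text generation, while Talker is a dual-track autoregressive model that directly uses the hidden representations from Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and run for inference end to end. To decode audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, with the aim of reducing the initial package delay. Qwen2.5-Omni performs comparably to the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks such as Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. For speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.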
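As a rough illustration of how a sliding window can restrict the receptive field for block-wise streaming decoding, the sketch below builds a boolean attention mask in which each token attends only to its own block plus a limited number of neighbouring blocks. The abstract does not specify block sizes or window widths, so the function name `sliding_window_block_mask` and the parameters `block_size`, `lookback_blocks`, and `lookahead_blocks` are hypothetical and chosen only for this example; this is not the authors' implementation.

```python
def sliding_window_block_mask(num_tokens, block_size, lookback_blocks, lookahead_blocks):
    """Return mask[i][j] == True when token i may attend to token j.

    Tokens are grouped into consecutive blocks of `block_size`. A token in
    block b sees blocks b - lookback_blocks through b + lookahead_blocks.
    All parameter values are illustrative assumptions, not values from the report.
    """
    mask = []
    for i in range(num_tokens):
        block_i = i // block_size
        row = []
        for j in range(num_tokens):
            block_j = j // block_size
            # Allowed only if block_j lies inside the window around block_i.
            row.append(block_i - lookback_blocks <= block_j <= block_i + lookahead_blocks)
        mask.append(row)
    return mask


if __name__ == "__main__":
    mask = sliding_window_block_mask(
        num_tokens=8, block_size=2, lookback_blocks=1, lookahead_blocks=0
    )
    # Print the mask as a grid: "x" marks an allowed attention pair.
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
```

Because each block depends only on a bounded window of other blocks, a decoder using such a mask can start emitting output for the first block without waiting for the full token sequence, which is the intuition behind reducing the initial delay.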

