Qwen2.5-Omni Technical Report
March 26, 2025
Authors: Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, Junyang Lin
cs.AI
Abstract
In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable streaming of multimodal inputs, both the audio and visual encoders use a block-wise processing approach.
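As a rough illustration of the block-wise scheme, the sketch below encodes a feature stream chunk by chunk, so that downstream modules can consume partial results before the full input has arrived. The encoder internals, block size, and feature dimensions are hypothetical stand-ins, not the report's actual encoders.

```python
import torch
import torch.nn as nn

# Illustrative sketch of block-wise encoding: the input feature stream is
# split into fixed-size chunks, each encoded as it arrives, so downstream
# components can start work before the full input is available.
# Block size and encoder internals are assumed, not the paper's values.

class BlockWiseEncoder(nn.Module):
    def __init__(self, dim: int = 256, block_size: int = 100):
        super().__init__()
        self.block_size = block_size
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward_streaming(self, features: torch.Tensor):
        # features: (batch, time, dim); yield one encoded block at a time.
        for start in range(0, features.size(1), self.block_size):
            block = features[:, start:start + self.block_size]
            yield self.encoder(block)  # attention restricted to this block

enc = BlockWiseEncoder()
stream = torch.randn(1, 350, 256)  # e.g., audio features arriving over time
blocks = list(enc.forward_streaming(stream))
print(len(blocks), blocks[0].shape)  # 4 blocks; the last one is shorter
```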
To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE).
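The interleaving idea can be sketched as follows: tokens from both modalities are ordered on a shared timeline, chunk by chunk, and each receives a temporal position ID derived from its timestamp, so co-occurring audio and video get matching positions. The 2-second chunk size and 25-IDs-per-second granularity below are assumptions for illustration, and the real TMRoPE additionally carries height and width components for visual tokens.

```python
# Toy sketch of time-aligned interleaving: within each time chunk, video
# tokens are emitted first, then audio tokens, and every token's temporal
# position ID comes from its timestamp on the shared timeline.
# Chunk size and ID granularity are illustrative assumptions.

def interleave_by_time(video, audio, chunk=2.0, ids_per_sec=25):
    """video/audio: lists of (timestamp_seconds, token). Returns
    [(temporal_position_id, modality, token), ...] ordered chunk by chunk."""
    out, t = [], 0.0
    horizon = max([ts for ts, _ in video + audio], default=0.0)
    while t <= horizon:
        for ts, tok in video:
            if t <= ts < t + chunk:
                out.append((int(ts * ids_per_sec), "video", tok))
        for ts, tok in audio:
            if t <= ts < t + chunk:
                out.append((int(ts * ids_per_sec), "audio", tok))
        t += chunk
    return out

video = [(0.0, "v0"), (1.0, "v1"), (2.0, "v2")]
audio = [(0.0, "a0"), (0.5, "a1"), (1.0, "a2"), (2.0, "a3")]
# v1 and a2 (both at t=1.0) receive the same temporal position ID, 25.
print(interleave_by_time(video, audio))
```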
To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, the Thinker functions as a large language model tasked with text generation, while the Talker is a dual-track autoregressive model that directly consumes the Thinker's hidden representations to produce audio tokens as output. Both the Thinker and the Talker are designed for end-to-end training and inference.
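A minimal sketch of this division of labor, with hypothetical module names and shapes: the Thinker emits text logits plus hidden states, and the Talker predicts audio tokens while attending to those hidden states rather than to the sampled text. Causal masking and the Talker's dual-track input layout are simplified away here; this is not the report's actual implementation.

```python
import torch
import torch.nn as nn

class Thinker(nn.Module):
    """Stand-in for the text LLM: returns text logits and hidden states."""
    def __init__(self, vocab=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, input_ids):
        hidden = self.backbone(self.embed(input_ids))
        return self.lm_head(hidden), hidden  # logits + conditioning for Talker

class Talker(nn.Module):
    """Stand-in for the speech model: audio tokens from Thinker states."""
    def __init__(self, audio_vocab=8192, dim=512):
        super().__init__()
        self.audio_embed = nn.Embedding(audio_vocab, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.audio_head = nn.Linear(dim, audio_vocab)

    def forward(self, audio_ids, thinker_hidden):
        # Cross-attend to the Thinker's hidden states, not its text output,
        # so speech generation does not disturb text generation.
        x = self.decoder(self.audio_embed(audio_ids), memory=thinker_hidden)
        return self.audio_head(x)

thinker, talker = Thinker(), Talker()
text_logits, hidden = thinker(torch.randint(0, 32000, (1, 16)))
audio_logits = talker(torch.randint(0, 8192, (1, 8)), hidden)
print(text_logits.shape, audio_logits.shape)
```

Conditioning speech synthesis on hidden representations rather than on sampled text is what lets the two output streams proceed concurrently without interfering with each other.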
For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial packet delay.
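The receptive-field restriction can be pictured as a banded attention mask: each position attends only to a bounded window of past and future neighbors, so the first audio chunk can be synthesized without waiting for the whole token sequence. The window sizes below are illustrative, not the report's configuration.

```python
import torch

def sliding_window_mask(seq_len: int, lookback: int, lookahead: int) -> torch.Tensor:
    """Boolean mask where True marks positions a query may attend to.
    Bounding the lookahead limits how much future audio each frame needs,
    which caps the delay before the first packet can be emitted."""
    idx = torch.arange(seq_len)
    rel = idx[None, :] - idx[:, None]   # rel[i, j] = j - i
    return (rel >= -lookback) & (rel <= lookahead)

mask = sliding_window_mask(seq_len=10, lookback=2, lookahead=1)
print(mask.int())
# Row i attends only to tokens i-2 .. i+1; with an unrestricted receptive
# field, decoding frame 0 would have to wait for the entire sequence.
```

With such a mask, the wait before the first packet scales with the lookahead window rather than with the length of the utterance.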
Qwen2.5-Omni is comparable to the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks such as Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.