Qwen2.5-Omni Technical Report
March 26, 2025
Authors: Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, Junyang Lin
cs.AI
Abstract
In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable streaming of multimodal inputs, both the audio and visual encoders use a block-wise processing approach.
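As a rough illustration of the block-wise scheme, the sketch below encodes a feature stream chunk by chunk, so that downstream modules can consume partial results before the full input has arrived. The encoder internals, block size, and feature dimensions are hypothetical stand-ins, not the report's actual encoders.

```python
import torch
import torch.nn as nn

# Illustrative sketch of block-wise encoding: the input feature stream is
# split into fixed-size chunks, each encoded as it arrives, so downstream
# components can start work before the full input is available.
# Block size and encoder internals are assumed, not the paper's values.

class BlockWiseEncoder(nn.Module):
    def __init__(self, dim: int = 256, block_size: int = 100):
        super().__init__()
        self.block_size = block_size
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward_streaming(self, features: torch.Tensor):
        # features: (batch, time, dim); yield one encoded block at a time.
        for start in range(0, features.size(1), self.block_size):
            block = features[:, start:start + self.block_size]
            yield self.encoder(block)  # attention restricted to this block

enc = BlockWiseEncoder()
stream = torch.randn(1, 350, 256)  # e.g., audio features arriving over time
blocks = list(enc.forward_streaming(stream))
print(len(blocks), blocks[0].shape)  # 4 blocks; the last one is shorter
```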
To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE).
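The interleaving idea can be sketched as follows: tokens from both modalities are ordered on a shared timeline, chunk by chunk, and each receives a temporal position ID derived from its timestamp, so co-occurring audio and video get matching positions. The 2-second chunk size and 25-IDs-per-second granularity below are assumptions for illustration, and the real TMRoPE additionally carries height and width components for visual tokens.

```python
# Toy sketch of time-aligned interleaving: within each time chunk, video
# tokens are emitted first, then audio tokens, and every token's temporal
# position ID comes from its timestamp on the shared timeline.
# Chunk size and ID granularity are illustrative assumptions.

def interleave_by_time(video, audio, chunk=2.0, ids_per_sec=25):
    """video/audio: lists of (timestamp_seconds, token). Returns
    [(temporal_position_id, modality, token), ...] ordered chunk by chunk."""
    out, t = [], 0.0
    horizon = max([ts for ts, _ in video + audio], default=0.0)
    while t <= horizon:
        for ts, tok in video:
            if t <= ts < t + chunk:
                out.append((int(ts * ids_per_sec), "video", tok))
        for ts, tok in audio:
            if t <= ts < t + chunk:
                out.append((int(ts * ids_per_sec), "audio", tok))
        t += chunk
    return out

video = [(0.0, "v0"), (1.0, "v1"), (2.0, "v2")]
audio = [(0.0, "a0"), (0.5, "a1"), (1.0, "a2"), (2.0, "a3")]
# v1 and a2 (both at t=1.0) receive the same temporal position ID, 25.
print(interleave_by_time(video, audio))
```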
To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, the Thinker functions as a large language model tasked with text generation, while the Talker is a dual-track autoregressive model that directly consumes the Thinker's hidden representations to produce audio tokens as output. Both the Thinker and the Talker are designed for end-to-end training and inference.
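A minimal sketch of this division of labor, with hypothetical module names and shapes: the Thinker emits text logits plus hidden states, and the Talker predicts audio tokens while attending to those hidden states rather than to the sampled text. Causal masking and the Talker's dual-track input layout are simplified away here; this is not the report's actual implementation.

```python
import torch
import torch.nn as nn

class Thinker(nn.Module):
    """Stand-in for the text LLM: returns text logits and hidden states."""
    def __init__(self, vocab=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, input_ids):
        hidden = self.backbone(self.embed(input_ids))
        return self.lm_head(hidden), hidden  # logits + conditioning for Talker

class Talker(nn.Module):
    """Stand-in for the speech model: audio tokens from Thinker states."""
    def __init__(self, audio_vocab=8192, dim=512):
        super().__init__()
        self.audio_embed = nn.Embedding(audio_vocab, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.audio_head = nn.Linear(dim, audio_vocab)

    def forward(self, audio_ids, thinker_hidden):
        # Cross-attend to the Thinker's hidden states, not its text output,
        # so speech generation does not disturb text generation.
        x = self.decoder(self.audio_embed(audio_ids), memory=thinker_hidden)
        return self.audio_head(x)

thinker, talker = Thinker(), Talker()
text_logits, hidden = thinker(torch.randint(0, 32000, (1, 16)))
audio_logits = talker(torch.randint(0, 8192, (1, 8)), hidden)
print(text_logits.shape, audio_logits.shape)
```

Conditioning speech synthesis on hidden representations rather than on sampled text is what lets the two output streams proceed concurrently without interfering with each other.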
For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial packet delay.
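The receptive-field restriction can be pictured as a banded attention mask: each position attends only to a bounded window of past and future neighbors, so the first audio chunk can be synthesized without waiting for the whole token sequence. The window sizes below are illustrative, not the report's configuration.

```python
import torch

def sliding_window_mask(seq_len: int, lookback: int, lookahead: int) -> torch.Tensor:
    """Boolean mask where True marks positions a query may attend to.
    Bounding the lookahead limits how much future audio each frame needs,
    which caps the delay before the first packet can be emitted."""
    idx = torch.arange(seq_len)
    rel = idx[None, :] - idx[:, None]   # rel[i, j] = j - i
    return (rel >= -lookback) & (rel <= lookahead)

mask = sliding_window_mask(seq_len=10, lookback=2, lookahead=1)
print(mask.int())
# Row i attends only to tokens i-2 .. i+1; with an unrestricted receptive
# field, decoding frame 0 would have to wait for the entire sequence.
```

With such a mask, the wait before the first packet scales with the lookahead window rather than with the length of the utterance.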
Qwen2.5-Omni is comparable to the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks such as Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.