Qwen2.5-Omni Technical Report
March 26, 2025
Authors: Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, Junyang Lin
cs.AI
Abstract
In this report, we present Qwen2.5-Omni, an end-to-end multimodal model
designed to perceive diverse modalities, including text, images, audio, and
video, while simultaneously generating text and natural speech responses in a
streaming manner. To enable the streaming of multimodal information inputs,
both audio and visual encoders utilize a block-wise processing approach. To
synchronize the timestamps of video inputs with audio, we organize the audio
and video sequentially in an interleaved manner and propose a novel position
embedding approach named TMRoPE (Time-aligned Multimodal RoPE). To concurrently
generate text and speech while avoiding interference between the two
modalities, we propose the Thinker-Talker architecture. In this framework,
Thinker functions as a large language model tasked with text generation, while
Talker is a dual-track autoregressive model that directly utilizes the hidden
representations from the Thinker to produce audio tokens as output. Both the
Thinker and Talker are designed for end-to-end training and inference. For
decoding audio tokens in a streaming manner, we
introduce a sliding-window DiT that restricts the receptive field, aiming to
reduce the initial packet delay. Qwen2.5-Omni performs comparably to the similarly
sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni
achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench.
Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following
is comparable to its capabilities with text inputs, as evidenced by benchmarks
such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming
Talker outperforms most existing streaming and non-streaming alternatives in
robustness and naturalness.
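
To make the time-aligned interleaving concrete, here is a minimal, hypothetical Python sketch of how audio and video chunks might be merged by timestamp and assigned shared temporal position indices in the spirit of TMRoPE. The chunk length, the 25 Hz temporal grid, and every name in the code (Chunk, interleave_by_time, CHUNK_SECONDS) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

CHUNK_SECONDS = 2.0  # assumed length of each interleaving chunk (seconds)

@dataclass
class Chunk:
    modality: str   # "audio" or "video"
    start: float    # chunk start time in seconds
    tokens: list    # placeholder token ids for this chunk

def interleave_by_time(audio_chunks, video_chunks):
    """Merge audio and video chunks into one sequence ordered by start time
    and give every token a temporal position id derived from absolute time,
    so tokens that co-occur in time share the same temporal index."""
    merged = sorted(audio_chunks + video_chunks, key=lambda c: c.start)
    sequence, temporal_pos = [], []
    for chunk in merged:
        step = CHUNK_SECONDS / max(len(chunk.tokens), 1)
        for i, tok in enumerate(chunk.tokens):
            sequence.append(tok)
            # quantize absolute time onto an assumed 25 Hz temporal grid
            temporal_pos.append(int((chunk.start + i * step) * 25))
    return sequence, temporal_pos

audio = [Chunk("audio", 0.0, [1, 2, 3]), Chunk("audio", 2.0, [4, 5, 6])]
video = [Chunk("video", 0.0, [7, 8]), Chunk("video", 2.0, [9, 10])]
seq, pos = interleave_by_time(audio, video)
print(seq)  # tokens interleaved chunk by chunk in time order
print(pos)  # co-occurring audio and video tokens share temporal ids
```

Tokens from different modalities that cover the same instant receive the same temporal index, which is what lets a rotary position embedding align video frames with the accompanying audio.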
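Similarly, the receptive-field restriction of the streaming audio decoder can be sketched as a sliding-window attention mask. This is an illustrative sketch only: the window sizes are made up, and the paper's sliding-window DiT operates on audio codec tokens rather than this toy sequence.

```python
import numpy as np

def sliding_window_mask(seq_len, lookback, lookahead):
    """Boolean attention mask: position i may attend only to positions j
    with i - lookback <= j <= i + lookahead."""
    idx = np.arange(seq_len)
    rel = idx[None, :] - idx[:, None]   # rel[i, j] = j - i
    return (rel >= -lookback) & (rel <= lookahead)

# e.g. 8 positions, 2 frames of history, 1 frame of lookahead
print(sliding_window_mask(8, lookback=2, lookahead=1).astype(int))
```

Because each position attends to only a bounded lookahead, the first audio packet can be synthesized before the rest of the sequence exists, which is how limiting the receptive field reduces initial latency.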