Qwen2.5-Omni Technical Report

Abstract

In this report, we present Qwen2.5-Omni, an end-to-end multimodal model that perceives diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable streaming of multimodal inputs, both the audio and visual encoders use a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we arrange the audio and video sequentially in an interleaved manner and propose a novel position embedding approach named TMRoPE (Time-aligned Multimodal RoPE). To generate text and speech concurrently while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model responsible for text generation, while Talker is a dual-track autoregressive model that directly uses the hidden representations from Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and to run inference in an end-to-end manner. To decode audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni performs comparably to the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks such as Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its performance with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. For speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
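The idea of restricting the receptive field with a sliding window can be illustrated with a small attention-mask sketch. This is only an illustration under stated assumptions, not the report's implementation: the abstract gives no block size or window width, so the function name `sliding_window_block_mask` and the parameters `block_size`, `lookback_blocks`, and `lookahead_blocks` are hypothetical. Tokens are grouped into fixed-size blocks, and each token may attend only to tokens in nearby blocks, so decoding a block never has to wait for the whole sequence.

```python
def sliding_window_block_mask(num_tokens, block_size, lookback_blocks, lookahead_blocks):
    """Build a boolean attention mask with a block-level sliding window.

    mask[i][j] is True when token i may attend to token j.
    All parameter names and values are illustrative assumptions;
    the report's abstract does not specify them.
    """
    if num_tokens < 0 or block_size <= 0:
        raise ValueError("num_tokens must be >= 0 and block_size must be > 0")
    if lookback_blocks < 0 or lookahead_blocks < 0:
        raise ValueError("window sizes must be non-negative")

    mask = []
    for i in range(num_tokens):
        query_block = i // block_size  # block index of the attending token
        row = []
        for j in range(num_tokens):
            key_block = j // block_size  # block index of the attended token
            offset = key_block - query_block
            # Visible only if the key block lies inside the window.
            row.append(-lookback_blocks <= offset <= lookahead_blocks)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # 8 tokens, blocks of 2, one block of history, no lookahead.
    for row in sliding_window_block_mask(8, 2, 1, 0):
        print("".join("x" if visible else "." for visible in row))
```

Because each row has a bounded number of visible positions, the context needed to decode any block is fixed, which is the property that makes a bounded startup delay possible in a streaming decoder.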
