Qwen2.5-VL Technical Report
February 19, 2025
Authors: Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin
cs.AI
Abstract
We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language
series, which demonstrates significant advancements in both foundational
capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap
forward in understanding and interacting with the world through enhanced visual
recognition, precise object localization, robust document parsing, and
long-video comprehension. A standout feature of Qwen2.5-VL is its ability to
accurately localize objects using bounding boxes or points. It provides robust
structured data extraction from invoices, forms, and tables, as well as
detailed analysis of charts, diagrams, and layouts. To handle complex inputs,
Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding,
enabling it to process images of varying sizes and videos of extended durations
(up to hours) with second-level event localization. This allows the model to
natively perceive spatial scales and temporal dynamics without relying on
traditional normalization techniques. By training a native dynamic-resolution
Vision Transformer (ViT) from scratch and incorporating Window Attention, we
reduce computational overhead while maintaining native resolution. As a result,
Qwen2.5-VL excels not only in static image and document understanding but also
as an interactive visual agent capable of reasoning, tool usage, and task
execution in real-world scenarios such as operating computers and mobile
devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases
from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model
matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly
excelling in document and diagram understanding. Additionally, Qwen2.5-VL
maintains robust linguistic performance, preserving the core language
competencies of the Qwen2.5 LLM.
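As a rough illustration of why Window Attention can reduce computational overhead at native resolution, the sketch below counts query-key pairs for full self-attention versus attention restricted to local windows. It is not the authors' implementation: the function name `attention_pairs`, the non-overlapping tiling, the 64x64 patch grid, and the 8x8 window size are all assumptions chosen for the example and are not stated in the abstract.

```python
def attention_pairs(grid_h, grid_w, window=None):
    """Count query-key pairs for self-attention over a grid of patches.

    window=None means full attention over all patches. Otherwise attention
    is restricted to non-overlapping window x window tiles (an assumption
    made for this sketch); tiles at the edges may be smaller.
    """
    if window is None:
        n = grid_h * grid_w
        return n * n
    total = 0
    for top in range(0, grid_h, window):
        for left in range(0, grid_w, window):
            h = min(window, grid_h - top)
            w = min(window, grid_w - left)
            # Each patch in a tile attends only to patches in the same tile.
            total += (h * w) ** 2
    return total


# Hypothetical 64x64 patch grid with 8x8 windows.
full = attention_pairs(64, 64)
windowed = attention_pairs(64, 64, window=8)
print(full, windowed, full // windowed)
```

Under these assumptions the full-attention cost grows quadratically with the number of patches, while the windowed cost grows linearly with it for a fixed window size, which is the general reason windowing helps when images are kept at their native resolution.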