VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning
May 17, 2025
Authors: Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, Jiaya Jia
cs.AI
Abstract
Large vision-language models exhibit inherent capabilities to handle diverse
visual perception tasks. In this paper, we introduce VisionReasoner, a unified
framework capable of reasoning and solving multiple visual perception tasks
within a shared model. Specifically, by designing novel multi-object cognitive
learning strategies and systematic task reformulation, VisionReasoner enhances
its reasoning capabilities to analyze visual inputs, and addresses diverse
perception tasks in a unified framework. The model generates a structured
reasoning process before delivering the desired outputs in response to user
queries. To rigorously assess unified visual perception capabilities, we
evaluate VisionReasoner on ten diverse tasks spanning three critical domains:
detection, segmentation, and counting. Experimental results show that
VisionReasoner achieves superior performance as a unified model, outperforming
Qwen2.5VL by relative margins of 29.1% on COCO (detection), 22.1% on ReasonSeg
(segmentation), and 15.3% on CountBench (counting).
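
The gains above are stated as relative margins, meaning the improvement is measured as a fraction of the baseline score rather than as an absolute difference in points. A minimal sketch of that arithmetic follows. The function name `relative_margin` and the example scores are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def relative_margin(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Hypothetical scores for illustration only (NOT results from the paper).
baseline_score = 50.0
improved_score = 60.0

# An absolute gain of 10 points on a baseline of 50 is a 20% relative margin.
print(f"{relative_margin(baseline_score, improved_score):.1f}%")
```

A relative margin of 29.1% therefore does not mean the score rose by 29.1 points; the absolute gain depends on the baseline score for each benchmark.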