VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning
May 17, 2025
Authors: Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, Jiaya Jia
cs.AI
Abstract
Large vision-language models exhibit inherent capabilities to handle diverse
visual perception tasks. In this paper, we introduce VisionReasoner, a unified
framework capable of reasoning and solving multiple visual perception tasks
within a shared model. Specifically, by designing novel multi-object cognitive
learning strategies and systematic task reformulation, VisionReasoner enhances
its reasoning capabilities to analyze visual inputs, and addresses diverse
perception tasks in a unified framework. The model generates a structured
reasoning process before delivering the desired outputs in response to user
queries. To rigorously assess unified visual perception capabilities, we
evaluate VisionReasoner on ten diverse tasks spanning three critical domains:
detection, segmentation, and counting. Experimental results show that
VisionReasoner achieves superior performance as a unified model, outperforming
Qwen2.5VL by relative margins of 29.1% on COCO (detection), 22.1% on ReasonSeg
(segmentation), and 15.3% on CountBench (counting).
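
The abstract reports gains as relative margins rather than absolute point differences. A minimal sketch of that calculation follows, assuming the usual definition (improvement divided by the baseline score). The function name `relative_gain` and the scores used are hypothetical and are not taken from the paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Hypothetical scores for illustration only; these are not results from the paper.
baseline_score = 40.0
improved_score = 50.0

gain = relative_gain(baseline_score, improved_score)
print(f"{gain:.1f}% relative gain")  # prints "25.0% relative gain"
```

In this example the absolute difference is 10 points while the relative margin is 25%, which is why the two kinds of figures should not be compared directly.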