VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning

Abstract

Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks. In this paper, we introduce VisionReasoner, a unified framework capable of reasoning about and solving multiple visual perception tasks within a shared model. Specifically, by designing novel multi-object cognitive learning strategies and a systematic task reformulation, VisionReasoner strengthens its ability to reason about visual inputs and addresses diverse perception tasks in a unified framework. The model generates a structured reasoning process before delivering the desired outputs in response to user queries. To rigorously assess unified visual perception capabilities, we evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting. Experimental results show that VisionReasoner achieves superior performance as a unified model, outperforming Qwen2.5VL by relative margins of 29.1% on COCO (detection), 22.1% on ReasonSeg (segmentation), and 15.3% on CountBench (counting).
