V-Thinker: Interactive Thinking with Images
November 6, 2025
Authors: Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Honggang Zhang
cs.AI
Abstract
Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in the field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions (diversity, quality, and difficulty); and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
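To make the "image-interactive thinking" loop described in the abstract concrete, the sketch below shows one plausible shape such a loop could take: the model alternates between emitting visual tool calls (here, a crop that zooms into a region) and reasoning over the accumulated views until it commits to an answer. Everything in it (model_step, CropAction, the stubbed decision logic) is an illustrative assumption, not V-Thinker's actual interface or training setup.

```python
# Minimal sketch of an image-interactive reasoning loop in the spirit of
# "Thinking with Images". All names (model_step, CropAction, Answer) are
# illustrative assumptions, not V-Thinker's actual API.
from dataclasses import dataclass
from PIL import Image


@dataclass
class CropAction:
    box: tuple  # (left, upper, right, lower) in pixels


@dataclass
class Answer:
    text: str


def model_step(question, views):
    """Stand-in for the LMM policy: given the question and the image views
    gathered so far, emit either a visual action or a final answer.
    A real system would query the model here."""
    if len(views) < 2:
        w, h = views[-1].size
        # Zoom into the central region as a placeholder decision.
        return CropAction(box=(w // 4, h // 4, 3 * w // 4, 3 * h // 4))
    return Answer(text="<answer grounded in the inspected regions>")


def interactive_reasoning(question, image, max_turns=4):
    views = [image]
    for _ in range(max_turns):
        action = model_step(question, views)
        if isinstance(action, Answer):
            return action.text
        # Execute the visual tool: the cropped region becomes a new view
        # the model can attend to on its next reasoning turn.
        views.append(views[-1].crop(action.box))
    return "no final answer within the turn budget"


# Usage with a blank placeholder image (a real run would load a photo).
print(interactive_reasoning("What is written on the sign?", Image.new("RGB", (640, 480))))
```

The key design point this illustrates is that image operations live inside the reasoning loop rather than as a one-shot preprocessing step, which is the shift from image-assisted reasoning to image-interactive thinking that the abstract describes.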